-- HarryRenshall - 06 Jun 2008

Week of 080609

Open Actions from last week:

Daily CCRC'08 Call details

To join the call, at 15.00 CET Tuesday to Friday inclusive (usually in CERN building 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Simone, Roberto, Maria, Jamie, Nick, Harry);remote(Gonzalo).

elog review:

Experiments round table:

  • ATLAS (Simone): FDR exercise not yet finished. RAW data shipped Friday (after validation of merging step). Night of Fri - reconstruction should have been done. 1st reconstructed data only started flowing 3-4pm Sat. Computing model + data replicated on disk at CERN etc. Problem in merging of raw files. AODs and ESDs rather pointless - meeting today at 4. Probably re-merge, re-export, re-reconstruct & re-export(!). Means export exercise 'pointless' from internal ATLAS point of view. Re-reconstruction will be at Tier0, not Tier1s. Re-reconstruction at Tier1s in following weeks. Need some s/w changes to permit this.

  • LHCb (Roberto): last week quiet. This week - restart full chain. Not clear when. As soon as confirmation from DIRAC experts that all new features ready. More news tomorrow - briefing tomorrow morning. Most likely restart full nominal rate activities later this week.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

  • Internal post-mortem on recovering from power cut. Affected new h/w - on mixed power config. 1/2 on physics power, 1/2 on critical. All Fibre Channel + network should be on critical. Power cut should just downgrade performance but not impact availability. RAC5 worked as expected - services reallocated to working nodes. RAC6 - much more complex - severe downtime. Many problems. CMS: 1 hour downtime, 3 hours reduced. LCG: 2 hours down, 2 hours v bad performance. LHCb: 1 hour down. RAC6 - servers + network switches went down. Wrongly connected to physics power. General public network switch down. Internal switch - all cluster fails over to single node. CMS survived on single node. LCG survived but badly due to heavy load. LHCb survived on 1 node. LCG RAC - problem with LCG SAM - wrongly defined to be run on nodes which went down after power cut(!). Unavailability even longer. Moving services to node that was up - human error, further downtime. Hence 2 hours down, 2 hours poor performance. Many lessons learned - 1) wrongly connected to power 2) SAM wrongly assigned without failover 3) bug in automatic startup procedures of OEM agents which gave problems when services restarted. Conclude: also not called at time of power cut. To be reviewed...

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Roberto, Simone, Maria, Jamie, Harry);remote().

elog review:

Experiments round table:

  • LHCb (Roberto): briefing with DIRAC3 devs this morning - when to restart CCRC? Conclusion: to restart in ~1 week need intermediate version with just some of the features foreseen for data taking. Full list in PASTE meeting of this morning. Prefer version similar to that of May to avoid long interruption in Grid activities and long ramp-up thereafter. Most likely restart next week. Analysis of logs done - presentation by Nick this morning. LFC issue? Not seen during run - check logs. Generic message "no valid credentials found". AMGA - to be dropped in favour of new book-keeping service accessing Oracle directly.

  • ATLAS (Simone): FDR data distribution to be redone (as described yesterday). First data should come tonight, reconstructed etc. Finish end week.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

  • Continue to have high load on CMS RAC from dashboard app. Also affecting other applications. Should be addressed before startup(!). Had to kill sessions today. Need alternative contacts.

Monitoring / dashboard report:

  • Need contacts - see above.

  • SAM / GridView tests for LHCb - discrepancies to be followed up.

Release update:

AOB:

Wednesday:

No call today - see GDB.

Thursday:

No call today - see CCRC'08 post-mortem workshop.

Friday:

No call today - see CCRC'08 post-mortem workshop.

Topic revision: r4 - 2008-06-10 - JamieShiers
 