Week of 090615

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability: ALICE, ATLAS, CMS, LHCb
SIRs & Broadcasts: WLCG Service Incident Reports, Broadcast archive

GGUS section

GGUS Escalation reports every Monday (used for WLCG Service Report to MB)
- Full search: https://gus.fzk.de/ws/ticket_search.php and select "VO" - this will give all tickets, including team & alarm
LHCOPN Tickets in GGUS: see https://gus.fzk.de/pages/all_lhcopn.php and change your selection criteria. Future actions are also listed.
- If network group participation is necessary, please invite them in time.

Procedure to become a LHC Experiment VO TEAM member
Other recent GGUS FAQs

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here

General Information
CERN IT status board	M/W PPSCoordinationWorkLog	WLCG Baseline Versions	Weekly joint operations meeting minutes

Additional Material:

STEP09	ATLAS	ATLAS logbook	CMS WLCG Blogs

Monday:

Attendance: local(Simone, Alessandro, Jean-Philippe, Jamie, Gang, Miguel, Eva, Ignacio, Shaun, Steve, Harry, MariaDZ, Julia, Roberto, Patricia, Dirk);remote( Gonzalo, Daniele, Gareth, Brian).

Experiments round table:

ATLAS - (Simone) No access to detailed notes due to the current Twiki outage (also difficulties for external people to join the phone conf). STEP closed for ATLAS on Fri, started run down at 9:00. All analysis jobs killed, prod jobs stopped, mostly off by 11:00 with some tails (~two days). Still network routing problems between CNAF and TRIUMF. 11 T2s showed problems - most of them just ran out of disk space; causes for the remaining four T2s will be described in the ATLAS post mortem (being prepared). Functional tests will restart on Thu. Data deletion will start tomorrow. Some constraints (eg RAL machine move). Shaun: will start with worker nodes on Wed. Castor should be available until Mon. Next week ATLAS will restart production activities.

CMS reports - (Daniele) CMS miss their twiki too. Continued with T0 test in STEP (w/o ATLAS) with more than 1GB/s to tape. Backlog digested nicely by CASTOR over a 12h period. Tier 1: 200MB/s achieved at RAL, T1 CPU efficiency studies ongoing. T1-T1 transfer tails almost finished, T1-T2 more details in a STEP report (being prepared). Analysis continues for a few more days.

ALICE - (Patricia) No issues for production over the weekend. Transfers successful with a peak of 500MB/s during the weekend. On Fri: 6 T1s performed ok. Sat: problems at FZK (also today) with low (72%) efficiency. Ticket 49488 submitted (gridftp restarts followed up with site). Now the efficiency is increasing. The problems with FZK are not show stoppers for ALICE as other sites took over. Due to this the distribution to T1s is not homogeneous right now.

LHCb reports - (Roberto) Planned reprocessing with SQLite replicas (replacement due to Persistency/LFC problem) was stopped as the SQLite replicas did not contain the correct magnetic field. LHCb therefore declared the end of their STEP activities and will redo the scalability tests for Persistency once a patch is available. Otherwise MC activity ongoing with minimum bias events - 7000 jobs and no issues.

Sites / Services round table:

Gonzalo/PIC: network glitch of 1h from around 5:00 UTC.

Michel/GRIF: no issues

Eva/CERN: apply failures for ATLAS at several T1s caused by a deadlock. LFC propagation problem to GridKA came back even after the switch to archive log based replication. Being escalated to Oracle support.

Miguel/CERN: ALICE unavailability for 1h on Sat. A patch has been released by CASTOR development and will be deployed to all instances in the next few days.

Gang/ASGC: SAM showed low availability for ASGC during the weekend. The cause was too few jobs for the SAM tests.

Julia/CERN: last week siteview was reinstalled - Julia will distribute the new link. Pages can be bookmarked now and the topology DB is used. Remaining issues with the display of T3 sites (not in the topology DB) are being worked on.

Gareth,Shaun/RAL: working through backlog problems at RAL

Brian/RAL: work on BNL to RAL channel to increase throughput (tcp window size optimisation).

AOB:

MariaDZ/CERN: Concerning the mechanism for notification of VO experts by trusted site admins (request from ATLAS), a suggestion is now available in the ticket (https://savannah.cern.ch/support/?108277#comment7). All VOs should review the proposal and, if interested, respond with the corresponding e-groups / lists for source and destination.

Tuesday:

Attendance: local(Miguel, Gavin, Sophie, Olof, Alberto, Simone, Harry, Jamie, Jean-Philippe, Gang, MariaDZ, Eva, Andrea, Nick, Patricia, Roberto, Diana, Markus, Dirk);remote(Gonzalo/PIC, Jeremy/GridPP, Fabio/IN2P3, Gareth/RAL, Brian/RAL, Angela/FZK).

Experiments round table:

ATLAS - (Simone) Overview for this and next week: deletion of STEP data started at T0/1/2; deletion of 5PB of data not expected to finish before end of the week. Next week: detector cosmic run, expect 300TB of cosmics at CERN, AOD and DPD of 20 TB each will be transferred, shares from STEP will be used. More detailed/confirmed numbers later this week. Tests this week: new LFC version with bulk methods at TRIUMF (long rtt site to show improvement). Stress test for xroot at CERN. Meeting with ATLAS analysis people at CERN for T3 pool, test will include concurrent transfers. RAL will not participate in cosmic run (scheduled down time). ASGC plans for data center move? Gang: move will start June 3rd. Conclusion: ASGC will also not participate in cosmic run.

CMS reports - (Daniele) Twiki came back at 18:00 and CMS lost some documentation. STEP ramp down and preparation for post mortem continuing. Acknowledge help from dashboard team. T1-T2 transfer tests and analysis submission still going on for this week. Plan to increase analysis rate (so far read errors are the dominant problem).

ALICE - (Patricia) Production is going on smoothly. FZK problem is solved and the performance is fine now. New ticket 49514: transfers show as “ready”, but do not proceed - FTS team will follow up.

LHCb reports - (Roberto) Since yesterday LHCb restarted transfers and reconstruction at T1 (without coral credential from LFC - patch applied on LHCb side). Reprocessing, staging from tape, several MC productions with some 8k jobs running - no issues. Jeremy asks about job slots at T2: should we expect more jobs? Roberto: will check. Gonzalo: is the current LHCb activity still part of STEP09? Roberto: no, LHCb declared STEP finished on Fri.

Sites / Services round table:

Gonzalo/PIC: noticed a failure with dcap access to local storage during the weekend (load balancing config). Daniele confirms that CMS has seen this problem as well.

Fabio/IN2P3: CMS reprocessing test w/o prestaging (need to take into account that test was done without competition).

Angela/FZK: Tomorrow conditions DB for ATLAS and LHCb as well as CMS squid servers will be down (was there an announcement before? Re: Announced to local VO representatives on Monday - sent broadcast on Tuesday). Sporadic problems with CPU calculation for the information system - not yet understood. Also don’t understand reason for low efficiency periods, but suspect data access problems.

Eva/CERN: OS problems with RH5 for DB clusters understood - migration of integration and preproduction DB clusters can now take place. The problem with failing ATLAS apply processes at T1 sites is now understood. ATLAS has fixed their population scripts.

AOB:

Roberto - some degradation shown in SAM due to sensor problems (ran out of disk quota), not due to service problems.

Gareth: clarification for RAL scheduled down time (via email): Wednesday 17th June - start WMS drain; Saturday 20th June start drain of CEs. Thursday 25th June: Stop all remaining services. Some key services will be restored that day, others on Friday 26th. Monday 6th July: Castor and batch system restart. Full details at: http://www.gridpp.rl.ac.uk/blog/2009/05/14/schedulemovenewbuilding/

MariaDZ: How to protect against twiki outage / data loss? To be followed up in CERN service meetings.

Wednesday:

Attendance: local(Jamie, Julia, Gang, Harry, Antonio, Jean-Philippe, Nick, MariaDZ, Diana, Olof, Dirk);remote(John/RAL, Michael/BNL, Daniele/CMS, Angela/FZK, Fabio/IN2P3).

Experiments round table:

ATLAS - (Simone) No issues to report.

CMS reports - (Daniele) CMS has declared their STEP activities over as of yesterday. Few T2 and T0 tests still ongoing but outside STEP. CMS is open to proposals for multi-VO tests.

ALICE - report will be added later today (Patricia). Both CREAM-CEs at CERN are performing badly at submission time. More details are included in the corresponding GGUS ticket 49553, submitted this morning.

LHCb reports - report on the LHCb twiki linked.

Sites / Services round table:

John/RAL : batch farm down from 3-8am - cause under investigation.
Michael/BNL: successful upgrade of main switch fabric yesterday - very promising improvements for multi-cast traffic. dCache upgrade to 1.9.1-8: fixes issue with disabled pools, replicate on arrival, hsm cleanup registration, fixed issue in lru garbage collection, support for pin migration. BNL will move tomorrow to OSG site integration (eg goc DB, info system) - old links to EGEE services will be deprecated. MariaDZ: move may create issue with contact email for notification (for automated GGUS ticket forward). Michael: has been discussed with OSG and will follow up with MariaDZ offline. Simone: info from BNL will be in the OSG BDII. Will this be transparent? Michael: yes, should be.

Angela/FZK: Conditions DB intervention done. Transparent intervention for LFC/FTS upcoming tomorrow. It appears that the FTS notification email yesterday was not sent. Nick: please file a GGUS ticket. FZK had some PNFS outage today.

Fabio/IN2P3: some LHCb sites appear as banned in gridmap, but are usable. Being investigated with LHCb.
Michel/GRIF: long shutdown from tomorrow until Mon
Gang/ASGC: problem with use of fairshare during STEP now understood. Solution will be verified after CC move next month.
Harry/CERN: a large number of local accounts (3.5k) used for WLCG will have their passwords expire soon. Will force this for a small subset of accounts (dteam) in the next few days to study the impact.
Antonio: gLite 3.2 released with UI on SL5. Important change of release process: future SL5 middleware releases will bypass pre-production deployment tests. This follows the proposed EGI deployment model and should be tested now (with new services).

AOB: (MariaDZ) ATLAS and CMS please comment on https://savannah.cern.ch/support/?107531#comment27

Thursday:

Attendance: local( Gang, Steve, Harry, Ignacio, Julia, Roberto, Andreas, MariaG, Simone, MariaDZ, Diana, Olof, Tony, Dirk);remote(Michael/BNL, John/RAL, Jeremy/GridPP, Angela/FZK).

The meeting started with a report about the power cut at CERN: below in service part

Experiments round table:

ATLAS - (Simone) ATLAS experts checking services after power cut. Still no contact to some voboxes. Ignacio: voboxes restart is still ongoing. As database stayed up during the power cut, ATLAS will now check that all DB front-ends have re-connected properly to their respective DB server.

CMS reports - No report today (as Daniele already announced yesterday). Julia: CMS dashboard and backend DB are ok, the MonALISA service went down and has restarted automatically (approximately 1/2h loss of monitoring data). MariaG: online DB was not affected but the CMS server with the Oracle portal went down, which affected CMS operations.

ALICE - (will be added by Patricia). ALICE also considers the STEP09 exercise finished for them. The CREAM-CE at CERN is not performing properly on any of its nodes. Ticket 49553 was submitted yesterday morning. For the moment there is no progress on this ticket.

LHCb reports - (Roberto) Central DIRAC setup still partially down. Before the power cut: smooth reprocessing and full FEST reconstruction and MC production. Investigating best strategy for moving files between space tokens at CNAF: issue with path dependency on space token may require LFC intervention for renaming. Intermittent data access failures with root protocol at RAL are now understood and RAL has increased the service limit. LHCb is now checking if this resolved their issues. Open GGUS ticket for shared area access at FZK. Angela: addressing the issue by mounting the area on WNs as read-only. Also issues with shared area access at CNAF: site will closely monitor GPFS as the issue is currently gone. Concerning the recent question by Jeremy about the number/usage of job slots: LHCb has found an increased load due to more frequent heart beat exchanges, which currently limits the DIRAC submission rate. LHCb will address this in order to use the available slots. Julia: what capacity would you like to have? Roberto: need to be able to fill up the system, but no concrete numbers yet.

Sites / Services round table:

Tony, Ignacio/CERN: two short power cuts around 12.15 and a longer one at 12.40. During the cut at 12.40 the UPS batteries ran out just before the power came back. Non-critical services in the CC were affected including the castor services for the LHC experiments. Services were restarted at 13.30 and came back normally. Castor/ATLAS, due to a problem with a switch after the restart, took some additional time. Production databases (castor and experiment DBs online and offline) were not affected by the power cut. A service incident report has been produced at https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090618. The reason for the power cut is not yet fully known. An official analysis is expected early next week.

Michael/BNL: Move to OSG membership done last night and the site is now reporting, including site availability, through OSG channels. ATLAS VO specific tests are executed consistently. Thanks to Maarten Litmaath for his help.

John/RAL: Successfully performed scheduled network intervention.

Angela/FZK: the email notification problem was understood: synchronisation between gic portal and goc db was switched off one week ago, now back.

AOB: (MariaDZ) The templates for submitting TEAM and ALARM GGUS tickets by email are now these: https://savannah.cern.ch/file/GHD_CreateAlarmMail_v2.txt?file_id=9556 and https://savannah.cern.ch/file/GHD_CreateTeamMail_v3.txt?file_id=9557 Please give these URIs to TEAMers and ALARMers in your experiment. We'll try to do it too but you know them better.

Friday:

Attendance: local(Andrea, Gang, Patricia, Nick, Miguel, Ignacio, Jean-Philippe, Olof, Alessandro, Simone, Roberto, Dirk);remote(Michael, Derek, Jeremy, Angela).

Experiments round table:

ATLAS - (Alessandro) ASGC functional test ok now, RAL left out from functional tests due to scheduled down time. UK cloud will test backup solution in case T1 is down, eg Glasgow will get all AOD and DPD and act as temporary T1. Need CERN FTS expert to implement this - can wait until Mon.

CMS reports -

ALICE - (Patricia) ALICE task force meeting held on STEP outcomes. ALICE has now stopped transfers and the FTS test. A detailed report is being prepared for the STEP post mortem workshop. The ticket concerning CREAM at CERN (49553) was created two days ago. Ignacio: Received ticket today and passed to expert. ALICE will follow up. One ALICE production VO box is still dead after the power cut (vendor call) and will not be recoverable until next week. ALICE is using the back-up box. Olof: Instructions for how to contact ALICE should be added to the operators manual. ALICE will work on this.

LHCb reports - (Roberto) Remaining FEST and 1st pass reconstruction jobs running. Observed some crashes at several T1 sites. The root cause has not been found, as logging information is not available due to a full log disk. MC physics production still running smoothly with 5k jobs in the system. LHCb received a proposal to upgrade the LHCb SRM setup (srm-lhcb.cern.ch) to 2.7-18. The suggested intervention on Tue would work for LHCb. Oracle based conditions DB has now been tested at scale. A report will be attached to the LHCb twiki. CNAF: SQLite problems on shared area due to NFS: will get rid of this problem with the next Gaudi framework release, which copies the file locally.

Sites / Services round table:

Derek/RAL: batch system drain for scheduled outage starting.

AOB:

Andrea/CERN : A service incident report for the Persistency/LFC problems has been produced at https://twiki.cern.ch/twiki/bin/view/PSSGroup/LFCReplicaSvcPostMortem

-- JamieShiers - 11 Jun 2009

Topic revision: r12 - 2010-06-11 - PeterJones