Week of 090921

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Ricardo, MariaD, Lola, Simone, Harry(chair), Graeme, DavidG, Gang);remote(Xavier/FZK, Gareth/Ral, Michael/BNL, INFN-T1, Kyle/OSG).

Experiments round table:

  • ATLAS - (GS): SARA had a short unscheduled SRM outage on Saturday night, and there was minor degradation at RAL and ASGC over the weekend; both were quickly fixed. We have just heard that there is an unscheduled LFC downtime at ASGC, timestamped at 14.00 UTC today.

  • ALICE -

Sites / Services round table:

  • FZK - (XM): Tomorrow FZK will upgrade the dCache clients on their worker nodes at the request of LHCb, going to SRM 1.9.2-4 and dCache 1.9.3.

  • RAL - (GS): Had problems with their batch system scheduler over the weekend - being investigated (the batch system was rebuilt with the move to SLC5 last week). This morning there was a successful SRM update for ATLAS. Tomorrow morning there will be a test of the UPS in their new building from 8 to 10 BST (7 to 9 UTC), and as a precaution they are taking CASTOR down for this period.

  • CERN CASTOR - (RS): LHCb SRM will be changed tomorrow morning from using gridftp external to gridftp internal.

  • CERN CC - (OB): Advance warning of work on the chiller installation at CERN at the beginning of October. The new chillers will be connected to a mains power supply, so for a time the CC will be running on a non-redundant chiller configuration and will be at risk. The work will take place over the whole weekends of 3 October and 10 October.

  • CERN Networking - (DG): There will be maintenance on the CERN link to PIC today from 15:00 to 17:00 UTC, with a 30 minute cut. Tomorrow morning there will be maintenance on the CERN-RAL link, with loss of service from 07:00 to 09:00 UTC.

  • ASGC - (GQ): The problem stopping migration to tape of CMS files has been identified as an incorrect setting in some tape cartridges which stopped the data streaming for some time. Will have a downtime for power maintenance on Sunday 27 September.

AOB:

  • IN2P3 - (HR): A reminder that the IN2P3 electrical power upgrades start tomorrow and run until 24 September. They clarified that user access to dCache data will stop during the whole period, but incoming transfers from CERN will continue except for a 2 hour period on the morning of the 22nd (tomorrow).

  • OSG - (HR): HR mistakenly informed Kyle that Maria Dimou was in Barcelona. In fact she travels on Wednesday for a USSAG meeting and did join today's meeting. Kyle had no OSG items to raise, so the next OSG report will be next Monday.

Tuesday:

Attendance: local(Gavin, Ricardo, Miguel, Harry, Lola, Gang, Simone, MariaD, Olof);remote(Xavier/FZK, NL-T1, Graeme/ATLAS, John/RAL, Michael/BNL, INFN-T1).

Experiments round table:

  • ATLAS - (GS): 1) SARA lost their SRM again last night, having instability problems with dCache. This is the second time in a week and has been reported to the developers. 2) ASGC are having problems with tape migration in that the tape pool space token ran out of space - see GGUS ticket 51688. 3) IN2P3 is in downtime for an electrical power upgrade so has been temporarily removed from ATLAS site services. 4) A test operator alarm ticket was sent to BNL. It was quickly closed but did not raise a matching OSG alarm; MariaD volunteered to follow this up.

  • CMS reports - Some highlights in brief (EGEE'09 conference in progress). 1) CMSSW tags publishing issues, and permission problems in some T2/T3 sites (for CMSSW_3.2.6). 2) transfer issues in /Prod: Brunel T2 -> FNAL T1; CSCS T2 -> all T1s; Bari T2 -> CNAF T1; Omaha T3 -> FNAL. 3) slow responsiveness of some sites (few Russian T2's, Brasil UERJ, ..) - following up. 4) jobs stuck in scheduling phase at RWTH T2. 5) SAM tests issues affecting T2 sites: SAM analysis test failure at T2_US_UCSD.

  • ALICE -

  • LHCb reports - 1) CERN CASTOR LHCb upgrade to 2.1.8-12: a non-transparent intervention with downtime 9:00-13:00. In the same slot there is also an upgrade to the CASTOR gridftp internal configuration, which lcg_utils/gfal now supports for LHCb. 2) Intervention on the CERN LFC: upgrade to LFC version 1.7.2-4, agreed for tomorrow morning at 10:00 (taking advantage of the inactive period during the LHCb week in Florence). 3) CNAF: a problem with all data (intermittently) missing from StoRM was fixed quickly on Saturday evening. 4) IN2P3: banned due to the dCache upgrade plus the electrical power intervention. 5) SARA: banned due to the network intervention outage.

Sites / Services round table:

  • NL-T1: The SRM crash was in fact the third this week. Debug information has been sent to the developers and will be followed up in a dCache conference call later today. They have just finished a scheduled downtime for network maintenance. Nikhef had network problems on a SAN server yesterday which cut off their BDII information system server.

  • RAL : UPS tests this morning completed successfully.

  • ASGC: A CREAM CE was brought online today. Tape migration can be triggered normally, but proper media to migrate the selected data files are lacking; this is still being investigated.

  • CERN CASTOR: LHCb upgrade and move to using internal gridftp completed this morning.

  • CERN Infrastructure: The chiller installation work reported yesterday as at-risk on 3 and 10 October will in fact be at risk for the whole of this period.

AOB:

Wednesday

Attendance: local(Gang, Dirk, Ricardo, Olof, Simone, Harry(chair), MariaG);remote(Michael/BNL, Angela/FZK, Jon/RAL, Graeme/ATLAS).

Experiments round table:

  • ATLAS - (GS): 1) Experiment cosmics data is now being exported to the Tier 1 sites. 2) Some problems (inefficiencies) writing to tape at ASGC remain. Gang noted that there had been a lack of hardware, but that tape migration at ASGC had been running at 180 MB/s for the last 30 minutes. 3) Transferring some older cosmics data from RAL to ASGC is giving errors at source.

- (SC): ATLAS have noticed that the version of lcg-utils on lxplus (the SLC4 service) is 1.6, which has a known bug whereby srmcp cannot be run between two CASTOR instances. Version 1.7 is under the 'new' directory in AFS and ATLAS would like it to become the default. Ricardo will follow this up (a hedged sketch of such an SRM-to-SRM copy is given after this round table).

  • ALICE -

  • LHCb reports - 1) LHCb have asked CERN to arrange an upgrade intervention (~1h) on the SRM service, moving to version 2.8, which properly supports xroot TURLs and brings substantial operational improvements to the logging that should help in debugging any problems with the service. This quiet week could be a good time slot. 2) The intervention on the LFC is over, but authentication errors appear in the read-only LFC server log (under investigation).
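
Regarding the ATLAS lcg-utils point above: the following is a minimal, illustrative sketch (not the ATLAS procedure) of driving an SRM-to-SRM copy with lcg-cp from a 1.7.x lcg-utils installation. The AFS path, the SURLs and the helper name are placeholders; a valid grid proxy and the usual VO/BDII environment settings are assumed.

    import os
    import subprocess

    # Hypothetical AFS location of the newer lcg-utils (placeholder path);
    # prepending it to PATH picks up the 1.7 client instead of the default 1.6.
    NEW_LCG_UTILS_BIN = "/afs/cern.ch/project/gd/LCG-share/new/bin"  # assumption

    def srm_to_srm_copy(src_surl, dst_surl):
        """Copy a file between two SRM endpoints with lcg-cp in verbose mode."""
        env = dict(os.environ)
        env["PATH"] = NEW_LCG_UTILS_BIN + os.pathsep + env.get("PATH", "")
        return subprocess.call(["lcg-cp", "-v", src_surl, dst_surl], env=env)

    # Example with placeholder SURLs on two CASTOR instances:
    rc = srm_to_srm_copy(
        "srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/some/file",       # placeholder
        "srm://srm-atlas.example-t1.org/castor/example.org/atlas/some/file"  # placeholder
    )
    print("lcg-cp exit code:", rc)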

Sites / Services round table:

  • RAL (JK): The ATLAS SRM daemon died this morning. This has been tracked down to a bug; a workaround has been put in place and there will be a patch in the next SRM release. The incident caused a lot of SAM test failures.

  • CERN UI (RS): The CERN gLite AFS UI will be upgraded on 5 October, which will give ATLAS the required version of lcg-utils.

  • CERN CMS (OB): CMS grid pool accounts (CMSxxx) are being extended to 999 today. This will allow the CERN CRAB server to be restarted.

  • CERN Databases (MG): A service incident report on the failure of streams replication of the ATLAS conditions data from the CERN offline database to the Tier 1 sites has been prepared at https://twiki.cern.ch/twiki/bin/view/PDBService/StreamsPostMortem . The outage lasted from Monday 08:00 to 18:00 and was caused by the capture process aborting; the monitoring system tried to send an email alert, which failed because sendmail was overloaded. The failure was not detected by the DB on-call person but was seen by ATLAS, who sent a priority mail to the streams expert rather than to the standard Remedy support line. The follow-up has been to ensure that the on-call person regularly checks the streams replication and that ATLAS understand the correct communication flow.
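
As an illustration of the follow-up above (the on-call person regularly checking the streams replication rather than relying on a single sendmail alert), here is a minimal monitoring sketch; it is not the actual PDB procedure. It assumes cx_Oracle access with an account that can read DBA_CAPTURE, and the DSN, credentials and spool-file notification path are placeholders.

    import datetime
    import cx_Oracle  # assumes the Oracle client and cx_Oracle module are installed

    DSN = "atlas_offline_db"          # placeholder TNS alias
    SPOOL = "/var/log/streams_check"  # placeholder secondary channel, independent of sendmail

    def abnormal_capture_processes(user, password):
        """Return the Streams capture processes that are not in status ENABLED."""
        conn = cx_Oracle.connect(user, password, DSN)
        try:
            cur = conn.cursor()
            cur.execute("SELECT capture_name, status FROM dba_capture")
            return [(name, status) for name, status in cur if status != "ENABLED"]
        finally:
            conn.close()

    def notify(problems):
        """Record problems on a channel that does not depend on sendmail."""
        with open(SPOOL, "a") as f:
            for name, status in problems:
                f.write("%s capture %s is %s\n" % (datetime.datetime.utcnow(), name, status))

    if __name__ == "__main__":
        bad = abnormal_capture_processes("streams_monitor", "secret")  # placeholder credentials
        if bad:
            notify(bad)  # e.g. an ABORTED capture, as in Monday's incident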

Release report: deployment status wiki page

AOB: Dirk has sent round an email with the SIR on the RAL incident in which CASTOR disk-to-disk copies failed during a planned upgrade of their nameserver, and suggests discussing with RAL the possibility of performing rolling nameserver upgrades.

Thursday

Attendance: local(Eduardo, Ricardo, Gang, Harry(chair));remote(Gareth+Brian/RAL, Angela/FZK, Michael/BNL).

Experiments round table:

  • ATLAS -

  • CMS reports - 1) link commissioning activities for T2-T2 links for the Higgs physics groups started. 2) CMSSW tags publishing issues, and permission problems in some T2/T3 sites (for CMSSW_3.2.6). 3) Transfer issues in /Prod: Caltech -> CNAF, Nebraska -> KIT. 4) SAM tests issues affecting T2 sites: SAM analysis test failure at T2_US_UCSD; and more...

  • ALICE -

  • LHCb reports - It was agreed to proceed with CERN SRM migration to version 2.8 at 14:00 today. Gavin confirmed successful completion.
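
Regarding the xroot TURL support that motivated this SRM 2.8 migration: the following is a minimal, hypothetical sketch of asking the SRM for a transfer URL with lcg-gt. The SURL is a placeholder, and the protocol string "root" (to request an xroot TURL) is an assumption, not taken from the minutes.

    import subprocess

    # Placeholder SURL on the CERN LHCb CASTOR instance.
    SURL = "srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/some/file"  # placeholder

    def get_turl(surl, protocol="root"):
        """Ask the SRM for a transfer URL for the given protocol via lcg-gt."""
        out = subprocess.check_output(["lcg-gt", surl, protocol])
        return out.decode().splitlines()[0]  # the TURL is expected on the first output line

    if __name__ == "__main__":
        print(get_turl(SURL))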

Sites / Services round table:

  • ASGC (by email): Received many alarms for the power cycling of server nodes in the data centre, and it is clear that most of the critical Oracle servers (split over two chassis) were affected, including the CASTOR DB, LFC, 3D and FTS; access to the front-end services ceased around 20:18 UTC. This was flagged as an unscheduled downtime ending at 22:19 UTC. Any resulting experiment tickets will be checked.

  • RAL: 1) Have been seeing problems with their batch system, especially the Maui/Torque server, since migrating to SL5. They are running the 32-bit server version and would like to know if any other site has experience of running the 64-bit version, to which they could then migrate. Please send replies to Gareth.Smith@rl.ac.uk. 2) Following the email from the CERN DM group reminding sites to fill in a service incident report (for which a template was provided) for any unscheduled database service interruption of more than 4 hours, it was confirmed that this applies to all baseline and experiment-critical services.

  • BNL: 1) An LFC ACL problem was seen at the AGLT2 (Great Lakes) Tier 2 last night when reprocessing started. The problem was corrected 2 hours after it started. 2) There was also a problem overnight related to FTS proxy delegation. This will be looked at, since it should no longer happen.

  • CERN Infrastructure (GM): 1) CMS pool accounts have been extended from 199 to 999 as requested for the CMS CRAB analysis service. 2) The ATLAS SRM has been upgraded to version 2.7.19. 3) The LHCb SRM has been upgraded to 2.8.0.

AOB:

Friday

Attendance: local(Ricardo, Gang, Harry(chair), Simone);remote(Alexei/ATLAS, Michael/BNL, Gareth+Brian/RAL, Angela/FZK).

Experiments round table:

  • ATLAS - (AK): ATLAS are running reprocessing jobs from ESD data and have started data distribution between Tier 1s and between Tier 2s. ATLAS have a point for IN2P3, which has just come back from a three-day scheduled downtime followed by two days at risk during which ATLAS could not, in fact, run jobs. There is another scheduled downtime from next Monday to Wednesday for the dCache Chimera migration; rather than restart production activities now only to have them stopped on Monday, ATLAS propose to restart only after the Chimera migration.

  • CMS reports - 1) link commissioning activities for T2-T2 links for the Higgs physics groups IN PROGRESS. 2) transfer issues in /Prod: Estonia -> RAL, CSCS -> FNAL, UCSD -> FNAL, {Florida,CIEMAT} -> CNAF. 3) massive analysis job failures at CIEMAT: being investigated. 4) SAM test issues affecting T2 sites: SAM analysis test failure at T2_US_UCSD; and more...

  • ALICE -

Sites / Services round table:

  • BNL: At around noon yesterday one of the four arms of a tape library broke; it was repaired about 4 hours later, so some 25% of the robot's tape inventory was inaccessible during that time. This slowed down reprocessing, but only slightly: the requests are ordered so that no error is thrown to the applications, which simply stall until the tape data becomes available again (see the sketch after this round table).

  • RAL: Batch problems after their SL5 migration continued yesterday, with ATLAS submitting a GGUS ticket at the end of the afternoon. This morning an update was applied to bring the server and the worker node clients to the same level, and the service is running much better.

  • FZK: Have a problem in that newly installed worker nodes cannot be brought back online in the batch system, so they are currently down to about 50% of their worker node capacity.

  • ASGC: The problem with tape migration for CMS has been identified as some wrongly configured cartridges in their tape pool. This has been corrected and migration was restarted a few hours ago.

  • CERN Databases: Streaming of conditions data from CMS online to offline stopped due to a large number of very large transactions. It has now been restarted and is under investigation with CMS.
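
To illustrate the behaviour described in the BNL tape-library item above (requests stay queued so that applications stall rather than fail while part of the tape inventory is unavailable), here is a generic sketch. The function and exception names are invented for illustration; this is not BNL's actual mass-storage code.

    import time

    class TapeUnavailableError(Exception):
        """Hypothetical error: the requested volume cannot be mounted yet."""

    def recall_file(path):
        # Placeholder for the real mass-storage recall call.
        raise TapeUnavailableError(path)

    def read_with_stall(path, poll_interval=300):
        """Keep the request pending: wait and retry instead of returning an
        error to the application, so jobs simply stall until the tape is back."""
        while True:
            try:
                return recall_file(path)
            except TapeUnavailableError:
                time.sleep(poll_interval)  # stall; the request stays ordered in the queue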

AOB:

-- JamieShiers - 2009-09-17
