Week of 130701

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Simone - SCOD, Xavier - IT-DSS, Maarten - Alice, Maria Dimou - GGUS, Adrian - Dashboards
  • remote: Vladimir - LHCb, Torre - ATLAS, Lisa - FNAL, Pepe - PIC, Saverio - CNAF, Wei-Jen - ASGC, David - CMS, Rolf - IN2P3, Tiju - RAL, Rob - OSG, Onno - NL-T1

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • HLT P1 farm: FYI, for the last couple of weeks the P1 resources have been used for Grid production: approx. 12k parallel jobs, 95% efficiency
    • T1
      • RAL-LCG2: DDM errors due to CASTOR problems on Friday; downtime over the weekend, cloud set to brokeroff. The downtime ended on Sunday when the problems were resolved and the site was restored to production. Tail of failures due to recovery jobs. GGUS:95160
      • IN2P3-CC DATADISK: the site does not publish that it is full. Checking consistency between the SRM DB and the ATLAS DDM available-space accounting. GGUS:95204

  • CMS reports (raw view) -
    • Very quiet weekend -- some MC production.
    • GGUS:95174 Possible CVMFS black hole node at T1_IT_CNAF.
    • Site availability affected by CMS site configuration issues on Wednesday should now be corrected.
    • Xavier for CMS: the observed 12 Gb/s limit (vs. the 20 Gb/s expected) from the CMS HLT farm to the CERN computer center (particularly the storage) does not look like a network problem (iperf tests show just over 19 Gb/s). CMS will investigate from their side.

  • ALICE -
    • CERN: last Thu many thousands of "ghost" jobs were found keeping job slots occupied, due to aria2c Torrent processes not exiting after their 15-minute lifetime. It is still not known what caused this change of behavior (no relevant changes are known on the ALICE side), but the matter has also had an impact on the routers connecting the WN subnets: their CPU usage went up considerably from the evening of Fri June 21, with some dangerously high spikes at times. To mitigate the situation for the weekend, late Friday afternoon the ALICE LSF quota was reduced from 15k to 7.5k jobs and a 7k cap was applied on the ALICE side as well, just to be sure. During the weekend we tested and deployed a patch for the ALICE job wrapper that now kills any such processes explicitly (see the sketch below). This workaround looks sufficient to avoid the problem.
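      A minimal sketch of such a cleanup, assuming Python and the psutil library are available on the worker node; the aria2c process name and the 15-minute lifetime come from the report above, while the helper itself and its use of psutil are illustrative assumptions rather than the actual AliEn wrapper code:

          import time
          import psutil  # assumption: psutil is installed on the WN

          ARIA2C_LIFETIME = 15 * 60  # intended lifetime of the torrent helper, in seconds

          def kill_stale_aria2c(lifetime=ARIA2C_LIFETIME):
              """Terminate aria2c processes that have outlived their intended lifetime."""
              now = time.time()
              for proc in psutil.process_iter(['name', 'create_time']):
                  try:
                      if proc.info['name'] == 'aria2c' and now - proc.info['create_time'] > lifetime:
                          proc.terminate()              # ask politely first
                          try:
                              proc.wait(timeout=10)
                          except psutil.TimeoutExpired:
                              proc.kill()               # escalate if SIGTERM is ignored
                  except (psutil.NoSuchProcess, psutil.AccessDenied):
                      continue                          # already gone, or not our process

          if __name__ == '__main__':
              kill_stale_aria2c()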

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing

Sites / Services round table:

  • PIC: started the migration of the WNs to SL6. 20% of the farm has been migrated and is available to the experiments; the migration should finish by July 15.
  • RAL: problem with the CASTOR stager on Friday; it was down from Friday afternoon to Sunday morning. On a side note, a CASTOR NameServer upgrade is foreseen for Wednesday (downtime information available in GOCDB).
  • OSG: holiday in the US on Thursday. No attendance from OSG.
  • GGUS: Release on 2013/07/10. NB! The GGUS certificate used to sign ALARM email notifications will change. Details in Savannah:138390

AOB: none

Thursday

Attendance:

  • local: AndreaS, Eddie, Przemek, Maarten/ALICE, Luc/ATLAS, Steve
  • remote: Dave/CMS, Ronald/NL-T1, Vladimir/LHCb, Rolf/IN2P3-CC, Gareth/RAL, Wei-Jen/ASGC

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • ATLAS services degraded: hardware failure on the afs240 server (GGUS:95431). Fixed.
      • RAL pointed out a bug in CVMFS 2.1.11; fixed in 2.1.12. To be followed up with the Task Force.
    • T1
      • RAL-LCG2: DDM errors due to Castor problems. Tail of failures due to recovery jobs. GGUS:95160. Solved
      • FZK: very low rate of transfers/subscriptions, due to a GPFS problem affecting 1/3 of the pools
      • IN2P3-CC: inconsistency in the SRM database concerning the ATLASDATADISK space token. GGUS:95204. Fixed.
      • BNL: jobs failing due to a proxy problem (GGUS:95332), caused by a misconfigured job wrapper. Fixed.
      • BNL: lost files; recovery ongoing

  • CMS reports (raw view) -
    • Some MC production, and prestaging of data ahead of the 2011 legacy data rereco at the T1s.
    • GGUS:95364, SAV:138483, SAV:138503, SAV:138502 -- SAM tests at CERN and ASGC failing due to a certificate change (resolution in progress)
    • GGUS:95358 CERN CEs ce201 and ce205 causing SAM test failures yesterday -- now looks better on our end.
    • CERN VM hypervisor storage problem on Tuesday affected some of the VMs used for production, including the CMS glideinWMS frontend (GGUS:95346); back online yesterday.
    • GGUS:95337 Re: Savannah-->GGUS bridging working only one way, and CMS therefore not always following the GGUS ticket after the bridge.
      • Contacted the submitters of the two tickets referenced; in one case it turned out he was simply waiting for his new certificate in order to be able to reply in GGUS.
      • More generally, we have reminded operators and shifters in CMS that they must follow the GGUS ticket once bridged. There are currently warnings to this effect when tickets are bridged.
      • In the short term, cms-crc-on-duty@cern.ch will be CC'ed on all bridged tickets starting from the next release (Thanks Guenter!)
      • Ultimately we are moving to use GGUS exclusively for infrastructure/site tickets.

  • ALICE -
    • CERN: the LSF job quota has been gradually increased back to its nominal level of 15k, since no more issues related to aria2c processes were observed
    • KIT: the concurrent jobs cap has been at 5k for 1 day now, but high loads on the firewall are still seen occasionally; to be continued
    • RAL: apparently the problem with high network traffic due to aria2c processes was also observed at RAL; the AliEn version on the VOBOX was upgraded around 14:00 CEST to avoid the problem for newly submitted jobs. [Gareth adds that according to RAL's network experts the problem is fixed by the change; he will send more details to check whether the patterns are consistent. There is a parallel discussion about ALICE moving to CVMFS, which according to Maarten might be validated by ALICE in a few weeks, although a precise estimate is difficult.]

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
    • T1:
      • GridKa: the problem with staging was solved (GGUS:95135), but there is still a backlog of jobs.
      • GridKa: pilots submitted through the KIT brokers end up Failed (8%) or Cleared (70%) (GGUS:94462)
      • RAL: CVMFS problem (GGUS:95435).
      • CNAF: problem with FTS transfers today; fixed by rolling back to the previous SRM release.

Sites / Services round table:

  • ASGC: ntr
  • IN2P3-CC: ntr
  • NL-T1: ntr
  • RAL:
    • Yesterday CASTOR's central services were successfully updated to version 2.1.13. We would like to upgrade the ATLAS stager next Wednesday [ok for ATLAS]; if successful, the other stagers would follow in a number of weeks because of holidays
    • ATLAS reported a disk server down
    • The CVMFS problem reported by LHCb should actually relate to two separate problems, one of them having just been fixed (see ticket)
  • CERN batch and grid services: ntr
  • Dashboards: ntr
  • Databases: ntr

AOB:
