Week of 141208
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday
Attendance:
- local: Alessandro (ATLAS), Christoph (CMS), Maarten (SCOD + ALICE), Nathalie, Pepe (PIC), Stefan (LHCb), Tsung-Hsun (ASGC)
- remote: Christian (NDGF), Dimitri (KIT), Gareth (RAL), Lisa (FNAL), Michael (BNL), Onno (NLT1), Rob (OSG), Rolf (IN2P3)
Experiments round table:
- ATLAS reports (raw view) -
- CentralService & Tier0/Tier1s
- BNL-ATLAS GGUS:110582 not really a site issue: it was due to changes in the periodic-remove algorithm used in the AutoPilotFactory.
- Daily Activity overview:
- not much ongoing in the system, for several reasons: 1) not much workload from the MC and Derivation Framework coordinators; 2) some issues with Rucio (described in detail below), which left some tasks broken (this could be improved with more retries and exponentially increasing retry delays; a sketch follows below); 3) some Derivation Framework tasks were failing due to SW issues.
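A minimal, generic sketch of such a retry policy with exponentially increasing delays, as a hypothetical Python illustration (the function name and parameters here are not actual ATLAS/Rucio code):

import random
import time

def retry_with_backoff(action, max_retries=5, base_delay=1.0, max_delay=300.0):
    # Retry 'action' (any callable that raises on a transient failure)
    # with exponentially increasing delays; illustrative names only.
    for attempt in range(max_retries):
        try:
            return action()
        except Exception:  # in practice, catch only the transient error type
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the failure
            # the delay doubles each attempt, capped, with jitter added
            # to avoid synchronized retry bursts from many tasks
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))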
- CMS
- Trouble with the PhEDEx data service over the weekend (GGUS:110600)
- Problems reading data from EOS for Tier-0 tests on Friday (GGUS:110593)
- ALICE -
- KIT: many raw data files are still being read remotely
- an inconsistency was found between the AliEn File Catalog and the SE inventory
- a consistency check is being prepared (a sketch of the idea follows below)
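A minimal sketch of the kind of catalog/SE consistency check meant above, assuming plain-text dumps with one file path per line (the dump file names are hypothetical; the real check runs against the AliEn File Catalog and the SE inventory):

def load_paths(dump_file):
    # one file path per line; blank lines are ignored
    with open(dump_file) as f:
        return {line.strip() for line in f if line.strip()}

catalog = load_paths("alien_catalog_dump.txt")   # what the catalog lists for this SE
inventory = load_paths("se_inventory_dump.txt")  # what the SE actually stores

dark_data = inventory - catalog    # on the SE but unknown to the catalog
lost_files = catalog - inventory   # registered but missing from the SE

print("dark data: %d files, lost: %d files" % (len(dark_data), len(lost_files)))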
- LHCb reports (raw view) -
- MC and user jobs ongoing. "Legacy Run1 Stripping" campaign running at full steam and progressing well
- T0: Very low number of stripping job errors, but CERN still shows an error rate several times higher than the other T1 sites; we would like to understand this. Resubmission of the same file usually works (GGUS:110604)
- T1:
- IN2P3: downtime tomorrow, draining tonight
- RAL: Again a low total number of errors, but several stripping jobs failed with "bus error" during the weekend
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF:
- FNAL: ntr
- GridPP:
- IN2P3:
- downtime tomorrow, batch system will be drained as of 22:00 CET
- JINR:
- KISTI:
- KIT: ntr
- NDGF: ntr
- NL-T1:
- OSG: ntr
- PIC:
- dCache upgrade to 2.10 next Mon
- HTTP read-only access to be enabled also for LHCb next week
- RAL:
- there has been a problem with the ATLAS SRM since yesterday
- a downtime has been declared
- the trouble may come from "bad" file names containing double slashes (a normalization sketch follows after this round table)
- under investigation
- RRC-KI:
- TRIUMF:
- CERN batch and grid services:
- CERN storage services:
- Databases:
- GGUS:
- GGUS update on the 10th of December, including a cleanup of unused VOs and the removal of the ticket category 'SPAM'
- Supporters using SNOW are reminded to keep using GGUS as the main tool until the synchronization issues have been solved
- Grid Monitoring:
- MW Officer:
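Regarding the RAL ATLAS SRM issue above: double slashes in file names can make the same file appear under two different paths in a namespace. A hypothetical normalization helper in Python (not RAL's actual fix, which is still under investigation):

import re

def normalize_lfn(path):
    # Collapse runs of slashes so that '/atlas//x' and '/atlas/x'
    # resolve to the same entry before comparison or registration.
    return re.sub(r"/{2,}", "/", path)

assert normalize_lfn("/atlas//datadisk///file.root") == "/atlas/datadisk/file.root"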
AOB:
Thursday
Attendance:
- local: Daniel (grid services), Maarten (SCOD + ALICE), Pablo (GGUS + grid monitoring), Stefan (LHCb), Ulf (NDGF)
- remote: Andrej (ATLAS), Antonio (CNAF), Christian (NDGF), Christoph (CMS), Dennis (NLT1), Michael (BNL), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Thomas (KIT), Tiju (RAL), Tsung-Hsun (ASGC), Young-Bok (KISTI)
Experiments round table:
- ATLAS reports (raw view) -
- CentralService & Tier0/Tier1s
- FZK issue GGUS:110693: not reproducible, but "solved by itself".
- Andrej:
- a few sites complained about jobs from 1 task running in multi-core mode in single-core queues; the task was killed and the cause is being investigated to avoid such incidents in the future
- low activity because of Rucio and ProdSys2 commissioning, but also awaiting SW readiness for Run 2 Monte Carlo campaign
- disks are quite full: the normal deletion activities are being monitored carefully after last week's incident
- CMS reports (raw view) -
- Still some trouble with CMS web services
- Testing some rather high I/O jobs on HLT (with EOS experts in the loop)
- LHCb reports (raw view) -
- "Legacy Run1 Stripping" campaign running full steam and progressing well + MC and user jobs.
- T0: Two GGUS tickets open and not answered yet
- GGUS:110604 - investigation of higher failure rate for stripping jobs at CERN
- GGUS:110583 - SAM probes were timing out at CERN queues
- T1:
- IN2P3:
- Stefan: ~1500 jobs had to be killed on the morning of the intervention; next time we will try to reduce the fallout
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL:
- GridPP:
- IN2P3:
- the downtime on Tue generally went OK; Xrootd was back later than foreseen
- JINR:
- KISTI: ntr
- KIT:
- updating CMS dCache to 2.11 today
- updated ATLAS dCache on Tue and Wed
- the CMS SRM was accidentally included in the downtime extension
- Christoph: fixed after its effect on PhEDEx operations was noticed
- a high load on the GridFTP doors has been observed since the upgrade; being investigated
- NDGF:
- 1 tape with ATLAS data got destroyed; under investigation
- NL-T1: ntr
- OSG: ntr
- PIC:
- next Mon downtime for dCache upgrade to 2.10
- RAL:
- the ATLAS SRM instabilities were fixed on Mon
- CMS and ATLAS CASTOR head nodes were upgraded to SL6 on Tue and Wed
- RRC-KI:
- TRIUMF:
- CERN batch and grid services: ntr
- CERN storage services:
- Databases:
- GGUS:
- New release of GGUS deployed on Wednesday. All alarms have been acknowledged.
- SNOW has fixed the synchronization issue with GGUS. Supporters using GGUS and SNOW can go back to using their preferred interface.
- Grid Monitoring: ntr
- MW Officer:
AOB: