Week of 131014
WLCG Operations Call details
To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: Belinda, Maarten, Pablo, Ulrich
- remote: Alexey, David, Gareth, Kyle, Lucia, Onno, Pepe, Peter, Rolf, Sang-Un, Thomas, Wei-Jen, Xavier
Experiments round table:
- ATLAS reports (raw view) -
- NTR
- issues from last Thursday have been addressed.
- CMS reports (raw view) -
- Continuing 2011 Legacy rereco and some MC production -- otherwise very little to report!
- GGUS:97974 CVMFS bad node at KIT -- fixed right away
- GGUS:97938 Transfers failing to T2_RU_SINP since end of September -- still waiting for news!
- ALICE -
- the outage due to the AliEn CA having expired was resolved Thu late afternoon; all production sites were working again by Fri afternoon
- Pablo: the renewed CA has a lifetime of 20 years
- LHCb reports (raw view) -
- Main activities are incremental stripping (T0/1) and Simulation
- T0: NTR
- T1: NTR
Sites / Services round table:
- ASGC
- power cut this morning, all services down, working on recovery
- CNAF - ntr
- IN2P3 - ntr
- KISTI - ntr
- KIT
- Wed-Thu at-risk downtime for tape library maintenance requiring a 1h cut on both days
- NDGF
- Wed morning FTS downtime to patch Oracle
- NLT1
- FTS downtime Tue
- NDGF-NLT1 transfer failures reported in GGUS:97937: no network issue found on our side; can NDGF experts look into their dCache logs?
- Thomas: will inform the experts
- OSG
- was there an issue with MyWLCG late last week?
- Maarten: nothing we know of here
- Pablo: there was a transparent DB cleanup on Tue
- PIC - ntr
- RAL - ntr
- dashboards - ntr
- GGUS: (Text entered by MariaD before leaving for CHEP)
- Progress on the GGUS ALARMers' disappearance can be followed in GGUS:97755.
- Progress on the missing operators' notification for the Oct 7th ALICE ALARM can be found in INC:406556 and GGUS:97817.
- the problem was due to the files needed to sign emails having been inadvertently moved to their new location ahead of time
- a test alarm was sent OK to KIT after correcting the issue
- a full set of alarm tests will be done on Oct 23 for the next GGUS release
- grid services
- high load on the batch system last week, reaching the maximum of 300k managed jobs; the 2 top users have throttled their jobs and the service is OK for the time being
- the issue will be mitigated as more SLC6 capacity gets put into production
- storage
- CASTOR upgrades with downtimes from 09:00 to 14:00 CEST:
- Oct 21 ATLAS
- Oct 28 ALICE + CMS
- Oct 29 LHCb + public
- EOS upgrades:
- Oct 23 ATLAS
- others to be decided later
AOB:
Thursday
Attendance:
- local: Belinda, Jacobo, Luca M, Maarten
- remote: Alexey, David, Kyle, Lisa, Lucia, Michael, Pepe, Peter, Rolf, Ronald, Sang-Un, Thomas, Tiju, Wei-Jen, Xavier
Experiments round table:
- ATLAS reports (raw view) -
- Central services
- FTS error with proxy handling, a known bug seen previously. Can central services take a look, understand the issue and perhaps apply a patch? (GGUS:97960) (ATLAS Twiki)
- Maarten: there have been similar errors on a few FTS-2 instances in the last few months; those instances were running the latest version, so the problems would not have the same cause as last year; the first thing to try for the FTS admin is to restart the service, which will force the daemons to reload their memory caches and/or certain DB tables with up-to-date information; feel free to open tickets for the FTS developers, but these recurrent issues give more ammunition for a quick replacement of FTS-2 by FTS-3, hopefully still this year
- Luca: GGUS:97662 is about a missing file in CASTOR-ATLAS, which got garbage-collected because its directory has a no-tape file class, while ATLAS expected the file to be on tape
- Peter: the directory file class actually looks correct, will follow up with ADC colleagues and update the ticket
- ALICE -
- CERN: as of late yesterday evening first successful jobs on SLC6 using CVMFS
- the SLC6 share of ALICE has been very low so far...
- SLC5 jobs are still using Torrent for now
- US T2 LLNL needed to be switched off due to government shutdown!
- should be back again soon...
- LHCb reports (raw view) -
- Main activities are incremental stripping (T0/1) and Simulation
- T0: NTR
- T1: NTR
Sites / Services round table:
- ASGC
- recovered OK from power cut on Mon
- BNL - ntr
- CNAF - ntr
- FNAL - ntr
- IN2P3
- on Oct 23 the SL6 batch capacity will increase from 50% to 90%
- KISTI - ntr
- KIT - ntr
- NDGF - ntr
- NLT1 - ntr
- OSG
- RSV transfers to SAM failed ~8h ago; the data are being sent again and the cause is under investigation
- PIC - ntr
- RAL - ntr
- dashboards
- SAM DB intervention Mon Oct 21 14:00 CEST, test results may be delayed by 30 min
- storage
- reminder: ATLAS CASTOR and EOS interventions on Mon and Wed next week
AOB: