Week of 130722
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The scod rota for the next few weeks is at
ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: AndreaS, Steve, Maarten, Ken/CMS, Eddie, Massimo
- remote: Alexander/NL-T1, Michael/BNL, Peter/ATLAS, Xavier/KIT, Saverio/CNAF, Boris/NDGF, Nadia/IN2P3-CC, Kyle/OSG, Tiju/RAL, Lisa/FNAL, Wei-Jen/ASGC, Sang-Un/KISTI
Experiments round table:
- ATLAS reports (raw view) -
- T0/Central services
- T1
- BNL file recovery progressing
- RAL disk server problem. All files successfully recovered.
- SARA transfer errors ("Failed to contact on remote SRM"), GGUS:95911. Fixed
- CMS reports (raw view) -
- Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
- From last week, RAL GGUS:95825 was declared a Castor bug.
- Castor T1TRANSFER service was degraded again over the weekend, but this is still not a real problem; it probably deserves some discussion with CMS as to what the shifter instructions should be. [Massimo: the problem is related to an increasing mismatch between the CASTOR tuning and the evolution of the CMS usage patterns; it would be good to have a meeting to discuss these issues.]
- GGUS:95887 to T1_FR_IN2P3 is about file read errors. The ticket came in on Friday; the site started working on it today.
- Wigner T-Systems 100 Gbit link down for several hours over the weekend. [Maarten: Wigner vs. the CERN computer centre should be transparent: do not worry too much about SLS errors unless they have a direct effect. Ken: will check whether there was any impact on CMS.]
- ALICE -
- CERN: all CERN CEs started refusing proxies of ALICE and other VOs around 18:20 Sunday evening; fixed around 09:15 Monday morning (TEAM ticket GGUS:95914). [Maarten: the problem was most likely related to a bad ARGUS configuration and was solved by Ulrich, but there are no details in the ticket.]
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3-CC: ntr
- KIT: ntr; just a reminder of the downtime starting tomorrow, when the queues will be closed to update the SE for LHCb
- KISTI: ntr
- NDGF: SRM downtime tomorrow from 7:00 to 9:00, to update dCache and install a new host certificate; some disk pools will also be updated.
- NL-T1: ntr
- RAL: tomorrow, an "at risk" intervention on the site firewall, followed from 10:00 to 16:00 by a CASTOR upgrade for LHCb and CMS
- OSG: ntr
- CERN batch and grid services: ntr
- CERN storage services: just upgraded SRM for all experiments (apart from LHCb, which was upgraded last week). The upgrade should have been transparent.
- Dashboards: ntr
AOB:
Thursday
Attendance:
- local: Eddie, Eva, Luca M, Maarten, Marcin, Nilo, Steve
- remote: Boris, Dennis, Federico, Gareth, Kyle, Lisa, Matteo, Michael, Nadia, Pavel, Peter, Sang-Un, Stefano, Wei-Jen
Experiments round table:
- ATLAS reports (raw view) -
- T0/Central services
- ATLR database outage yesterday evening; looks fine now (INCIDENT)
- Marcin:
- since ~19:00 there was a problem with the NFS filer serving several DBs including ATLR
- the fail-over mechanism was unable to move the load from one node to another
- the affected DBs thus remained inaccessible
- a priority-one case was opened with NetApp
- the problem was solved shortly after midnight
- there was no data loss
- a hardware problem was identified
- spare parts are expected this Friday
- Peter: are the DBs currently at risk?
- Marcin: yes, fail-over is not possible at the moment
- T1
- All PRODDISK endpoints except CERN and BNL have been removed from ATLAS data management
- FZK and DE T2 sites now configured to use FTS3
- CMS reports (raw view) -
- T1 disk/tape separation is getting done (RAL is the first site) and we are starting to nail down the details of how to send user analysis jobs to T1s using pilots
- Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
- GGUS:95887 to T1_FR_IN2P3 about file read errors was solved by reconfiguring dCache there (more time for dCache movers). The details are not understood.
- GGUS:96102: file read error at T1_UK_RAL. The site is now looking into this.
- nothing else worth mentioning
- LHCb reports (raw view) -
- Mostly MC productions ongoing, tail of reprocessing and restripping campaign
- T0:
- CERN: Issues with Oracle DBs
- T1:
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF
- ATLAS job failures due to lack of disk space are being investigated
- CMS file system maintenance: queues are being drained and HammerCloud tests will fail; the intervention should be finished tomorrow
- FNAL - ntr
- IN2P3 - ntr
- KISTI
- the second half of the 2013 CPU pledge will become available in Sep
- KIT
- still in scheduled downtime, no problems observed so far; expected to finish a bit earlier than announced
- main firewall upgraded OK
- dCache work ongoing
- GPFS, CVMFS, Xrootd, FTS, ... done
- NDGF
- yesterday there was a problem when ATLAS tried writing to disks in Sweden for which SRM reported the wrong size: the disks became full; the matter is being investigated
- NL-T1 - ntr
- OSG - ntr
- RAL
- CASTOR upgrade for CMS and LHCb went OK on Tue
- generic instance (also used by ALICE) will be done next Tue
- a batch server problem is being investigated
- dashboards - ntr
- databases
- grid services
- yesterday 4 of the 8 CREAM CEs were upgraded to EMI-3
- storage - ntr
AOB: