Week of 150126
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.
Monday
Attendance:
- local: Luca (SCOD+storage), Maarten (ALICE), Ben (grid services)
- remote: Andrej (ATLAS), Mark (LHCb), Di (TRIUMF), Felix (ASGC), Alexander (NLT1), Lisa (FNAL), Rolf (IN2P3), Tiju (RAL), Antonio (CNAF), Sang Un Ahn (KISTI), Ulf (NDGF), Rob (OSG)
Experiments round table:
- ATLAS (Andrej) reports -
- lack of running jobs due to small issues in PanDA and JEDI
- some issue with misconfigured jobs accessing ATLAS web instead of CVMFS
- ALICE (Maarten) -
- continued high activity
- new record today: 75k concurrent jobs!
- data loss at SARA (NLT1)
- 108k ALICE files (~8 TB) were lost
- catalog cleanup and re-replication have finished
- 11804 files had a single replica there, none of which were very important anymore
- mainly logs and intermediate QA results
- SARA:
- 1 week needed for recovering the hardware affected (less available storage)
- there is no possibility of recovering any of the files that were present on the failed hardware
- LHCb (Mark) reports -
- "Legacy Run1 Stripping" waiting on FTS transfers and recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
- T0: NTR
- T1: NTR
Sites / Services round table:
- ASGC: NTR
- BNL:
- CNAF: NTR
- FNAL: NTR
- GridPP:
- IN2P3: NTR
- JINR:
- KISTI: NTR
- KIT: NTR
- NDGF: NTR
- NL-T1: NTR
- NRC-KI:
- OSG: NTR
- PIC:
- RAL: NTR
- TRIUMF:
- 4 hours of scheduled downtime for power outage Thu 29th January from 1:00am to 5:00am CET
- CERN batch and grid services: NTR
- CERN storage services:
- Tue 27th January 10:00am to 11:00am CET - CASTOR transparent update of the nameserver from 10
- Tue 27th January 9:30am to 12:00am CET - EOSATLAS upgrade to EOS Aquamarine
- Databases:
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
Thursday
Attendance:
- local: Luca (SCOD+storage), Maarten (ALICE), Manuel (grid services), Mark (LHCb), Pablo (GGUS)
- remote: Andrej (ATLAS), Christoph (CMS), Felix (ASGC), Dennis (NLT1), Lisa (FNAL), Michael (BNL), Rolf (IN2P3), Thomas (KIT), Salvatore (CNAF), Jeremy (GridPP), Gareth (RAL), Rob (OSG)
Experiments round table:
- ATLAS (Andrej) reports -
- issue with FTS3 callbacks not working properly, already in contact with FTS developers
- issue due to CONDOR update of the pilot factories (missing shared library)
- Reminder: next week is ATLAS Software and Computing week. Most probably, no one from ATLAS will be able to attend the daily meetings.
- CMS (Christoph) -
- Main activity: Digi-Reco of upgrade (2020+) MC at Tier-1s
- Next round of tape staging tests at Tier-1s started today
- Consistency checks (Site inventory vs. CMS catalogue) at all Tier-1s ongoing
- LHCb (Mark) reports -
- "Legacy Run1 Stripping" waiting on FTS transfers and recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
- T0: NTR
- T1: Status of the incident report for the lost files at SARA
Sites / Services round table:
- ASGC: NTR
- BNL: NTR
- CNAF: ongoing transparent upgrade of the Linux kernel and glibc
- FNAL: found an issue in the latest GGUS release: a message was sent to a different mail address (see Pablo's explanation)
- GridPP: NTR
- IN2P3: NTR
- JINR:
- KISTI:
- KIT: NTR
- NDGF:
- NL-T1: NTR
- NRC-KI:
- OSG: NTR
- PIC:
- RAL:
- on Monday morning deploy hotfix for CASTOR
- on Tuesday switch to the network backup link for OPN
- TRIUMF:
- CERN batch and grid services:
- Reminder about VOMRS changes. A testing instance is available and experiments are welcome to test and provide feedback (see GGUS:110227)
- Reminder about closure of AFS UI on Monday 2nd Feb (more details here)
- CERN storage services: NTR
- Databases:
- GGUS:
- Grid Monitoring: NTR
- MW Officer:
AOB:
--
AndreaSciaba - 2014-12-16