Week of 200217
WLCG Operations Call details
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Portal
- Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local: Kate (WLCG, chair), Julia (WLCG), Maarten (WLCG, ALICE), Michal (ATLAS), Olga (computing), Marian (network, monitoring), Remy (storage), Nicolo (storage)
- remote: Andrew (NL-T1), Jens (NDGF), Vladimir (LHCb), Darren (NDGF), Elena (CNAF), Christoph (CMS), Dave M (FNAL), Sang Un (KISTI)
Experiments round table:
- ATLAS reports
- Activities:
- Issues
- ATLAS dashboards showed no data between Wednesday evening and Thursday morning (RQF:1527003)
- The monitoring project on HDFS ran out of quota for the "number of files" (OTG:0054839)
- A few bins in the running jobs plots were not fully filled (RQF:1513867, RQF:1529407)
- The InfluxDB database had been unresponsive since last night. The database is operational again and the gaps have been recovered.
- atlas-in2p3-cc-frontier degraded (GGUS:145575)
- Pilots failing at IN2P3-CC due to a mistake in the HTCondorCE configuration
- Pilots at BNL_PROD_UCORE are failing with "atlas software repository NOT found" (GGUS:145539, GGUS:145529) due to a few broken WNs
- Transfer timeouts from FZK-LCG2_MCTAPE (GGUS:145564)
- LHCb reports
- Activity:
- Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
- Staging for 2016 is almost finished. Staging for 2017 is ongoing.
- Issues:
Sites / Services round table:
- ASGC: nc
- BNL: nc
- CNAF: NTR
- EGI: nc
- FNAL: dCache upgrades finished for both disk and tape
- IN2P3: IN2P3-CC will be in maintenance on Tuesday, March 17th. As usual, details will be available one week before the event. CEs and SEs are foreseen to be in downtime for the whole day.
- JINR: NTR
- KISTI: NTR
- KIT: NTR
- NDGF: NTR
- NL-T1: NTR
- NRC-KI: nc
- OSG: nc
- PIC: Tomorrow, Tue 18 Feb, PIC's firewall will be reconfigured. The intervention should be transparent for users.
- RAL: nc
- TRIUMF: (Feb17 is a holiday here)
- TRIUMF_SIM had no jobs running for ~1 day: worker nodes (3K cores) located at TRIUMF were inadvertently using legacy configurations for the CVMFS cache squid servers.
- Three tape cartridges had media errors; all 324 files were recovered.
- CERN computing services: NTR
- CERN storage services: NTR
- CERN databases: NTR
- GGUS: NTR
- Monitoring:
- Final reports for the January availability sent around
- ETF LHCb/ALICE: GGUS:145475 - glite-ce clients broken in latest UMD4; glite-ce and condor clients coexistence appears challenging going forward
- ETF LHCb: change of VOMS role used in testing is planned (to /lhcb/Role=samtest)
- MW Officer: nc
- Networks: EELA-UTFSM and CBFP networking issues were resolved
- Security: nc
AOB: Christoph asked why LHCb is planning the change of the VOMS role. Marian explained that the change is intended to avoid using the production credential for testing. Maarten will check whether such a high-effort change is necessary.