Week of 150817
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.
Monday
Attendance:
- local: AndreaV/SCOD, Michal/ATLAS, LucaM/Storage, Jesus/Storage, Alberto/Grid, Andrei/Databases
- remote: LucaT/LHCb, Michael/BNL, Lisa/FNAL, Jeremy/GridPP, Sang-Un/KISTI, Christian/NDGF, Onno/NLT1, Pepe/PIC, John/RAL
Experiments round table:
- ATLAS reports -
- BNL - transfers to LOCALGROUPDISK were failing because of insufficient space; BNL added space
- RAL - T0 export to DATATAPE is failing with timeouts
- CERN - T0 export from TZDISK is failing with "could not open connection to srm-atlas.cern.ch"
- ALICE -
- High activity.
- CERN: reco jobs at times still unable to read raw data files from CASTOR.
- The CASTOR experts are aware and applied a temporary cure on Sun, thanks!
- To be followed up.
- LHCb reports -
- Data processing: validation of 25ns data has started, ~70% already processed
- No issues from Tier0/1
Sites / Services round table:
- ASGC: -
- BNL/Michael: ntr
- CNAF: -
- FNAL/Lisa: ntr
- GridPP/Jeremy: ntr
- IN2P3: ntr
- JINR: -
- KISTI/Sang-Un: ntr
- KIT: -
- NDGF/Christian: there will be a downtime on Wednesday; data will be unavailable in the evening
- NL-T1/Onno:
- RAID issue on a dCache pool node: fixed this morning, but still needs to be investigated (the same problem occurred 5 weeks ago). The node will be sent back to the vendor after the data has been moved (migration should be complete on Wednesday)
- following up another dCache issue that was disabling pools. One of the two causes was fixed today (a bug in the mass storage script), which greatly reduced the issue; the other cause is a timeout in the Berkeley DB, for which a workaround exists (moving the DB to a different block device)
- NRC-KI: -
- OSG: ntr
- PIC/Pepe: ntr
- RAL/John: at risk tomorrow for one of the site routers
- TRIUMF: -
- CERN batch and grid services / Alberto:
- New CE based on HTCondor CE (ce503) is now broadcasting in the BDII.
- CERN storage services / LucaM:
- following up the issues reported by ALICE and by ATLAS
- will do an emergency change in CASTOR tomorrow for CMS to allow third party copy (had an issue last week)
- Databases / Andrei : ntr
- GGUS: -
- Grid Monitoring:
- Availability and Reliability reports for July sent to the WLCG office
- MW Officer: -
AOB: ntr
Thursday
Attendance:
- local: AndreaV/SCOD, Michal/ATLAS, LucaT/LHCb, Christoph/CMS, Alberto/Grid, LucaC/Databases, Jesus/Storage
- remote: Dennis/NLT1, Christian/NDGF, John/RAL, Sang-Un/KISTI, DiQing/TRIUMF, Jeremy/GridPP, Pepe/PIC, Rolf/IN2P3, Thomas/KIT, Rob/OSG
Experiments round table:
- ATLAS reports -
- CERN - expired credentials failures - host certificates on the two machines were updated
- CERN - "Unable to issue PrepareToGet request to Castor"
- ATLAS is replicating files which are in Taiwan and CERN but transfers from Taiwan are failing
- to improve the situation, smaller BringOnLine requests should be used - can be done in next FTS release
- SARA - no pilots because of DNS issue
- PIC - no pilots at HIMEM
- CMS reports -
- Holiday period - not much computing activity
- Transfer problem from CASTOR to (CERN) EOS
- Issues with third-party transfers
- Needed a fix on CASTOR
- Details: GGUS:115598
- Outages of DAS (Data Aggregation Service) last week
- The service got overloaded, which caused further problems for other CMS web services
- Some tools have been changed to query the underlying services directly (bypassing the Aggregation Service)
- ALICE -
- High activity.
- NIKHEF: CVMFS was stuck after an update from CentOS 6.6 to 6.7 (GGUS:115783)
- LHCb reports -
- Data Processing: second run of validation for 25ns data started with new calibration, ~90% already processed. New stripping production will start later today.
- T0
- T1
- Nothing to report
- CVMFS problem at NIKHEF-ELPROD (GGUS:115767) due to an upgrade of their WNs from CentOS 6.6 to CentOS 6.7. Promptly solved.
- [LucaT: is the CVMFS issue understood? Is there a risk that other sites see the same issue if they upgrade? Dennis: this was supposed to be a standard upgrade of WNs and VO boxes, but CVMFS got stuck and an autofs restart was needed; it is not the first time this issue has happened. AndreaV: please report it to the CVMFS team if not already done. Dennis: ok, will do.]
- PBS issues at PIC (GGUS:115748) due to high load on a few WNs: the PBS server had problems refreshing information and denied some connections. Promptly solved.
Sites / Services round table:
- ASGC: -
- BNL: -
- CNAF: -
- FNAL: -
- GridPP/Jeremy: ntr
- IN2P3/Rolf: ntr
- JINR: -
- KISTI/Sang-Un: ntr
- KIT/Thomas: ntr
- NDGF/Christian: ntr
- NL-T1/Dennis: nta
- NRC-KI: -
- OSG/Rob: there was a mail problem at FNAL yesterday; if anyone sent tickets to OSG, these have not reached their destination, so please resend them
- PIC: -
- RAL/John: ntr
- TRIUMF/DiQing: ntr
- CERN batch and grid services/Alberto: ntr
- CERN storage services/Jesus: ntr
- Databases/LucaC: ntr
- GGUS: -
- Grid Monitoring: -
- MW Officer: -
AOB: -