WLCG Operations Coordination Minutes, March 5, 2020
Highlights
Agenda
https://indico.cern.ch/event/893845/
Attendance
Alessandro Paolini (EGI), David Cohen (IL-HEPTier-2), Eric Fede (IN2P3), Johannes Elmsheuser (ATLAS), Giuseppe Bagliesi (CMS), Stephan Lammel (CMS), Concezio Bozzi (LHCb), Matthew Steven Doidge (Lancaster), Di Qing (Triumf), Pepe Flix (pic), Ron Trompert (Sara), Andreas Petzold (KIT), Catalin Condurache (EGI), Marian Babik (CERN, ETF, network), Gavin McCance (CERN compute), Mayank Sharma (CERN, WLCG ops), Dan van der Ster (CERN storage), Maarten Litmaath (CERN, ALICE, WLCG ops), Julia Andreeva (CERN, WLCG Ops)
Operations News
CERN databases down March 11 morning
On Wed March 11 from 08:00 CET there will be a network intervention
affecting many hundreds of databases at CERN. In particular:
- All Oracle databases will be inaccessible for at least a few minutes: OTG:0054761
- More than 600 DB-on-demand instances will be down: OTG:0054760
Many of those DBs are back-ends for grid and other kinds of services.
The respective service owners should already have been informed.
The downtimes are planned to be finished by 09:00 CET,
but unexpected fallout would not be unthinkable.
Special topics
CERN CEPH postmortem
See the presentation
Discussion
- Julia asked the experiments about the impact on their activities and whether communication had worked well.
- Johannes for ATLAS: It was clear that the underlying problem was CEPH. He pointed to a similar CEPH incident a month before. Recovery required a manual restart of multiple services that rely on CEPH storage. ATLAS had considered CEPH a sort of 100% reliable service. As a lesson learned, ATLAS is looking at what can be done on the experiment side to decrease the impact of CEPH outages.
- Dan pointed out that the previous incident had a different cause: it was due to a network problem. Measures to increase redundancy in such cases have been taken. Though pretty reliable, CEPH is not 100% reliable.
- Concezio for LHCb: There was an impact on production, which led to an inconsistent situation (some files were processed twice, some were not processed). He pointed out that it would help if a proper outage were registered in GocDB, since LHCb workflows take downtime info in GocDB into account.
- Gavin apologized that this was not done: no downtime was announced in GocDB for the CERN compute facility.
- Stephan for CMS: CMS was not hit too badly, since not too many services depend on CEPH. Moreover, there is a redundant setup at FNAL to protect against certain CERN outages. Some systems that depend on AFS, and thus indirectly on CEPH, suffered, but the impact was not serious.
- All experiments mentioned that, because the problem happened on a working day during CERN working hours, the necessary measures were taken quickly. Communication worked well.
Migration from CREAM. Status overview.
See the presentation
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual
- Some fallout from the Ceph incident at CERN on Thu Feb 20
ATLAS
- Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm.
- Large amounts of derivation production for the Moriond conferences put a bit of pressure on the grid disk space; some deletion is expected this week to alleviate the situation.
- Recovered from the second CERN CEPH incident rather quickly after experts restarted affected systems.
- No other major issues apart from the usual storage or transfer related problems at sites.
- Continuing the RAW/DRAW reprocessing campaign in data/tape carousel mode: finished with data18 and started with data17 (Feb 19), approaching 50% done. Not using SARA for the moment and staging its share from CERN/CTA. Expect to start data16 in around 2 weeks.
- Started TPC testing using WebDAV in production with a few sites. This uncovered a couple of storage-related problems, and testing is stopped until these are fixed.
- Grand unification of PanDA queues ongoing.
CMS
- global run with first phase-2 detector configuration in mid February
- running at about 265k cores during last month
- usual production/analysis mix (80%/20%)
- ultra-legacy re-reconstruction of 2017 and 2018 data in progress
- large pre-UL and UL Monte Carlo activities
- CERN CEPH outage and IPv6 networking issues
  - manual intervention to recover
  - global pool managers moved to Fermilab (revealed performance impact of monitoring queries)
  - at fortunate times, thus not too disruptive
Comments
- CMS would appreciate a follow-up on the IPv6 connectivity issues with web services at CERN. Julia will follow up.
LHCb
- business as usual, ~100k concurrent jobs
- Tier0+Tier1: running legacy stripping of Run1+2 pp collision data, 2016 nearly done, 2017 just started, all other years completed
- 2016 data stripping impacted by CERN CEPH incident of February 20th, but eventually managed to recover manually
- stripping of p-Ion and Ion-Ion collision data will follow
- MC production running elsewhere
Task Forces and Working Groups
Upgrade of the T1 storage instances for TPC
GDPR and WLCG services
Accounting TF
- New functionality of WAU is enabled in production. WAU now provides three views
- Validation of the monthly accounting data has been performed every month since November. We kindly ask site support teams to spend a couple of minutes each month on validation, in order to improve accounting data quality and fix any issues with APEL and WSSA.
- CRIC is taking over the generation of monthly accounting reports from the EGI accounting portal. The January accounting report generated by CRIC for T1s has been sent around for people to check and give feedback. CRIC accounting reports will become official starting from summer this year. By that time REBUS will be retired.
Archival Storage WG
Containers WG
CREAM migration TF
See the presentation
Documentation: CreamMigrationTaskForce
dCache upgrade TF
- Good progress. Out of 45 dCache sites, 34 have upgraded and 11 are still to go. There is still an issue with SRR, for which a fix from the dCache developers is awaited.
DPM upgrade TF
- Close to completion. Out of 55 sites, 39 have upgraded and re-configured. 1 site is suspended from operations. 2 still have to upgrade and 5 to re-configure. 3 sites have migrated to a different solution and 5 are in the process of migrating to a different solution.
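The DPM site counts above can be cross-checked with a short tally. This is only a sketch: it assumes the six reported categories are mutually exclusive, and the names `status_counts`, `tally` and `fraction_done` are illustrative, not part of any WLCG tool. Counting "upgraded and re-configured" plus "migrated" as done is likewise an assumption made here for the example.

```python
# Site counts copied from the DPM upgrade TF item above.
TOTAL_SITES = 55

# Assumption: each site falls into exactly one of these categories.
status_counts = {
    "upgraded and re-configured": 39,
    "suspended from operations": 1,
    "still to upgrade": 2,
    "still to re-configure": 5,
    "migrated to a different solution": 3,
    "migrating to a different solution": 5,
}


def tally(counts):
    """Return the total number of sites across all categories."""
    return sum(counts.values())


def fraction_done(counts, total, done_keys):
    """Return the share of sites whose category is listed in done_keys."""
    return sum(counts[key] for key in done_keys) / total


# Assumption: "done" means upgraded and re-configured, or already migrated away.
DONE_KEYS = ["upgraded and re-configured", "migrated to a different solution"]

accounted = tally(status_counts)
done = fraction_done(status_counts, TOTAL_SITES, DONE_KEYS)

print(f"accounted for: {accounted} of {TOTAL_SITES} sites")
print(f"done: {done:.0%}")
```

If the tally does not match the stated total, either a category is missing from the report or some sites are counted twice.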
Information System Evolution TF
- The REBUS retirement plan has been presented to the MB and agreed. In April REBUS will go into read-only mode. Manual injection of storage capacity and storage accounting data in REBUS will be replaced by the accounting validation in CRIC, which is already in production and has been performed by sites monthly since November 2019. WLCG CRIC will become the primary data source for pledges and federation topology. REBUS will be retired at the beginning of summer.
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- LHCOPN/LHCONE Asia (8-9th March) was re-scheduled to take place during the HSF/WLCG workshop, but as the workshop is postponed there will be a half-day virtual meeting instead (on 13th of May)
- perfSONAR infrastructure status - version 4.2.3 was released this week
- OSG/WLCG infrastructure
- New dashboards were created to provide a high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- Still work in progress but feedback is welcome
- Dashboards are already showing some of the results of the analytical studies that were performed as part of the SAND project
- The aim is to make it easier to identify new issues that are not easy for the experiments' data management systems to spot (network instabilities that could impact network performance in non-deterministic ways).
- Looking into dublin-traceroute (extension of paris-traceroute) and possible integration in perfSONAR
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
- Next meeting is scheduled for the 2nd of April