

WLCG Operations Coordination Minutes, March 5, 2020

Highlights
Agenda
Attendance
Operations News
- CERN databases down March 11 morning
Special topics
- CERN CEPH postmortem
  - Discussion
- Migration from CREAM. Status overview.
Middleware News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
- ALICE
- ATLAS
- CMS
  - Comments
- LHCb
Task Forces and Working Groups
Action list
- Specific actions for experiments
- Specific actions for sites
AOB

Highlights

Agenda

https://indico.cern.ch/event/893845/

Attendance

Alessandro Paolini (EGI), David Cohen (IL-HEPTier-2), Eric Fede (IN2P3), Johannes Elmsheuser (ATLAS), Giuseppe Bagliesi (CMS), Stephan Lammel (CMS), Concezio Bozzi (LHCb), Matthew Steven Doidge (Lancaster), Di Qing (Triumf), Pepe Flix (pic), Ron Trompert (Sara), Andreas Petzold (KIT), Catalin Condurache (EGI), Marian Babik (CERN, ETF, network), Gavin McCance (CERN compute), Mayank Sharma (CERN, WLCG ops), Dan van der Ster (CERN storage), Maarten Litmaath (CERN, ALICE, WLCG ops), Julia Andreeva (CERN, WLCG ops)

Operations News

CERN databases down March 11 morning

On Wed March 11 from 08:00 CET there will be a network intervention
affecting many hundreds of databases at CERN. In particular:

All Oracle databases will be inaccessible for at least a few minutes: OTG:0054761

More than 600 DB-on-demand instances will be down: OTG:0054760

Many of those DBs are back-ends for grid and other kinds of services.
The respective service owners should already have been informed.

The downtimes are planned to be finished by 09:00 CET,
but unexpected fallout would not be unthinkable.

Special topics

CERN CEPH postmortem

See the presentation

Discussion

Julia asked the experiments about the impact on their activities and whether communication had worked well.
Johannes for ATLAS: It was clear that the underlying problem was CEPH. He pointed to a similar CEPH incident a month earlier. Recovery required a manual restart of multiple services that rely on CEPH storage. ATLAS had considered CEPH to be a practically 100% reliable service. As a lesson learned, ATLAS is looking at what can be done on the experiment side to reduce the impact of CEPH outages.
Dan pointed out that the previous incident had a different cause, namely a network problem. Measures to increase redundancy in such cases have been taken. Though pretty reliable, CEPH is not 100% reliable.
Concezio for LHCb: There was an impact on production, which led to some inconsistencies (some files were processed twice, some were not processed at all). He pointed out that it would help if a proper outage were registered in GocDB, since LHCb workflows take the downtime information in GocDB into account.
Gavin apologized that this had not been done: no downtime was declared in GocDB for the CERN compute facility.
Stephan for CMS: CMS was not hit too badly, since not many of its services depend on CEPH. Moreover, there is a redundant setup at FNAL to protect against certain CERN outages. Some systems that depend on AFS, and thus indirectly on CEPH, suffered, but the impact was not serious.
All experiments mentioned that, because the problem happened on a working day during CERN working hours, the necessary measures were taken quickly. Communication worked well.
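
To illustrate why a registered outage matters to workflows that consult downtime information, the check can be sketched roughly as below. This is a minimal sketch and not LHCb's actual logic: the record layout, the service name and the times are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone


def in_downtime(downtimes, service, when):
    """Return True if `service` has a declared outage covering `when`.

    `downtimes` is a list of dicts in a hypothetical layout (service,
    severity, start, end); it is not the real GocDB record format.
    """
    return any(
        d["service"] == service
        and d["severity"] == "OUTAGE"
        and d["start"] <= when < d["end"]
        for d in downtimes
    )


# Hypothetical downtime record; name and times are illustrative only.
downtimes = [
    {
        "service": "example-compute-service",
        "severity": "OUTAGE",
        "start": datetime(2020, 2, 20, 9, 0, tzinfo=timezone.utc),
        "end": datetime(2020, 2, 20, 15, 0, tzinfo=timezone.utc),
    }
]

during = datetime(2020, 2, 20, 12, 0, tzinfo=timezone.utc)
after = datetime(2020, 2, 21, 12, 0, tzinfo=timezone.utc)

print(in_downtime(downtimes, "example-compute-service", during))  # True
print(in_downtime(downtimes, "example-compute-service", after))   # False
```

With no outage registered, such a check returns False throughout the incident, so a workflow relying on it would keep submitting work to the affected facility.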

Migration from CREAM. Status overview.

See the presentation

Middleware News

Useful Links
Baselines/News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

Mostly business as usual
Some fallout from the Ceph incident at CERN on Thu Feb 20

ATLAS

Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm.
Large amounts of derivation production for the Moriond conferences put a bit of pressure on the grid disk space; some deletion is expected this week to alleviate the situation.
Recovered from the second CERN CEPH incident rather quickly after experts restarted affected systems.
No other major issues apart from the usual storage or transfer related problems at sites.
Continuing the RAW/DRAW reprocessing campaign in data/tape carousel mode: finished with data18 and started with data17 (Feb 19), approaching 50% done. Not using SARA for the moment and staging its share from CERN/CTA. Expect to start data16 in around 2 weeks.
Started TPC testing using WebDAV in production with a few sites. This uncovered a couple of storage related problems, and testing has stopped until these are fixed.
Grand unification of PanDA queues ongoing.

CMS

global run with first phase-2 detector configuration in mid February
running at about 265k cores during last month
- usual production/analysis mix (80%/20%)
- ultra-legacy re-reconstruction of 2017 and 2018 data in progress
- large pre-UL and UL Monte Carlo activities
CERN CEPH outage and IPv6 networking issues
- manual intervention to recover
- global pool managers moved to Fermilab (revealed performance impact of monitoring queries)
- at fortunate times thus not too disruptive

Comments

CMS would appreciate a follow-up on the IPv6 connectivity issues with web services at CERN. Julia will follow up.

LHCb

business as usual, ~100k concurrent jobs
Tier0+Tier1: running legacy stripping of Run1+2 pp collision data, 2016 nearly done, 2017 just started, all other years completed
- 2016 data stripping impacted by CERN CEPH incident of February 20th, but eventually managed to recover manually
stripping of p-Ion and Ion-Ion collision data will follow
MC production running elsewhere

Task Forces and Working Groups

Accounting TF

Validation of the monthly accounting data has been performed every month since November. We kindly ask site support teams to spend a couple of minutes each month on validation, in order to improve the accounting data quality and fix any issues with APEL and WSSA.

CRIC is taking over the generation of monthly accounting reports from the EGI accounting portal. The January accounting report generated by CRIC for the T1s has been sent around for people to check and provide feedback. CRIC accounting reports will become official starting from summer this year. By that time REBUS will have been retired.

Archival Storage WG

Containers WG

CREAM migration TF

See the presentation

Documentation: CreamMigrationTaskForce

dCache upgrade TF

Good progress. Out of 45 dCache sites, 34 have upgraded and 11 still have to. There is still an issue with SRR, for which a fix from the dCache developers is awaited.

DPM upgrade TF

Close to completion. Out of 55 sites, 39 have upgraded and re-configured. 1 site is suspended from operations. 2 still have to upgrade and 5 still have to re-configure. 3 sites have migrated to a different solution and 5 are in the process of migrating to a different solution.
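
As a quick consistency check, the site counts above can be tallied as follows. This is only a sketch; the dictionary keys are shorthand labels introduced here, not official status names.

```python
# Site counts as reported in the DPM upgrade TF status above.
dpm_sites = {
    "upgraded and re-configured": 39,
    "suspended from operations": 1,
    "still to upgrade": 2,
    "still to re-configure": 5,
    "migrated to a different solution": 3,
    "migrating to a different solution": 5,
}

total = sum(dpm_sites.values())
done_fraction = dpm_sites["upgraded and re-configured"] / total

print(total)                    # 55
print(f"{done_fraction:.0%}")   # 71%
```

The categories add up to the 55 sites quoted, with about 71% of them fully upgraded and re-configured.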

Information System Evolution TF

The REBUS retirement plan has been presented to the MB and agreed. In April REBUS will go into read-only mode. Manual injection of storage capacity and storage accounting data in REBUS will be replaced by the accounting validation in CRIC, which is already in production and has been performed by sites monthly since November 2019. WLCG CRIC will become the primary data source for pledges and federation topology. REBUS will be retired at the beginning of summer.

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG

LHCOPN/LHCONE Asia (8-9th March) was re-scheduled to take place during the HSF/WLCG workshop, but as the workshop has been postponed, there will be a half-day virtual meeting instead (on 13th of May)
perfSONAR infrastructure status - version 4.2.3 was released this week
- Release notes: https://www.perfsonar.net/release-notes/version-4-2-3/
- Some users have reported an issue with missing throughput/trace tests after updating; a bugfix release is expected shortly
OSG/WLCG infrastructure
- New dashboards were created to provide a high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- Still work in progress but feedback is welcome
- Dashboards are already showing some of the results of the analytical studies that were performed as part of the SAND project
- The aim is to make it easier to identify new issues that are not easy to spot with the experiments' data management systems (network instabilities that could impact network performance in non-deterministic ways).
Looking into dublin-traceroute (extension of paris-traceroute) and possible integration in perfSONAR
100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Traceability WG

Action list

Creation date	Description	Responsible	Status	Comments

Specific actions for experiments

Creation date	Description	Affected VO	Affected TF/WG	Deadline	Completion	Comments

Specific actions for sites

Creation date	Description	Affected VO	Affected TF/WG	Deadline	Completion	Comments

AOB

Next meeting is scheduled for the 2nd of April

Topic revision: r12 - 2020-03-12 - MaartenLitmaath

