WLCG Operations Coordination Minutes, March 5, 2020

Highlights

Agenda

https://indico.cern.ch/event/893845/

Attendance

Alessandro Paolini (EGI), David Cohen (IL-HEP Tier-2), Eric Fede (IN2P3), Johannes Elmsheuser (ATLAS), Giuseppe Bagliesi (CMS), Stephan Lammel (CMS), Concezio Bozzi (LHCb), Matthew Steven Doidge (Lancaster), Di Qing (TRIUMF), Pepe Flix (PIC), Ron Trompert (SARA), Andreas Petzold (KIT), Catalin Condurache (EGI), Marian Babik (CERN, ETF, network), Gavin McCance (CERN compute), Mayank Sharma (CERN, WLCG ops), Dan van der Ster (CERN storage), Maarten Litmaath (CERN, ALICE, WLCG ops), Julia Andreeva (CERN, WLCG Ops)

Operations News

CERN databases down March 11 morning

On Wed March 11 from 08:00 CET there will be a network intervention
affecting many hundreds of databases at CERN. In particular:

  • All Oracle databases will be inaccessible for at least a few minutes: OTG:0054761

  • More than 600 DB-on-demand instances will be down: OTG:0054760

Many of those DBs are back-ends for grid and other kinds of services.
The respective service owners should already have been informed.

The downtimes are planned to finish by 09:00 CET,
but unexpected fallout cannot be excluded.

Special topics

CERN CEPH postmortem

See the presentation

Discussion

  • Julia asked the experiments about the impact on their activities and whether the communication had worked well.
  • Johannes for ATLAS: it was clear that the underlying problem was CEPH, and he pointed to a similar CEPH incident a month earlier. Multiple services relying on CEPH storage required manual restarts. ATLAS had considered CEPH an essentially 100% reliable service; as a lesson learned, ATLAS is looking into what can be done on the experiment side to reduce the impact of future CEPH outages.
  • Dan pointed out that the previous incident had a different cause, namely a network problem, and that measures to increase redundancy in such cases have since been taken. Though quite reliable, CEPH is not 100% reliable.
  • Concezio for LHCb: there was an impact on production that led to an inconsistent situation (some files were processed twice, others were not processed at all). He pointed out that it would help if a proper outage were registered in GOCDB, since LHCb workflows take the downtime info in GOCDB into account (a sketch of such a downtime query follows this list).
  • Gavin apologized that this was not done: no downtime was declared for the CERN compute facility in GOCDB.
  • Stephan for CMS: CMS was not hit too badly, since few of its services depend on CEPH. Moreover, a redundant setup at FNAL protects against certain CERN outages. Some systems that depend on AFS, and thereby indirectly on CEPH, suffered, but the impact was limited.
  • All experiments mentioned that, because the problem happened on working days during CERN working hours, the necessary measures were taken quickly. Communication worked well.
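
As noted in the LHCb comment above, workflows can consume downtime information programmatically from GOCDB. The following is a minimal sketch of such a query against the public GOCDB programmatic interface; the site name CERN-PROD is only an example, and the XML field names reflect the GOCDB output as commonly documented, so treat the details as assumptions rather than production LHCb code.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Public GOCDB programmatic interface; returns XML.
    # The site name (CERN-PROD) is only an example.
    URL = ("https://goc.egi.eu/gocdbpi/public/"
           "?method=get_downtime&topentity=CERN-PROD&ongoing_only=yes")

    with urllib.request.urlopen(URL) as resp:
        root = ET.fromstring(resp.read())

    # Each <DOWNTIME> element describes one declared downtime.
    for dt in root.findall("DOWNTIME"):
        severity = dt.findtext("SEVERITY")      # e.g. OUTAGE or WARNING
        desc = dt.findtext("DESCRIPTION")
        end = dt.findtext("FORMATED_END_DATE")  # field name as in the GOCDB XML
        print(f"{severity}: {desc} (until {end})")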

Migration from CREAM. Status overview.

See the presentation

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • Some fallout from the Ceph incident at CERN on Thu Feb 20

ATLAS

  • Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm.
  • Large amounts of derivation production for the Moriond conferences put some pressure on the grid disk space; deletions are expected this week to alleviate the situation.
  • Recovered from the second CERN CEPH incident rather quickly after experts restarted affected systems.
  • No other major issues apart from the usual storage- or transfer-related problems at sites.
  • Continuing the RAW/DRAW reprocessing campaign in data/tape carousel mode: finished with data18 and started with data17 (Feb 19), approaching 50% done. Not using SARA for the moment and staging its share from CERN/CTA. Expect to start data16 in around 2 weeks.
  • Started TPC testing using WebDAV in production with a few sites; this uncovered a couple of storage-related problems and was stopped until those are fixed (a sketch of such a TPC request follows this list).
  • Grand unification of PanDA queues ongoing.
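
For context on the TPC tests above: HTTP third-party copy over WebDAV is driven by an HTTP COPY request that makes one storage endpoint transfer a file directly to or from another. The sketch below illustrates the pull-mode flavour with hypothetical endpoints and a placeholder token; it shows the protocol flow, not ATLAS production tooling.

    import requests

    # Hypothetical endpoints: in pull mode the COPY request is sent to the
    # destination, which then fetches the file from the source by itself.
    source = "https://source-se.example.org/path/to/file"
    dest = "https://dest-se.example.org/path/to/file"

    resp = requests.request(
        "COPY",
        dest,                 # pull-mode TPC: the destination is the active party
        headers={
            "Source": source,  # tells the destination where to pull from
            # Credential for the remote transfer leg (placeholder token):
            "TransferHeaderAuthorization": "Bearer PLACEHOLDER_TOKEN",
        },
        cert=("usercert.pem", "userkey.pem"),  # client X.509 authentication
        stream=True,
    )
    resp.raise_for_status()

    # The server keeps the connection open and streams periodic performance
    # markers until the copy finishes or fails.
    for line in resp.iter_lines():
        print(line.decode())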

CMS

  • global run with first phase-2 detector configuration in mid February
  • running at about 265k cores during last month
    • usual production/analysis mix (80%/20%)
    • ultra-legacy re-reconstruction of 2017 and 2018 data in progress
    • large pre-UL and UL Monte Carlo activities
  • CERN CEPH outage and IPv6 networking issues
    • manual intervention to recover
    • global pool managers moved to Fermilab (revealed performance impact of monitoring queries)
    • both happened at fortunate times, thus not too disruptive

Comments

  • CMS would appreciate follow-up on the IPv6 connectivity issues with web services at CERN. Julia will follow up.

LHCb

  • business as usual, ~100k concurrent jobs
  • Tier0+Tier1: running legacy stripping of Run1+2 pp collision data, 2016 nearly done, 2017 just started, all other years completed
    • 2016 data stripping was impacted by the CERN CEPH incident of February 20, but it was eventually recovered manually
  • stripping of p-Ion and Ion-Ion collision data will follow
  • MC production running elsewhere

Task Forces and Working Groups

Upgrade of the T1 storage instances for TPC

GDPR and WLCG services

Accounting TF

  • Validation of the monthly accounting data has been performed every month since November. We kindly ask site support teams to spend a couple of minutes each month on the validation, in order to improve the accounting data quality and fix possible issues with APEL and WSSA.

  • CRIC is taking over the generation of the monthly accounting reports from the EGI accounting portal. The January accounting report generated by CRIC for the T1 sites has been sent around for people to check and provide feedback on. CRIC accounting reports will become official starting this summer, by which time REBUS will have been retired.

Archival Storage WG

Containers WG

CREAM migration TF

See the presentation

Documentation: CreamMigrationTaskForce

dCache upgrade TF

  • Good progress: out of 45 dCache sites, 34 have upgraded and 11 still have to. There is still an issue with SRR (Storage Resource Reporting), for which a fix from the dCache developers is awaited (a sketch of an SRR sanity check follows below).
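
For context, the SRR is a JSON summary of a storage service that sites publish for operations and accounting. Below is a minimal sketch of fetching and sanity-checking one; the URL is hypothetical and the key names follow the WLCG SRR schema as commonly described, so treat them as assumptions.

    import json
    import urllib.request

    # Illustrative URL: sites publish the SRR as a JSON document on their
    # storage endpoint; the exact path varies per site.
    URL = "https://dcache.example.org/static/srr.json"

    with urllib.request.urlopen(URL) as resp:
        srr = json.load(resp)

    svc = srr["storageservice"]
    for share in svc.get("storageshares", []):
        total = share.get("totalsize", 0)
        used = share.get("usedsize", 0)
        # Flag shares whose reported usage exceeds the reported capacity.
        flag = "  <-- inconsistent" if used > total else ""
        print(f'{share.get("name")}: {used} / {total} bytes{flag}')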

DPM upgrade TF

  • Close to completion. Out of 55 sites, 39 have upgraded and re-configured, 1 site is suspended from operations, 2 still have to upgrade and 5 to re-configure, 3 sites have migrated to a different solution and 5 are in the process of migrating to a different solution.

Information System Evolution TF

  • The REBUS retirement plan has been presented to the MB and agreed. In April REBUS will go into read-only mode. Manual injection of storage capacity and storage accounting data into REBUS will be replaced by the accounting validation in CRIC, which is already in production and has been performed by sites monthly since November 2019. WLCG CRIC will become the primary data source for pledges and federation topology. REBUS will retire at the beginning of the summer.

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • The LHCOPN/LHCONE Asia meeting (March 8-9) was re-scheduled to take place during the HSF/WLCG workshop, but as the workshop has been postponed, there will be a half-day virtual meeting instead (on May 13)
  • perfSONAR infrastructure status: version 4.2.3 was released this week
  • OSG/WLCG infrastructure
    • New dashboards were created to provide a high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
    • Still work in progress but feedback is welcome
    • The dashboards are already showing some of the results of the analytical studies performed as part of the SAND project
    • The aim is to make it easier to identify new issues that are not easy for the experiments' data management systems to spot (network instabilities that could impact network performance in non-deterministic ways); see the measurement-archive query sketch after this list
  • Looking into dublin-traceroute (an extension of paris-traceroute) and its possible integration into perfSONAR
  • 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
  • WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.
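
For those wanting the raw numbers behind such dashboards, measurements can also be queried directly from a perfSONAR host's esmond measurement archive. A minimal sketch with a hypothetical host, assuming the standard esmond REST endpoint and event-type naming:

    import json
    import urllib.request

    # Hypothetical perfSONAR host; esmond exposes measurement metadata via REST.
    HOST = "psonar.example.org"
    URL = (f"https://{HOST}/esmond/perfsonar/archive/"
           "?event-type=packet-loss-rate&time-range=3600")

    with urllib.request.urlopen(URL) as resp:
        metadata = json.load(resp)

    # Each metadata entry describes one measured source/destination pair.
    for m in metadata:
        print(m.get("source"), "->", m.get("destination"))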

Traceability WG

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

  • Next meeting is scheduled for the 2nd of April