WLCG Operations Coordination Minutes, July 2, 2020

Highlights

Agenda

https://indico.cern.ch/event/933764/

Attendance

  • local:
  • remote: Alberto (monitoring), Andrew (TRIUMF), Borja (monitoring), Concezio (LHCb), Dave (FNAL), David (Technion), Felix (ASGC), Gavin (T0), Giuseppe (CMS), Horst (Oklahoma), Johannes (ATLAS), Maarten (ALICE + WLCG), Matt (Lancaster), Nikolay (monitoring), Pedro (monitoring), Stephan (CMS), Vincent (security)
  • apologies:

Operations News

  • The next meeting is planned for Sep 3

Discussion

  • Concezio:
    • the MJF functionality is not critical for LHCb at WLCG sites
    • we do depend on it for Vac and cloud setups, though
  • Maarten:
    • WLCG Ops Coordination is concerned with activities that potentially are
      of interest to more than 1 experiment
    • for MJF that looked to be the case a few years ago, but in the end only
      LHCb wanted to pursue that functionality
    • we have just moved the TF to the page listing the closed TFs (done);
      all its materials remain available for continued use

Special topics

WLCG Critical Services proposal followup

  • There were no objections to going ahead with the proposed changes,
    which can still be fine-tuned further as we implement them
  • To be finalized by autumn

CERN Grid CA OCSP incident

  • After a scheduled intervention on June 24, the CERN Grid CA OCSP service
    became inaccessible from outside CERN (OTG:0057432)
  • Requests to the service were dropped by the CERN perimeter firewall
  • A CREAM CE will try to check a client certificate's status via OCSP
    if an OCSP endpoint is indicated in the certificate's details
    • It appears other CE flavors rely on CRLs only and simply ignore OCSP services
  • Checks of CERN Grid CA certificates were then hanging until a timeout was reached
  • The CREAM client code would time out first, thus failing job submissions that
    used CERN Grid CA certificates
  • This affected the 4 experiments and, through the SAM tests, sites running CREAM
    • Some A/R recomputations may be needed
  • The service was restored about 24h later on June 25
  • Some improvements are foreseen to make a recurrence much less probable

  • Further details here
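For reference, a certificate advertises its OCSP responder (if any) in the Authority Information Access extension, which is what a CE consults before attempting an OCSP check. A hedged shell sketch (using a throwaway self-signed certificate, not a real CERN Grid CA one, and assuming OpenSSL is installed):

```shell
# Generate a throwaway self-signed certificate (it carries no AIA
# extension), standing in for a client certificate.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
    -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem -days 1 2>/dev/null

# Print the OCSP responder URI from the AIA extension, if present.
# A CERN Grid CA certificate would print its OCSP endpoint here;
# our throwaway certificate prints nothing.
openssl x509 -noout -ocsp_uri -in /tmp/demo-cert.pem
```

A CE relying on CRLs only would skip this lookup entirely, which is why the other CE flavors were unaffected by the firewall change.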

Discussion

  • Maarten: the trouble was that requests were not failing quickly;
    as far as we know, if e.g. the connection is refused outright, the failure is not fatal
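The distinction Maarten draws matters because a firewall that silently drops packets makes the client block for its full timeout, whereas a refused connection fails immediately. A minimal Python sketch (a hypothetical probe, not the actual CREAM client code) illustrating the two failure modes:

```python
import socket

def probe(host, port, timeout=2.0):
    """Try to open a TCP connection and classify the failure mode."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # An RST came back immediately: the client can fail fast and,
        # for example, fall back to CRL-only certificate checking.
        return "refused"
    except socket.timeout:
        # Packets silently dropped (as by the CERN perimeter firewall):
        # the client blocks for the full timeout before giving up.
        return "timed out"

# Find a local port with no listener by binding and releasing one,
# so the probe is refused immediately rather than hanging.
s = socket.socket()
s.bind(("127.0.0.1", 0))
free_port = s.getsockname()[1]
s.close()

print(probe("127.0.0.1", free_port))  # fails fast: "refused"
```

A firewalled OCSP endpoint corresponds to the "timed out" branch, which is why the CREAM submissions hung rather than failing cleanly.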

  • Stephan: is only CREAM affected? we saw a ticket implicating an Xrootd service
  • Maarten: please send me the details and I will look into the need for followup
    • after the meeting: it probably was a mistaken inference

SAM migration progress

see the presentation

  • Also see the presentation at next week's GDB
  • There were no comments

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual, no major issues

ATLAS

  • Stable Grid production with up to ~380k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~45k slots from the HLT/Sim@CERN-P1 farm and ~15k slots from Boinc. Occasional additional peaks of 200k job slots from HPCs.
  • Continuing with about 60k job slots used for Folding@Home jobs since 4 April. 50% from ~55 different grid sites via opt-in and 50% at CERN-P1
  • No other major issues apart from the usual storage- or transfer-related problems at sites
  • Finishing the grand unification of production + analysis queues in PanDA in the coming days.
  • All systems recovered quickly from the Oracle/DBonDemand downtime last Saturday - we would appreciate avoiding such downtimes over a weekend next time
  • CTA in production for ATLAS since Monday - still fixing some issues in Rucio/middleware

CMS

  • Covid-19 compute contributions being returned to experiment use
  • main processing activities:
    • Run 2 ultra-legacy Monte Carlo
    • Run 2 pre-UL Monte Carlo
  • migration to Rucio ongoing
    • production of nanoAOD samples configured for PhEDEx being bumped up to complete more quickly

LHCb

  • still running F@H on part of HLT farm
  • large MC requests are coming up, so we are going to reduce this Covid-19-related activity
  • processing (small) samples of lead-lead collisions and lead-neon fixed target collisions
  • grid drained in preparation for last Saturday's CERN Oracle/DBOD outage; DIRAC services and agents were switched off, then back on after the outage; everything went extremely smoothly

Discussion on F@H reductions

  • Maarten: it is perfectly defensible to ramp down resources for F@H,
    as we have already done a lot and we cannot neglect our own duties

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 14 done: 7 ARC, 7 HTCondor
  • 16 sites plan for ARC, 15 are considering it
  • 20 sites plan for HTCondor, 14 are considering it, 8 consider using SIMPLE
  • 14 tickets on hold, to be continued in the coming weeks / months
  • 7 tickets without reply
    • response times possibly affected by COVID-19 measures

dCache upgrade TF

DPM upgrade TF

StoRM upgrade TF

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r11 - 2020-07-03 - MaartenLitmaath