WLCG Operations Coordination Minutes, April 2, 2020
Highlights
Agenda
Attendance
- local:
- remote: Aleksandr A (ATLAS), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester), Alexander U (ATLAS), Alexandre Bonvin (Utrecht), Alexei, Andrea (WLCG), Andreas (KIT), Andrew (TRIUMF), Catalin (EGI), Cesare (MPCDF), Christoph (CMS), Concezio (LHCb), Costin (ALICE), Cécile Barbier, Dario (ATLAS), Dave M (FNAL), David B (IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David S (ATLAS), Doug (ATLAS), Eric (IN2P3), Federico (LHCb), Felice (CMS), Giuseppe B (CMS), Giuseppe La Rocca (EGI), Ivan (ATLAS), James (ATLAS), Jeny (FNAL), Johannes (ATLAS), Julia (WLCG), Liz (FNAL + CMS), Maarten (ALICE + WLCG), Marco (Padova), Marian (monitoring + networks), Matt D (Lancaster), Matt V (EGI), Nicolo (ATLAS), Pepe (PIC), Peter (ATLAS), Petr (Prague + ATLAS), Renato (LHCb + CBPF + ROC_LA), Ricardo (SAMPA), Riccardo (WLCG), Rod (ATLAS), Ron (NLT1), Shawn (MWT2 + ATLAS), Stefano (CNAF), Stephan (CMS), Thomas (DESY), Torsten (Wuppertal), Victor (CMS), Vincent (security)
- apologies:
Operations News
- the next meeting is planned for May 7
- please let us know if that date would pose a major inconvenience
Special topics
COVID-19 impact on WLCG operations
WLCG computing resources for COVID-19 research
Note: FH denotes Folding@home
- Federico:
- not convinced running FH would be the best approach, other initiatives might be better
- would we run some amount of it alongside LHCb workloads?
- also depends on the perspective of sites
- we can furnish our expertise in running workloads across the grid
- James:
- sites can directly contribute to other initiatives
- FH is easy to integrate into our workflows
- experiments could direct such jobs to sites that agree
- Thomas:
- are we sure there will be enough work to run?
- Rosetta@home did not have enough work so far
- the FH client is incompatible with other BOINC work!
- Federico: as we cannot know the queue, pilots may just die
- Andreas:
- we should provide a list of running projects
- sites can then pick one before the experiments try to do something
- KIT already doing that for resources above the pledge
- James:
- there are docs in a number of places
- the CERN task force has concluded that FH would be the best option so far
- David S: we should not just run what is possible, it has to be useful
- Julia:
- in principle we could even run a service creating such jobs
- the usefulness of that is not known today
- Dave M:
- we would need to interact with experts of those domains
- OSG and EGI are also running initiatives
- FNAL is already involved there
- Federico:
- EGI are e.g. already running WeNMR (see the presentation)
- WLCG lacks expertise in those areas
- Matt V:
- Alexandre Bonvin will talk about WeNMR
- EGI will have a call with OSG and come back to WLCG
- David Cohen:
- sites will need to know what resource numbers we are talking about
- they may need to get agreement from funding agencies
- Julia: indeed, and we should find the most effective contributions
- Pepe:
- resources are to be used for official purposes
- there is more flexibility for amortized and other HW beyond pledge
- Liz:
- different countries and funding agencies will have different policies
- sites should talk to their funding agencies
- Alessandra F: WLCG cannot enforce anything
- Christoph:
- what sites do with resources beyond pledge is their decision
- for running jobs in question on pledged resources we would need to know:
- what fraction?
- which application(s)?
- through which channel(s)?
- Alessandra F:
- the best application is currently unknown
- here we want to decide what we can do using the experiment infrastructures
- and avoid unnecessary duplication of efforts
- Costin: an experiment can reach all its sites
- Federico:
- it is not for us to operate the application(s)
- biomed people should do that
- Alessandra F: some interaction with people from WHO etc. might be needed
- James:
- the CERN task force are doing that
- for now, FH was the only concrete proposal
- Costin: in order not to waste effort, can we go ahead?
- Maarten:
- we have to be careful there
- small-scale proofs of concept are OK at this stage
- bigger activities could e.g. lead to issues between sites and funding agencies
- we do not have a full plan at this time
- Johannes:
- in the experiments we can control the scale of these activities
- and we could already use unpledged resources like the online farm
- Christoph:
- experiments cannot control the use of unpledged resources at sites
- several sites are already using unpledged resources for related purposes
- David S:
- we can come to a suggestion for how to run things
- and avoid unnecessary duplication
- Dave M: WLCG can do the communication part
- Julia: we will follow up in our own task force
Follow up comments after the meeting
Simone Campana could not join the meeting because of overlap with another meeting he had to attend. There are a few follow-up comments/clarifications from Simone:
- Why FH? As James Catmore pointed out, WLCG picked FH because the Fight-against-COVID TF at CERN recommended it; further discussions are ongoing there.
- Concerns of the sites if they use resources allocated to LHC for COVID-19 research: at the moment we agreed to do this at the Citizen Science level, again as recommended by the TF, i.e. a few thousand cores. Even at 10k cores this is 1% of WLCG, so we do not expect an impact on WLCG activities, considering also that the experiments normally benefit from about 20% beyond the pledges. I will mention this activity at the next RRB and ask the Funding Agencies for feedback. The situation is different if a site or a country decided to dedicate a large fraction of its resources to some initiative: we allow that flexibility, but the site or group of sites should document it and explain it to their funding agency.
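The scale argument above can be sanity-checked in a couple of lines. Note that the total WLCG core count used here is an assumed order of magnitude implied by "10k cores is about 1%", not a figure stated in the minutes:

```python
# Rough sanity check of the Citizen Science scale argument.
wlcg_cores = 1_000_000      # assumption: O(1M) cores across WLCG, implied by "10k ~ 1%"
covid_cores = 10_000        # upper end of the scale discussed
beyond_pledge = 0.20        # typical beyond-pledge headroom cited above

covid_share = covid_cores / wlcg_cores
print(f"COVID-19 share of WLCG: {covid_share:.0%}")    # 1%
print(f"beyond-pledge headroom: {beyond_pledge:.0%}")  # 20%
```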
EGI initiatives. HADDOCK application.
presentation
- Alexandre:
- we are talking to OSG to see if our jobs can run there as well
- our computing model has been opportunistic so far
- sites decide if they want to support us e.g. for backfilling
- the work volume depends on the user activity
- it also is limited by the scalability of the portal(s)
- James: have you contacted the CERN task force?
- Alexandre:
- not yet
- at this time we are not limited by computing resources
- we can flag jobs that are related to COVID-19 research
- Andreas: whom to approach for such jobs?
- Alexandre:
- first enable the enmr.eu VO on your resources
- we do not depend on CVMFS today, as we found it unreliable at several sites
- instead, our jobs bring their payload of 1 to 20 MB in their input sandbox
- the job output is typically around 5 to 20 MB
- jobs have typically been short
- through DIRAC we can make them longer with larger outputs
- each site supporting these jobs will need to be enabled in DIRAC
- if desired, the site can be tagged to receive only jobs related to COVID-19
- Julia: in the meeting between EGI and OSG, is there a WLCG representative?
- Matt V: not yet, but we will follow up on that
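The enmr.eu job profile described above (payloads of 1 to 20 MB carried in the input sandbox, outputs of 5 to 20 MB, optional COVID-19 tagging per site) can be illustrated with a minimal sketch that assembles a JDL-style job description. The executable, file names, and tag value are hypothetical placeholders, not the actual WeNMR job definitions:

```python
# Sketch: build a JDL-style description for a small-sandbox job as
# discussed above. All concrete names here are illustrative.
def make_jdl(executable, input_files, output_files, tag=None):
    """Return a JDL snippet; sandboxes are kept small (payloads ~1-20 MB)."""
    def lst(xs):
        return "{" + ", ".join('"%s"' % x for x in xs) + "}"
    lines = [
        'Executable = "%s";' % executable,
        "InputSandbox = %s;" % lst(input_files),
        "OutputSandbox = %s;" % lst(output_files),
    ]
    if tag:  # e.g. a site-level tag to receive only COVID-19 related work
        lines.append("Tags = %s;" % lst([tag]))
    return "[\n  " + "\n  ".join(lines) + "\n]"

jdl = make_jdl("run_haddock.sh",
               ["run_haddock.sh", "payload.tgz"],   # ~1-20 MB input sandbox
               ["result.tgz"],                      # ~5-20 MB output
               tag="COVID19")
print(jdl)
```

In practice such descriptions would be submitted through DIRAC, where each supporting site has to be enabled as noted above.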
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual so far, despite COVID-19 measures everywhere!
- Thanks very much to the site admins!
- Current emphasis is on data analysis, which requires little additional disk space.
- Productions that need a lot of disk space are postponed until pledges are available.
ATLAS
- no Covid-19 related problems so far
- Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the CERN-P1 farm. Occasional additional bursts of ~100k jobs from NERSC/Cori.
- Finishing the RAW/DRAW reprocessing campaign in data/tape carousel mode with data15 within the next week.
- No other major issues apart from the usual storage or transfer related problems at sites.
- Feedback on APEL accounting question: keep it simple !
- Grand unification of PanDA queues ongoing, and tests of non-gridFTP TPC in production
- Feedback on Google CA bundle for TPC to GCS: CloudStorageIntegration - will move ahead with it.
- Would like to raise criticality of services for CEPH and DBoD to 8,9
CMS
- no Covid-19 related interruptions of the CMS computing infrastructure so far
- jumbo frame issue at CERN impacting several sites, INC:2355684
- after network maintenance, March 11th, OTG:0054668
- we expected this to be corrected quickly, does anybody know what the issue is?
- running at about 250k cores during last month
- usual production/analysis mix (80%/20%)
- ultra-legacy re-reconstruction of 2016 in validation
- Run 2 Monte Carlo production is largest activity, large batch of Phase-2 events delivered
Discussion
- Maarten:
- the jumbo frame matter does not seem easy to resolve
- the ticket is currently waiting for input from the affected site
- Liz:
- an issue with jumbo frames already hit us in the middle of Run 2
- this could be wider than 1 site
- Stephan:
- at the moment this is not a big problem, affecting a limited area of work
- we would like to have a solution, even if it implies changes on our side
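For this kind of debugging, a classic first check is whether a maximum-size, non-fragmentable packet survives the path (e.g. `ping -M do -s 8972 <host>` on Linux for a 9000-byte MTU). The payload sizes follow from simple header arithmetic; the sketch below uses generic IPv4/ICMP numbers, not anything specific to the CMS incident:

```python
# Largest ICMP echo payload that fits in a single frame for a given MTU:
# subtract the IPv4 header (20 bytes) and the ICMP header (8 bytes).
def max_ping_payload(mtu, ip_header=20, icmp_header=8):
    return mtu - ip_header - icmp_header

print(max_ping_payload(9000))  # 8972 for 9000-byte jumbo frames
print(max_ping_payload(1500))  # 1472 for the standard Ethernet MTU
```

If the jumbo-sized probe is dropped while the standard-sized one passes, some hop on the path is not honouring the larger MTU.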
LHCb
see here
Task Forces and Working Groups
GDPR and WLCG services
- Updated list of services
- Detailed discussion of how to enable the privacy notice for all our services has been postponed. We will have a dedicated meeting with experiment contacts, most probably next Thursday
Accounting TF
- T1 reports generated by CRIC were sent around for validation, T2 reports will be sent for March
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 5 done: 2 ARC, 3 HTCondor
- 18 sites plan for ARC, 12 are considering it
- 22 sites plan for HTCondor, 14 are considering it, 7 consider using SIMPLE
- 15 tickets on hold, to be continued in a number of months
- 14 tickets without reply
- response times possibly affected by COVID-19 measures
dCache upgrade TF
- 34 sites are running versions > 5.2.0
http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dcache&version=5.
- 9 to go; some of them had planned an upgrade but postponed it due to COVID-19
- 2 plan to move to DPM
Discussion
- Maarten: nowadays it does not seem a good idea for sites to move to DPM
- Julia: we will follow up with them
- Stephan:
- one of those sites is a CMS site that already had a DPM
- they want to consolidate their grid storage into just one system
DPM upgrade TF
- 34 sites upgraded and reconfigured with DOME
http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dpm&version=DOME&show_11=0&show_18=0
Out of those, 15 are running 1.13.2 with DOME
- 6 upgraded but without DOME yet; they are working on it
- 1 to upgrade and re-configure, in progress
- 1 site is suspended for operations
- 9 moving away from DPM
Information System Evolution TF
- REBUS has been in read-only mode since the beginning of April. Pages for editing information have been redirected to CRIC
- Thanks a lot to Federico for providing an API from DIRAC for the LHCb topology information. It will be used by CRIC and Storage Space Accounting
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- perfSONAR infrastructure status - 4.2.3 and 4.2.4 versions were released
- OSG/WLCG infrastructure
- New dashboards are now available providing high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- The aim is to make it easier to identify new issues that are hard to spot in the experiments' data management systems (network instabilities that could impact network performance).
- Started identifying interesting cases showing up in the new dashboards, documenting them and following up
- ESnet (router) traffic feed now available, working on its integration to our pipeline
- Also started working on integration of the OSG HTCondor jobs statistics (network related) - will be added to our pipeline and stream
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
Topic revision: r27 - 2020-04-06 - LorneL