WLCG Operations Coordination Minutes, April 2, 2020
Highlights
Agenda
Attendance
- local:
- remote: Aleksandr A (ATLAS), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester), Alexander U (ATLAS), Alexandre Bonvin (Utrecht), Alexei, Andrea (WLCG), Andreas (KIT), Andrew (TRIUMF), Catalin (EGI), Cesare (MPCDF), Christoph (CMS), Concezio (LHCb), Costin (ALICE), Cécile Barbier, Dario (ATLAS), Dave M (FNAL), David B (IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David S (ATLAS), Doug (ATLAS), Eric (IN2P3), Federico (LHCb), Felice (CMS), Giuseppe B (CMS), Giuseppe La Rocca (EGI), Ivan (ATLAS), James (ATLAS), Jeny (FNAL), Johannes (ATLAS), Julia (WLCG), Liz (FNAL + CMS), Maarten (ALICE + WLCG), Marco (Padova), Marian (monitoring + networks), Matt D (Lancaster), Matt V (EGI), Nicolo (ATLAS), Pepe (PIC), Peter (ATLAS), Petr (Prague + ATLAS), Renato (LHCb + CBPF + ROC_LA), Ricardo (SAMPA), Riccardo (WLCG), Rod (ATLAS), Ron (NLT1), Shawn (MWT2 + ATLAS), Stefano (CNAF), Stephan (CMS), Thomas (DESY), Torsten (Wuppertal), Victor (CMS), Vincent (security)
- apologies:
Operations News
- the next meeting is planned for May 7
- please let us know if that date would pose a major inconvenience
Special topics
COVID-19 impact on WLCG operations
WLCG computing resources for COVID-19 research
Note: FH denotes Folding@home
- Federico:
- not convinced running FH would be the best approach, other initiatives might be better
- would we run some amount of it alongside LHCb workloads?
- also depends on the perspective of sites
- we can furnish our expertise in running workloads across the grid
- James:
- sites can directly contribute to other initiatives
- FH is easy to integrate into our workflows
- experiments could direct such jobs to sites that agree
- Thomas:
- are we sure there will be enough work to run?
- Rosetta@home did not have enough work so far
- the FH client is incompatible with other BOINC work!
- Federico: as we cannot know the queue, pilots may just die
- Andreas:
- we should provide a list of running projects
- sites can then pick one before the experiments try to do something
- KIT already doing that for resources above the pledge
- James:
- there are docs in a number of places
- the CERN task force has concluded that FH would be the best option so far
- David S: we should not just run what is possible, it has to be useful
- Julia:
- in principle we could even run a service creating such jobs
- the usefulness of that is not known today
- Dave M:
- we would need to interact with experts of those domains
- OSG and EGI are also running initiatives
- FNAL is already involved there
- Federico:
- EGI are e.g. already running WeNMR (see the presentation)
- WLCG lacks expertise in those areas
- Matt V:
- Alexandre Bonvin will talk about WeNMR
- EGI will have a call with OSG and come back to WLCG
- David Cohen:
- sites will need to know what resource numbers we are talking about
- they may need to get agreement from funding agencies
- Julia: indeed, and we should find the most effective contributions
- Pepe:
- resources are to be used for official purposes
- there is more flexibility for amortized and other HW beyond pledge
- Liz:
- different countries and funding agencies will have different policies
- sites should talk to their funding agencies
- Alessandra F: WLCG cannot enforce anything
- Christoph:
- what sites do with resources beyond pledge is their decision
- for running jobs in question on pledged resources we would need to know:
- what fraction?
- which application(s)?
- through which channel(s)?
- Alessandra F:
- the best application is currently unknown
- here we want to decide what we can do using the experiment infrastructures
- and avoid unnecessary duplication of efforts
- Costin: an experiment can reach all its sites
- Federico:
- it is not for us to operate the application(s)
- biomed people should do that
- Alessandra F: some interaction with people from WHO etc. might be needed
- James:
- the CERN task force are doing that
- for now, FH was the only concrete proposal
- Costin: in order not to waste effort, can we go ahead?
- Maarten:
- we have to be careful there
- small-scale proofs of concept are OK at this stage
- bigger activities could e.g. lead to issues between sites and funding agencies
- we do not have a full plan at this time
- Johannes:
- in the experiments we can control the scale of these activities
- and we could already use unpledged resources like the online farm
- Christoph:
- experiments cannot control the use of unpledged resources at sites
- several sites are already using unpledged resources for related purposes
- David S:
- we can come to a suggestion for how to run things
- and avoid unnecessary duplication
- Dave M: WLCG can do the communication part
- Julia: we will follow up in our own task force
Follow up comments after the meeting
Simone Campana could not join the meeting because of overlap with another meeting he had to attend. There are a few follow-up comments/clarifications from Simone:
- Why FH? As James Catmore pointed out, WLCG picked FH because the Fight-against-COVID TF at CERN recommended it; further discussions are ongoing there.
- Concerns of the sites if they use resources allocated to LHC for COVID-19 research: at the moment we agreed to do this at the Citizen Science level, again as recommended by the TF, i.e. a few thousand cores. Even at 10k cores this is 1% of WLCG, so we do not expect an impact on WLCG activities, considering also that the experiments normally benefit from about 20% beyond the pledges. I will mention this activity at the next RRB and ask the Funding Agencies for feedback. The situation is different if a site or a country decided to dedicate a large fraction of its resources to some initiative: we allow that flexibility, but the site or group of sites should document it and explain it to their funding agency.
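The scale argument above can be sanity-checked in a couple of lines. Note that the total WLCG core count used here is an assumed order of magnitude implied by "10k cores is about 1%", not a figure stated in the minutes:

```python
# Rough sanity check of the Citizen Science scale argument.
wlcg_cores = 1_000_000      # assumption: O(1M) cores across WLCG, implied by "10k ~ 1%"
covid_cores = 10_000        # upper end of the scale discussed
beyond_pledge = 0.20        # typical beyond-pledge headroom cited above

covid_share = covid_cores / wlcg_cores
print(f"COVID-19 share of WLCG: {covid_share:.0%}")    # 1%
print(f"beyond-pledge headroom: {beyond_pledge:.0%}")  # 20%
```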
EGI initiatives. HADDOCK application.
presentation
- Alexandre:
- we are talking to OSG to see if our jobs can run there as well
- our computing model has been opportunistic so far
- sites decide if they want to support us e.g. for backfilling
- the work volume depends on the user activity
- it also is limited by the scalability of the portal(s)
- James: have you contacted the CERN task force?
- Alexandre:
- not yet
- at this time we are not limited by computing resources
- we can flag jobs that are related to COVID-19 research
- Andreas: whom to approach for such jobs?
- Alexandre:
- first enable the enmr.eu VO on your resources
- we do not depend on CVMFS today, as we found it unreliable at several sites
- instead, our jobs bring their payload of 1 to 20 MB in their input sandbox
- the job output is typically around 5 to 20 MB
- jobs have typically been short
- through DIRAC we can make them longer with larger outputs
- each site supporting these jobs will need to be enabled in DIRAC
- if desired, the site can be tagged to receive only jobs related to COVID-19
- Julia: in the meeting between EGI and OSG, is there a WLCG representative?
- Matt V: not yet, but we will follow up on that
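The enmr.eu job profile described above (payloads of 1 to 20 MB carried in the input sandbox, outputs of 5 to 20 MB, optional COVID-19 tagging per site) can be illustrated with a minimal sketch that assembles a JDL-style job description. The executable, file names, and tag value are hypothetical placeholders, not the actual WeNMR job definitions:

```python
# Sketch: build a JDL-style description for a small-sandbox job as
# discussed above. All concrete names here are illustrative.
def make_jdl(executable, input_files, output_files, tag=None):
    """Return a JDL snippet; sandboxes are kept small (payloads ~1-20 MB)."""
    def lst(xs):
        return "{" + ", ".join('"%s"' % x for x in xs) + "}"
    lines = [
        'Executable = "%s";' % executable,
        "InputSandbox = %s;" % lst(input_files),
        "OutputSandbox = %s;" % lst(output_files),
    ]
    if tag:  # e.g. a site-level tag to receive only COVID-19 related work
        lines.append("Tags = %s;" % lst([tag]))
    return "[\n  " + "\n  ".join(lines) + "\n]"

jdl = make_jdl("run_haddock.sh",
               ["run_haddock.sh", "payload.tgz"],   # ~1-20 MB input sandbox
               ["result.tgz"],                      # ~5-20 MB output
               tag="COVID19")
print(jdl)
```

In practice such descriptions would be submitted through DIRAC, where each supporting site has to be enabled as noted above.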
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual so far, despite COVID-19 measures everywhere!
- Thanks very much to the site admins!
- Current emphasis is on data analysis, which requires little additional disk space.
- Productions that need a lot of disk space are postponed until pledges are available.
ATLAS
- no Covid-19 related problems so far
- Smooth and stable Grid production with ~430k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the CERN-P1 farm. Occasional additional bursts of ~100k jobs from NERSC/Cori.
- Finishing the RAW/DRAW reprocessing campaign in data/tape carousel mode with data15 within the next week.
- No other major issues apart from the usual storage or transfer related problems at sites.
- Feedback on APEL accounting question: keep it simple !
- Grand unification of PanDA queues ongoing, and tests of non-gridFTP TPC in production
- Feedback on Google CA bundle for TPC to GCS: CloudStorageIntegration - will move ahead with it.
- Would like to raise criticality of services for CEPH and DBoD to 8,9
CMS
- no Covid-19 related interruptions of the CMS computing infrastructure so far
- jumbo frame issue at CERN impacting several sites, INC:2355684
- after network maintenance, March 11th, OTG:0054668
- we expected this to be corrected quickly, does anybody know what the issue is?
- running at about 250k cores during last month
- usual production/analysis mix (80%/20%)
- ultra-legacy re-reconstruction of 2016 in validation
- Run 2 Monte Carlo production is largest activity, large batch of Phase-2 events delivered
Discussion
- Maarten:
- the jumbo frame matter does not seem easy to resolve
- the ticket is currently waiting for input from the affected site
- Liz:
- an issue with jumbo frames already hit us in the middle of Run 2
- this could be wider than 1 site
- Stephan:
- at the moment this is not a big problem, affecting a limited area of work
- we would like to have a solution, even if it implies changes on our side
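For this kind of debugging, a classic first check is whether a maximum-size, non-fragmentable packet survives the path (e.g. `ping -M do -s 8972 <host>` on Linux for a 9000-byte MTU). The payload sizes follow from simple header arithmetic; the sketch below uses generic IPv4/ICMP numbers, not anything specific to the CMS incident:

```python
# Largest ICMP echo payload that fits in a single frame for a given MTU:
# subtract the IPv4 header (20 bytes) and the ICMP header (8 bytes).
def max_ping_payload(mtu, ip_header=20, icmp_header=8):
    return mtu - ip_header - icmp_header

print(max_ping_payload(9000))  # 8972 for 9000-byte jumbo frames
print(max_ping_payload(1500))  # 1472 for the standard Ethernet MTU
```

If the jumbo-sized probe is dropped while the standard-sized one passes, some hop on the path is not honouring the larger MTU.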
LHCb
see here
Task Forces and Working Groups
GDPR and WLCG services
- Updated list of services
- Detailed discussion of how to enable the privacy notice for all our services has been postponed. We will have a dedicated meeting with experiment contacts, most probably next Thursday
Accounting TF
- T1 reports generated by CRIC were sent around for validation, T2 reports will be sent for March
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 5 done: 2 ARC, 3 HTCondor
- 18 sites plan for ARC, 12 are considering it
- 22 sites plan for HTCondor, 14 are considering it, 7 consider using SIMPLE
- 15 tickets on hold, to be continued in a number of months
- 14 tickets without reply
- response times possibly affected by COVID-19 measures
dCache upgrade TF
- 34 sites are running versions > 5.2.0
http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dcache&version=5.
- 9 to go; some of them had planned an upgrade but postponed it due to COVID-19
- 2 plan to move to DPM
Discussion
- Maarten: nowadays it does not seem a good idea for sites to move to DPM
- Julia: we will follow up with them
- Stephan:
- one of those sites is a CMS site that already had a DPM
- they want to consolidate their grid storage into just one system
DPM upgrade TF
- 34 sites upgraded and reconfigured with DOME
http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dpm&version=DOME&show_11=0&show_18=0
Out of those, 15 are running 1.13.2 with DOME
- 6 upgraded but without DOME yet; they are working on it
- 1 to upgrade and re-configure, in progress
- 1 site is suspended for operations
- 9 moving away from DPM
Information System Evolution TF
- REBUS has been in read-only mode since the beginning of April. Pages for editing information have been redirected to CRIC
- Thanks a lot to Federico for providing an API from DIRAC for the LHCb topology information. It will be used by CRIC and Storage Space Accounting
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- perfSONAR infrastructure status - 4.2.3 and 4.2.4 versions were released
- OSG/WLCG infrastructure
- New dashboards are now available providing high-level overview of packet loss, throughput, latency and traceroutes (https://atlas-kibana.mwt2.org/s/networking/goto/20dd25907d61df98a0b85b1dfaed54e1)
- The aim is to make it easier to identify new issues that are hard to spot in the experiments' data management systems (network instabilities that could impact network performance).
- Started identifying interesting cases showing up in the new dashboards, documenting them and following up
- ESnet (router) traffic feed now available, working on its integration to our pipeline
- Also started working on integration of the OSG HTCondor jobs statistics (network related) - will be added to our pipeline and stream
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
Topic revision: r27 - 2020-04-06 - LorneL