WLCG Operations Coordination Minutes, April 4, 2019
Highlights
Agenda
https://indico.cern.ch/event/810489/
Attendance
- local: Alessandro (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Stephan (CMS)
- remote: Alessandra D (Napoli), Catherine (IN2P3 + LPSC), Dave (FNAL), Di (TRIUMF), Dmitry (?), Eric (IN2P3-CC), Felix (ASGC), Johannes (ATLAS), Konrad (LHCb), Marcelo (CNAF), Panos (WLCG), Ron (NLT1)
- apologies:
Operations News
- WLCG Operations, together with the DOMA QoS working group, is preparing the questionnaire to be sent to the sites in the coming weeks. We are counting on the experiments to engage sites and help us ensure that they take part in the survey.
- The joint WLCG-EGI workshop dedicated to the migration of sites from CREAM to HTCondor, ARC or no-CE solutions will take place on the 7th of May during the EGI conference in Amsterdam. The agenda page is not yet available; we will send the link around as soon as it is ready. We invite sites that are planning to migrate from CREAM CE to join the workshop, at least remotely.
- The next meeting is planned for May 16
- Please let us know if that date would be very inconvenient
Special topics
Operational Intelligence
See the presentation
Discussion
No questions
Site Questionnaire
See the presentation
Discussion
- Alessandro:
- to be able to help better, the experiments should see the proposed questions
- are T3 sites also going to be targeted?
- Julia:
- we will involve the experiments regarding the questions
- T3 sites could be included, while the main focus is on the T2 sites
Middleware News
- Useful Links
- Baselines/News
- From the ARC development team: unless there is a very well justified reason, NOBODY should deploy an ARC 5.x CE any longer. The ARC team is working hard on the next major release, the ARC 6 series, which is the RECOMMENDED version for any new ARC CE installation. ARC 6 is currently in pre-release testing and is already deployed at a couple of Nordic production sites; the official release is expected very soon. All information about ARC 6 is collected here
- There was a discussion about raising the frontier-squid baseline to version 3.5.28-3.1. Dave Dykstra will send the proposal and rationale to the wlcg-ops-coordination mailing list for further discussion there.
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High activity on average
- No major issues
- CERN: 20k "ghost jobs" discovered on March 12, another 10k on March 20
- Not visible in MonALISA
- Running in the batch system, but some were found stuck in gdb
- The number of bad jobs steadily shrank over 2 days
- Root cause not understood
- Prague: network issues affecting ALICE jobs since late Feb
- The site has had to be blocked for many weeks
- Experts looking into it
ATLAS
- Smooth Grid production over the last weeks, with ~300k concurrently running grid job slots handling the usual mix of MC generation, simulation, reconstruction, derivation production and analysis, plus a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions with peaks of ~300k concurrently running job slots, and ~15k jobs from BOINC.
- In the past weeks, distributed analysis was seriously affected by CPU steal on 2 of the 5 PanDA/JEDI servers at CERN; fixed by CERN IT by reshuffling the VMs on the hypervisors. This is the 3rd occurrence of serious CPU steal on CERN infrastructure in the past 6 months.
- Commissioning of the Harvester submission system via PanDA is almost done, apart from a handful of US sites. Commissioning of a new PanDA worker-node pilot version is ongoing.
- DPM: continuous data-deletion problems at DPM sites. Already observed in the past, but in the last 30 days 15 GGUS tickets were opened about deletion failures at DPM sites vs. 3 for all the rest. Of the 15: 2 concern DOME DPM and 13 legacy DPM sites (but 8 of these are Apache-related and would also impact the new DOME version). We are using HTTP/WebDAV for deletions since it is much faster than any other method. A problem with Apache modules on SL6 and CentOS 7? See LCGDM-2699 and LCGDM-2783
- ATLAS and CERN Tape Archive (CTA): interesting and challenging work ongoing between ATLAS DDM experts and CTA experts to put in pre-production the first CTA version, while discussing possible optimization to write and recall files/datasets. Summary of these discussions will be brought to the WLCG Archival WG.
Discussion
- Stephan: CMS also had deletion issues at 2 DPM sites
- Johannes: as such issues have been ongoing for quite a while, we finally decided to raise the matter here
- Julia: we will follow up with the DPM devs (done)
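The HTTP/WebDAV deletion method mentioned in the ATLAS report can be sketched as a plain HTTP DELETE request per file. This is a minimal illustration only: the host and path below are invented, and a real DPM endpoint would require HTTPS with X.509 (grid certificate) authentication, which is omitted here.

```python
# Minimal sketch of deleting a file over WebDAV, as ATLAS does for DPM.
# Hypothetical endpoint/path; real endpoints need TLS + X.509 auth.
import http.client


def webdav_delete(host, port, path):
    """Issue a WebDAV DELETE for `path` and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("DELETE", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status
    finally:
        conn.close()
```

A 2xx status (typically 204 No Content) indicates a successful deletion; the GGUS tickets above correspond to servers returning errors at this step.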
CMS
- smooth running, compute systems busy at about 250k cores
- usual production/analysis mix (80%/20%)
- first part of the parked B physics data staged back from tape
- expected processing time about two months
- heavy-ion data recorded last November need to be re-processed
- heavy tape data recall during the next couple of weeks
- 2017 and 2018 Monte Carlo production ongoing
- tape deletion campaign being prepared
- expect to start deletion in May
- EOS metadata lookup limit was reached while purging old log files
- in the process of changing directory organization to reduce/eliminate excessive file stats
LHCb
- Smooth running at ~100K jobs
- MC simulation, Data stripping and user analysis
- No major problems
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
- An accounting session was organized during the last GDB. The following topics were discussed: APEL status and plans, EGI accounting portal status and plans, HTCondor accounting, CERN accounting, and accounting data validation.
- The generic solution for HTCondor CE accounting with APEL is ready to be validated by the sites that are planning to migrate to HTCondor CE. Please follow the documentation. In case you need help, please contact Stephen Jones ( sjones@hep.ph.liv.ac.uk ).
Archival Storage WG
Update on providing tape info
PLEASE CHECK AND UPDATE THIS TABLE
Site | Info enabled | Plans | Comments
CERN | YES | |
BNL | YES | |
CNAF | YES | | Space accounting info is integrated in the portal; other metrics are on the way
FNAL | YES | |
IN2P3 | YES | | Space accounting info is integrated in the portal; other metrics are on the way
JINR | YES | |
KISTI | YES | | KISTI has been contacted; will work on it in the second half of September
KIT | YES | |
NDGF | YES | | NDGF has a distributed storage, which complicates the task; storage usage publishing has been enabled, the rest will come later
NLT1 | YES | | Almost done; waiting for opening of the firewall, on the order of a couple of days
NRC-KI | YES | |
PIC | YES | | Space accounting info is integrated in the portal; other metrics are on the way
RAL | YES | | Space accounting info is integrated in the portal; other metrics are on the way
TRIUMF | YES | |
One can see all sites integrated in storage space accounting for tapes here
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
DPM Upgrade Task Force
- EPEL testing of DPM version 1.12 is finishing; the new release will be promoted on Monday. Meanwhile, the pioneer sites Brunel, Lancaster, Prague, Beijing and IRFU have upgraded to 1.12. Those participating in the DOMA TPC campaign are showing very good results.
Monitoring
MW Readiness WG
Network Throughput WG
Squid Monitoring and HTTP Proxy Discovery TFs
- IP address ranges are now being read from GOCDB. Web Proxy Auto-Discovery (WPAD) plans to use them to tell apart grid sites that share a GeoIP organization but have their own squids, so that squids can be correctly assigned to WPAD requests from those sites; currently an error is returned in this situation. This will not impact OSG sites, but there aren't many OSG sites that have this problem anyway.
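The lookup described above amounts to matching a client address against each site's registered networks and returning that site's squid. A minimal sketch, with site names, IP ranges and squid URLs invented for illustration (the real service reads this data from GOCDB):

```python
# Sketch: disambiguate sites sharing a GeoIP organization by matching
# the client IP against per-site network ranges (all data hypothetical).
import ipaddress

# Hypothetical per-site data, as it might be assembled from GOCDB.
SITE_RANGES = {
    "SITE-A": {"networks": ["192.0.2.0/25"], "squid": "http://squid.site-a.example:3128"},
    "SITE-B": {"networks": ["192.0.2.128/25"], "squid": "http://squid.site-b.example:3128"},
}


def squid_for(client_ip):
    """Return the squid URL of the site whose range contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    for info in SITE_RANGES.values():
        if any(addr in ipaddress.ip_network(net) for net in info["networks"]):
            return info["squid"]
    return None  # unknown client: this is where WPAD currently returns an error
```

With ranges available, two sites behind the same GeoIP organization resolve to different squids instead of triggering the error case.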
Traceability WG
Container WG
Action list
Specific actions for experiments
Specific actions for sites
AOB