DRAFT
WLCG Operations Coordination Minutes, April 4, 2019
Highlights
Agenda
https://indico.cern.ch/event/810489/
Attendance
- local:
- remote:
- apologies:
Operations News
Special topics
Operational Intelligence
Site Questionnaire
Middleware News
- Useful Links
- Baselines/News
- From the ARC development team: unless there is a very well justified reason, NOBODY should deploy an ARC 5.x CE any longer. The ARC team is working hard on the next major release, the ARC 6 series, which is the RECOMMENDED version for any new ARC CE installation. ARC 6 is currently in pre-release testing and is already deployed at a couple of Nordic production sites; the official release is expected very soon. All information about ARC 6 is collected here
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High activity on average
- No major issues
- CERN: 20k "ghost jobs" discovered on March 12, another 10k on March 20
- Not visible in MonALISA
- Running in the batch system, but some found stuck in gdb
- The number of bad jobs steadily shrank over 2 days
- Root cause not understood
- Prague: network issues affecting ALICE jobs since late Feb
- The site has had to be blocked for many weeks
- Experts looking into it
ATLAS
- Smooth Grid production over the last weeks, with ~300k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and analysis, plus a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions, with peaks of ~300k concurrently running job slots, and ~15k jobs from BOINC.
- In the past weeks distributed analysis was seriously affected by CPU steal on 2 of 5 PanDA/JEDI servers at CERN; this was fixed by CERN IT by reshuffling the VMs on the hypervisors. This is the 3rd occurrence of serious CPU steal on the CERN infrastructure in the past 6 months.
- Commissioning of the Harvester submission system via PanDA is almost done, apart from a handful of US sites. Commissioning of a new PanDA worker-node pilot version is ongoing.
- DPM: continuous data deletion problems at DPM sites. Already observed in the past, but e.g. in the last 30 days 15 GGUS tickets were opened about deletion failures at DPM sites vs. 3 for all the rest. Out of the 15: 2 DOME DPM and 13 legacy DPM sites (but 8 of these are Apache related and would also impact the new DOME version). We are using HTTP/WebDAV for deletions since it is much faster than any other method; a minimal illustration is sketched after this list. Possibly a problem with the Apache modules on SL6 and CentOS 7? See LCGDM-2699 and LCGDM-2783
- ATLAS and CERN Tape Archive (CTA): interesting and challenging work ongoing between ATLAS DDM experts and CTA experts to put in pre-production the first CTA version, while discussing possible optimization to write and recall files/datasets. Summary of these discussions will be brought to the WLCG Archival WG.
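As a minimal illustration of the WebDAV deletion path mentioned in the DPM item above: a hedged sketch only, not the actual ATLAS/Rucio deletion machinery; the proxy path, CA directory, endpoint and file name are invented for the example.

```python
# Hedged sketch only: issue a WebDAV DELETE over HTTPS with an X.509 proxy.
# The proxy path, CA directory and file URL below are invented examples,
# not values taken from the report.
import requests

PROXY = "/tmp/x509up_u1000"                   # X.509 proxy (cert + key in one PEM)
CA_DIR = "/etc/grid-security/certificates"    # CA dir (must be c_rehash-processed for requests)
URL = "https://dpm.example.org/dpm/example.org/home/atlas/user/somefile.root"

# A WebDAV deletion is just an HTTP DELETE on the file URL.
resp = requests.delete(URL, cert=PROXY, verify=CA_DIR)
print(resp.status_code)    # typically 204 No Content on success
resp.raise_for_status()
```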
CMS
- smooth running, compute systems busy at about 250k cores
- usual production/analysis mix (80%/20%)
- first part of the parked B physics data staged back from tape
- expected processing time about two months
- heavy-ion data recorded last November need to be re-processed
- heavy tape data recall during the next couple of weeks
- 2017 and 2018 Monte Carlo production ongoing
- tape deletion campaign being prepared
- expect to start deletion in May
- EOS metadata lookup limit reached while purging old log files
- in the process of changing the directory organization to reduce/eliminate excessive file stats (see the sketch after this list)
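A hedged sketch of the kind of reorganization meant in the last item: hashing file names into two-level subdirectories so that no single directory accumulates enough entries to make stat/listing operations expensive. The base path and layout are assumptions for illustration, not the actual CMS EOS scheme.

```python
# Illustration only: spread files over hashed subdirectories so that no single
# directory grows large enough to make metadata (stat/listing) operations slow.
# The base path and the two-level xx/yy layout are assumptions, not the actual
# CMS EOS organization.
import hashlib
import os

BASE = "/eos/cms/store/logs"   # hypothetical base path

def hashed_path(filename: str) -> str:
    """Map a file name to <BASE>/<h[0:2]>/<h[2:4]>/<filename>."""
    h = hashlib.sha1(filename.encode()).hexdigest()
    return os.path.join(BASE, h[:2], h[2:4], filename)

# prints something like /eos/cms/store/logs/ab/cd/job_12345.log.tar.gz
print(hashed_path("job_12345.log.tar.gz"))
```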
LHCb
- Smooth running at ~100K jobs
- MC simulation, Data stripping and user analysis
- No major problems
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Update on providing tape info
PLEASE CHECK AND UPDATE THIS TABLE
| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| FNAL | YES | | |
| IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| JINR | YES | | |
| KISTI | YES | | KISTI has been contacted. Will work on it in the second half of September |
| KIT | YES | | |
| NDGF | YES | | NDGF has distributed storage, which complicates the task. Storage usage publishing has been enabled; the other metrics will come later |
| NLT1 | YES | | Almost done; waiting for the firewall to be opened, a matter of a couple of days |
| NRC-KI | YES | | |
| PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| TRIUMF | YES | | |
One can see all sites integrated in storage space accounting for tapes here
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
Squid Monitoring and HTTP Proxy Discovery TFs
- IP address ranges are now being read from GOCDB. This is planned to be used by Web Proxy Auto Discovery (WPAD) to tell apart grid sites that share a GeoIP organization but have their own squids, so that squids can be correctly assigned for WPAD requests from those sites; currently an error is returned in this situation. A sketch of the intended matching is shown below. This will not impact OSG sites, but there aren't many OSG sites that have this problem.
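A hedged sketch of that matching, assuming hard-coded, made-up site IP ranges and squid hosts (in the real service the ranges come from GOCDB and the result is served to clients via WPAD):

```python
# Illustration only: pick the squids of the site whose registered IP ranges
# contain the client's address. Site names, ranges and squid URLs are made up;
# the real service reads the ranges from GOCDB.
import ipaddress

SITE_RANGES = {
    "SITE-A": ["192.0.2.0/24"],
    "SITE-B": ["198.51.100.0/24", "203.0.113.0/25"],
}
SITE_SQUIDS = {
    "SITE-A": ["http://squid.site-a.example:3128"],
    "SITE-B": ["http://squid1.site-b.example:3128",
               "http://squid2.site-b.example:3128"],
}

def squids_for(client_ip: str):
    """Return the squid list of the site whose ranges contain client_ip, else None."""
    addr = ipaddress.ip_address(client_ip)
    for site, ranges in SITE_RANGES.items():
        if any(addr in ipaddress.ip_network(r) for r in ranges):
            return SITE_SQUIDS[site]
    return None

print(squids_for("198.51.100.42"))   # -> the SITE-B squids
```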
Traceability WG
Container WG
Action list
Specific actions for experiments
Specific actions for sites
AOB