DRAFT

WLCG Operations Coordination Minutes, April 4, 2019

Highlights

Agenda

https://indico.cern.ch/event/810489/

Attendance

  • local:
  • remote:
  • apologies:

Operations News

Special topics

Operational Intelligence

Site Questionnaire

Middleware News

  • Useful Links
  • Baselines/News
    • From the ARC development team: unless there is a very well justified reason, NOBODY should deploy an ARC 5.x CE any longer! The ARC team is working hard on the next major release, the ARC 6 series, which is the RECOMMENDED version for any new ARC CE installation. ARC 6 is currently in pre-release testing and is already deployed at a couple of Nordic production sites; the official release is expected very soon. All information about ARC6 is collected here

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • No major issues
  • CERN: 20k "ghost jobs" discovered on March 12, another 10k on March 20
    • Not visible in MonALISA
    • Running in the batch system, but some found stuck in gdb
    • The number of bad jobs shrank steadily over 2 days
    • Root cause not understood
  • Prague: network issues affecting ALICE jobs since late Feb
    • The site has had to be blocked for many weeks
    • Experts looking into it

ATLAS

  • Smooth Grid production over the last weeks, with ~300k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and analysis, plus a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions, with peaks of ~300k concurrently running job slots and ~15k jobs from BOINC.
  • In the past weeks, distributed analysis was seriously affected by CPU steal on 2 of 5 PanDA/JEDI servers at CERN. CERN IT fixed this by reshuffling the VMs across the hypervisors. This is the 3rd occurrence of serious CPU steal on CERN infrastructure in the past 6 months.
  • Commissioning of the Harvester submission system via PanDA is almost done, apart from a handful of US sites. Commissioning of a new PanDA worker node pilot version is ongoing.
  • DPM: continuous data deletion problems at DPM sites. These were already observed in the past, but in the last 30 days, for example, 15 GGUS tickets were opened about deletion failures at DPM sites vs. 3 for all the rest. Of the 15, 2 concern DOME DPM sites and 13 concern legacy DPM sites (but 8 of these are Apache-related and would also affect the new DOMA version). We use HTTP/WebDAV for deletions since it is much faster than any other method. Is there a problem with the Apache modules on SL6 and CentOS7? See LCGDM-2699 and LCGDM-2783
  • ATLAS and CERN Tape Archive (CTA): interesting and challenging work is ongoing between ATLAS DDM experts and CTA experts to put the first CTA version into pre-production, while discussing possible optimizations for writing and recalling files/datasets. A summary of these discussions will be brought to the WLCG Archival WG.

CMS

  • smooth running, compute systems busy at about 250k cores
    • usual production/analysis mix (80%/20%)
  • first part of the parked B physics data staged back from tape
    • expected processing time about two months
  • heavy-ion data recorded last November need to be re-processed
    • heavy tape data recall during the next couple of weeks
  • 2017 and 2018 Monte Carlo production ongoing
  • tape deletion campaign being prepared
    • expect to start deletion in May
  • EOS metadata lookup limit reached while purging old log files
    • in the process of changing the directory organization to reduce/eliminate excessive file stats

LHCb

  • Smooth running at ~100K jobs
    • MC simulation, Data stripping and user analysis
  • No major problems

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

Archival Storage WG

Update on providing tape info

PLEASE CHECK AND UPDATE THIS TABLE
| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| FNAL | YES | | |
| IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| JINR | YES | | |
| KISTI | YES | | KISTI has been contacted. Will work on it in the second half of September |
| KIT | YES | | |
| NDGF | YES | | NDGF has distributed storage, which complicates the task. Storage usage publishing has been enabled, others will come later |
| NLT1 | YES | | Almost done, waiting for the firewall to be opened, expected within a couple of days |
| NRC-KI | YES | | |
| PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| TRIUMF | YES | | |

One can see all sites integrated in storage space accounting for tapes here

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • IP address ranges are now being read from GOCDB. Web Proxy Auto Discovery is planned to use them to tell apart grid sites that share a GeoIP organization but have their own squids, so that squids can be correctly assigned for WPAD requests from those sites. Currently an error is returned in this situation. This will not impact OSG sites, but there aren't many OSG sites that have this problem.
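
As a rough illustration of the idea above, the following minimal Python sketch shows how per-site IP ranges could be used to pick a site (and its squids) for a requesting address when several sites share one GeoIP organization. It is not the actual WPAD service code: the site names, IP ranges, squid URLs and function names are all hypothetical, and the hard-coded dictionaries stand in for data that would be read from GOCDB.

```python
import ipaddress

# Hypothetical per-site IP ranges, standing in for data read from GOCDB.
SITE_RANGES = {
    "SITE-A": ["192.0.2.0/25"],
    "SITE-B": ["192.0.2.128/25", "2001:db8:b::/48"],
}

# Hypothetical squid URLs registered for each site.
SITE_SQUIDS = {
    "SITE-A": ["http://squid.site-a.example:3128"],
    "SITE-B": ["http://squid.site-b.example:3128"],
}


def site_for_ip(client_ip, site_ranges=SITE_RANGES):
    """Return the site whose most specific range contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    best_site, best_prefix = None, -1
    for site, ranges in site_ranges.items():
        for cidr in ranges:
            net = ipaddress.ip_network(cidr)
            # Only compare addresses and networks of the same IP version.
            if addr.version != net.version:
                continue
            # Prefer the longest (most specific) matching prefix.
            if addr in net and net.prefixlen > best_prefix:
                best_site, best_prefix = site, net.prefixlen
    return best_site


def squids_for_ip(client_ip):
    """Return the squids of the matching site, or an empty list if none match."""
    return SITE_SQUIDS.get(site_for_ip(client_ip), [])
```

In this sketch, an address that falls in none of the registered ranges yields an empty squid list, which a caller could map to the fallback or error behaviour of its choice.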

Traceability WG

Container WG

Action list

| Creation date | Description | Responsible | Status | Comments |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

AOB

Topic revision: r8 - 2019-04-04 - KonradKlimaszewski
 