WLCG Operations Coordination Minutes, April 4, 2019

Highlights

Agenda

https://indico.cern.ch/event/810489/

Attendance

  • local: Alessandro (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Stephan (CMS)
  • remote: Alessandra D (Napoli), Catherine (IN2P3 + LPSC), Dave (FNAL), Di (TRIUMF), Dmitry (?), Eric (IN2P3-CC), Felix (ASGC), Johannes (ATLAS), Konrad (LHCb), Marcelo (CNAF), Panos (WLCG), Ron (NLT1)
  • apologies:

Operations News

  • WLCG Operations, together with DOMA QoS, is preparing a questionnaire to be sent to the sites in the coming weeks. We are counting on the experiments to engage sites and help ensure that they take part in the survey.
  • The joint WLCG-EGI workshop dedicated to the migration of sites from CREAM to HTCondor, ARC or no-CE solutions will take place on the 7th of May during the EGI conference in Amsterdam. The agenda page is not yet available. We will send the link around as soon as it is ready. We invite sites that are planning to migrate from CREAM CE to join the workshop, at least remotely.

  • The next meeting is planned for May 16
    • Please let us know if that date would be very inconvenient

Special topics

Operational Intelligence

See the presentation

Discussion

No questions

Site Questionnaire

See the presentation

Discussion

  • Alessandro:
    • to be able to help better, the experiments should see the proposed questions
    • are T3 sites also going to be targeted?
  • Julia:
    • we will involve the experiments regarding the questions
    • T3 sites could be included, while the main focus is on the T2 sites

Middleware News

  • Useful Links
  • Baselines/News
    • From the ARC development team: unless there is a very well justified reason, NOBODY should deploy an ARC 5.x CE any longer! The ARC team is working hard on the next major release, the ARC 6 series, which is the RECOMMENDED version for new ARC CE installations. ARC 6 is currently in pre-release testing and already deployed at a couple of Nordic production sites; the official release is expected very soon. All information about ARC 6 is collected here
    • There was a discussion about increasing the baseline frontier-squid to version 3.5.28-3.1. Dave Dykstra will send the proposal & rationale to the wlcg-ops-coordination mailing list to discuss it further there.

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • No major issues
  • CERN: 20k "ghost jobs" discovered on March 12, another 10k on March 20
    • Not visible in MonALISA
    • Running in the batch system, but some found stuck in gdb
    • The number of bad jobs shrank steadily over 2 days
    • Root cause not understood
  • Prague: network issues affecting ALICE jobs since late Feb
    • The site has had to be blocked for many weeks
    • Experts looking into it

ATLAS

  • Smooth Grid production over the last weeks, with ~300k concurrently running grid job slots and the usual mix of MC generation, simulation, reconstruction, derivation production and analysis, plus a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions, with peaks of ~300k concurrently running job slots, and ~15k jobs from BOINC.
  • In the past weeks, distributed analysis was seriously affected by CPU steal on 2 of the 5 PanDA/JEDI servers at CERN. CERN IT fixed this by reshuffling the VMs on the hypervisors. This is the 3rd occurrence of serious CPU steal on CERN infrastructure in the past 6 months.
  • Commissioning of the Harvester submission system via PanDA is almost done, apart from a handful of US sites. Commissioning of a new PanDA worker node pilot version is ongoing.
  • DPM: continuous data deletion problems at DPM sites. These were already observed in the past, but in the last 30 days, for example, 15 GGUS tickets were opened about deletion failures at DPM sites vs. 3 for all the rest. Of the 15, 2 concern DOME DPM sites and 13 legacy DPM sites (but 8 of these are Apache related and would also impact the new DOME version). We use HTTP/WebDAV for deletions since it is much faster than any other method. Is there a problem with the Apache modules on SL6 and CentOS7? See LCGDM-2699 and LCGDM-2783
  • ATLAS and CERN Tape Archive (CTA): interesting and challenging work is ongoing between ATLAS DDM experts and CTA experts to put the first CTA version into pre-production, while discussing possible optimizations for writing and recalling files/datasets. A summary of these discussions will be brought to the WLCG Archival WG.

Discussion

  • Stephan: CMS also had deletion issues at 2 DPM sites
  • Johannes: as such issues have been ongoing for quite a while,
    we finally decided to raise the matter here
  • Julia: we will follow up with the DPM devs (done)

CMS

  • smooth running, compute systems busy at about 250k cores
    • usual production/analysis mix (80%/20%)
  • first part of the parked B physics data staged back from tape
    • expected processing time about two months
  • heavy-ion data recorded last November need to be re-processed
    • heavy tape data recall during the next couple of weeks
  • 2017 and 2018 Monte Carlo production ongoing
  • tape deletion campaign being prepared
    • expect to start deletion in May
  • EOS metadata lookup limit reached while purging old log files
    • in the process of changing directory organization to reduce/eliminate excessive file stats

LHCb

  • Smooth running at ~100K jobs
    • MC simulation, Data stripping and user analysis
  • No major problems

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • An accounting session was held during the last GDB. The following topics were discussed: APEL status and plans, EGI accounting portal status and plans, HTCondor accounting, CERN accounting, accounting data validation.
  • The generic solution for HTCondor CE accounting with APEL is ready to be validated by the sites that are planning to migrate to HTCondor CE. Please follow the documentation. If you need help, please contact Stephen Jones ( sjones@hepNOSPAMPLEASE.ph.liv.ac.uk ).

Archival Storage WG

Update of providing tape info

PLEASE CHECK AND UPDATE THIS TABLE
Site Info enabled Plans Comments
CERN YES    
BNL YES    
CNAF YES   Space accounting info is integrated in the portal. Other metrics are on the way
FNAL YES    
IN2P3 YES   Space accounting info is integrated in the portal. Other metrics are on the way
JINR YES    
KISTI YES   KISTI has been contacted. Will work on it in the second half of September
KIT YES    
NDGF YES   NDGF has distributed storage, which complicates the task. Storage usage publishing has been enabled, others will come later
NLT1 YES   Almost done, waiting for the firewall to be opened, expected within a couple of days
NRC-KI YES    
PIC YES   Space accounting info is integrated in the portal. Other metrics are on the way
RAL YES   Space accounting info is integrated in the portal. Other metrics are on the way
TRIUMF YES    

One can see all sites integrated in storage space accounting for tapes here

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

DPM Upgrade Task Force

  • EPEL testing of DPM version 1.12 is finishing. The new release will be promoted on Monday. Meanwhile, the pioneer sites Brunel, Lancaster, Prague, Beijing and IRFU have upgraded to 1.12. Those participating in the DOMA TPC campaign are showing very good results.

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • IP address ranges are now being read from GOCDB. This is planned to be used by Web Proxy Auto Discovery to tell grid sites apart that share a GeoIP organization but have their own squids, so squids can be correctly assigned for WPAD requests from those sites. Currently an error is returned for this situation. This will not impact OSG sites, but there aren't many OSG sites that have this problem.
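
As an illustration of the idea above, here is a minimal sketch of how a client address could be matched against per-site IP ranges to pick the right squid. It is not the actual WPAD service code: the site names, address ranges, squid URLs and function names are all hypothetical, and the ranges are hard-coded here rather than read from GOCDB.

```python
import ipaddress

# Hypothetical data: two sites that share one GeoIP organization,
# each with its own squid. Real ranges would come from GOCDB.
SITE_RANGES = {
    "SITE-A": ["192.0.2.0/25", "2001:db8:a::/48"],
    "SITE-B": ["192.0.2.128/25", "2001:db8:b::/48"],
}
SITE_SQUIDS = {
    "SITE-A": "http://squid.site-a.example:3128",
    "SITE-B": "http://squid.site-b.example:3128",
}


def site_for_address(client_ip, site_ranges=SITE_RANGES):
    """Return the site whose most specific range contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    best = None  # (site name, prefix length) of the best match so far
    for site, cidrs in site_ranges.items():
        for cidr in cidrs:
            net = ipaddress.ip_network(cidr)
            # Only compare addresses and networks of the same IP version.
            if net.version == addr.version and addr in net:
                if best is None or net.prefixlen > best[1]:
                    best = (site, net.prefixlen)
    return best[0] if best else None


def squid_for_address(client_ip):
    """Return the squid URL for the site that owns client_ip, or None."""
    site = site_for_address(client_ip)
    return SITE_SQUIDS.get(site) if site else None
```

With such a lookup, a request from 192.0.2.200 would be assigned the SITE-B squid even though both sites resolve to the same GeoIP organization, and an address outside all known ranges yields no match.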

Traceability WG

Container WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r12 - 2019-04-05 - MaartenLitmaath
 