WLCG Operations Coordination Minutes, Jan 30, 2020

Highlights

  • AVX2 on the WLCG infrastructure. Collecting CPU flags

Agenda

https://indico.cern.ch/event/883512/

Attendance

  • local: Robert Currie (LHCb), Concezio Bozzi (LHCb), Jaroslava Schovancova (CERN)
  • remote: David Cohen (IL-TAU-HEP), Di Qing (Triumf), David Mason (FNAL), Eric Fede (IN2P3), Giuseppe Bagliesi (CMS), Helge Meinhard (CERN), Johannes Elmsheuser (ATLAS), Matthew Steven Doidge (Lancaster), Panos Paparrigopoulos (CERN), Peter Love (ATLAS), Petr Vokac (ATLAS/Prague), Stephan Lammel (CMS)
  • apologies:

Operations News

Special topics

AVX2 on the WLCG infrastructure. Collecting CPU flags

see presentation

Discussion

- David Mason: what is the purpose of collecting AVX2 information in CRIC or other central repository?

- Johannes: it is not realistic to use this info for brokering

- Julia: not for brokering, but rather to assess the situation on the whole WLCG and at particular sites

- Helge questioned the likelihood of a performance gain, since it had not yet been demonstrated in practice

- Johannes: ATLAS did perform testing and a performance gain was not demonstrated

- Julia: has LHCb performed any analysis in this respect?

- Concezio: There had been some attempts for the software trigger, which showed that code optimization would be required in order to obtain a performance gain

- David Mason: the effort to understand the situation with AVX2 and other features is useful. However, it is not clear how this information can be used to direct jobs to the appropriate resources

- Julia: is it possible that the test does not always show correct results, i.e. AVX2 is enabled on the WN but the test reports a false negative?

- Jarka: the flag could be hidden by virtualization. This will be investigated further
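
A minimal illustration of the kind of check involved (a sketch, assuming a Linux worker node; the flags visible inside a VM depend on what the hypervisor passes through, as noted above):

$ grep -c '\bavx2\b' /proc/cpuinfo    # counts logical CPUs whose flags line advertises AVX2; 0 means the flag is not visible
16
$ lscpu | grep -o -m1 avx2            # alternative view of the CPU flags seen by the (virtual) machine
avx2

The "16" above is illustrative output; a false negative of the kind discussed here would correspond to these commands reporting no avx2 flag although the underlying hardware supports it.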

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal to high activity in recent weeks
    • Lots of everything: MC, reconstruction, analysis trains, user jobs
  • No major issues

ATLAS

  • Over the Christmas break, very smooth and very stable Grid production with ~400k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~75k slots from the HLT/Sim@CERN-P1 farm.
  • Started the RAW/DRAW reprocessing campaign with data18 in data/tape carousel mode last Tuesday (21 Jan) at the Tier1s and CERN CTA. The FTS CERN production instance was overloaded, caused by the bulk request injection of tape staging, data rebalancing and data consolidation; the other FTS instances are working fine. This high load caused some Tier1 staging throughput degradation in the past days. The FTS experts significantly improved the situation on the DBonDemand instance on Tuesday. data17 staging will start in a few weeks.
  • No other major issues apart from the usual storage or transfer related problems at sites.
  • Affected by network switch/CEPH incident on Jan 22/23, but speedy recovery by restarting a few systems
  • Discussions with the CTA team to put CTA in production
  • ATLAS discussions about how to move forward with TPC testing - e.g. switch one site in non-gridftp mode and use it for bulk transfers
  • AGIS to CRIC migration in progress
  • Grand unification of PanDA queues on-going: unify separate production and analysis queues for more dynamic job scheduling.

CMS

  • running at about 250k cores during last month
    • usual production/analysis mix (75%/25%)
    • ultra-legacy re-reconstruction of 2017 data almost complete
    • ultra-legacy re-reconstruction of 2018 data progressing well
    • short on disk space at many sites due to changed production pattern, extra cleaning effort underway
  • a first certificate authority has a root certificate with an expiration date in summer 2038, i.e. after the signed 32-bit UNIX timestamp rolls over; older certificate utilities that use a signed 32-bit integer fail, in particular xrootd versions below 4.9.0;
  • CERN network outages required manual intervention to recover; they happened at fortunate times and were thus not too disruptive
  • SSB dashboard switch to MonIT scheduled for February 17th

Comment

There are 11 CAs whose expiration dates are beyond 2038, some of which have already been used in production for WLCG for a few years

For example, these CMS SEs at GRIF:

    grid05.lal.in2p3.fr
    node12.datagrid.cea.fr
    polgrid4.in2p3.fr

Their host certificates are all signed by this CA:

$ openssl x509 -noout -subject -dates -in /etc/grid-security/certificates/AC-GRID-FR-Services.pem 
subject= /C=FR/O=MENESR/OU=GRID-FR/CN=AC GRID-FR Services
notBefore=Sep 30 08:00:00 2016 GMT
notAfter=Sep 30 08:00:00 2040 GMT
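
For illustration, the signed 32-bit rollover happens on 19 January 2038, so any certificate whose notAfter lies beyond that date will trip up tools that still use 32-bit time values (a sketch assuming GNU date and the usual host certificate location):

$ date -u -d @2147483647              # largest timestamp representable in a signed 32-bit integer
Tue Jan 19 03:14:07 UTC 2038
$ end=$(openssl x509 -noout -enddate -in /etc/grid-security/hostcert.pem | cut -d= -f2)
$ [ $(date -d "$end" +%s) -gt 2147483647 ] && echo "notAfter is beyond the 32-bit rollover"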

LHCb

  • Normal activity over past month
    • Mostly MC with Stripping campaign starting in the last week
  • Had to restart some services after the network outage, but there were no major disruptions; this highlighted the reliance on the GitLab service
  • No major issues

Task Forces and Working Groups

Upgrade of the T1 storage instances for TPC

GDPR and WLCG services

Accounting TF

  • Tier 0, 1 and 2 sites took part in the validation of the December accounting data. CERN showed good agreement with the auto-generated data for CPU, disk and tape storage. Good agreement was also demonstrated with the ATLAS and LHCb CPU accounting data.

Archival Storage WG

Containers WG

CREAM migration TF

dCache upgrade TF

  • Out of the 44 dCache sites used by the LHC VOs, 21 still need to migrate. The migration should be accomplished by spring 2020.
  • There is an issue with SRR at some sites, which publish an empty list of data shares. This is being followed up with the dCache experts

Discussion

- Peter Love: would it be possible to accelerate the upgrade?

- Julia: we started at the end of autumn and are progressing pretty well. Sites are participating well. We hope that the majority of sites migrate by the beginning of spring, which would be a good result

- Petr Vokac: There are also STORM sites which would need to migrate and enable SRR.

- Julia: good point. We need to get in touch with Andrea to ask for SRR documentation for STORM, and then we can start with STORM as well. There are not too many sites, so most probably we do not need a task force for it.
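
As a rough illustration of the SRR point above: the SRR is a JSON document published by the storage element, and a site can check whether its report actually lists its data shares. The URL below is only a placeholder (the actual location of the SRR document is site- and implementation-specific), and the key names assume the standard storageservice/storageshares layout of the WLCG SRR schema:

$ curl -s https://se.example.org:3880/api/v1/srr > srr.json        # placeholder URL, adjust to the site's SRR location
$ jq '.storageservice.storageshares | length' srr.json             # 0 here corresponds to the empty data-share list being followed up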

DPM upgrade TF

  • Out of the 55 DPM sites used by the LHC VOs, 5 are left to upgrade and reconfigure, and 6 are upgraded but still have to be reconfigured. This should be accomplished by the end of February.

Information System Evolution TF

  • REBUS functionality in CRIC is being validated by WLCG Project Office (Cath)
  • The CRIC team had a meeting with the MONIT team and agreed on a plan for integrating the MONIT applications with CRIC, as well as on the REBUS retirement plan
  • After validation of the REBUS functionality in CRIC, REBUS will be put into read-only mode (spring this year)
  • All clients using REBUS information should start migrating to CRIC for pledge and federation topology information; please contact cric-devs@cern.ch to coordinate this migration
  • There will be a presentation at the next GDB about REBUS functionality in CRIC and REBUS retirement

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • CERN Networking Week took place 13-17 January (https://wiki.geant.org/display/SIGNGN/4th+SIG-NGN+Meeting)
  • Feedback from LHCOPN/LHCONE workshop (https://indico.cern.ch/event/828520/)
    • The importance of network monitoring was stressed by most of the experiments (covering many topics, from perfSONAR up to requests for detailed packet telemetry)
    • Focus on analytics: better insights into existing results would be beneficial for most of the experiments
    • DOMA project had a dedicated slide on perfSONAR, highlighted it as a very useful diagnostic tool.
    • DUNE is planning to establish a perfSONAR mesh
    • Several experiments mentioned the lack of available/used capacity monitoring
    • Some experiments mentioned a missing API to access the LHCOPN/LHCONE network topologies
  • Next steps and follow-up discussion will take place at LHCOPN/LHCONE Asia (8-9 March)
  • The LHCOPN/LHCONE workshop also had a dedicated session on the future of LHC networking
    • A dedicated TF will be set up to work on packet tagging/pacing and network orchestration in close collaboration with the experiments
  • perfSONAR infrastructure status - please ensure you're running the latest version 4.2.2-1.el7 (a quick version check is sketched after this list)
  • 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
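
Related to the perfSONAR version item above, a quick way to check what a node is running (a sketch assuming an RPM-based toolkit installation; package names can differ between bundles):

$ rpm -qa 'perfsonar*' | sort         # list installed perfSONAR packages with their versions
$ rpm -q perfsonar-toolkit            # on an up-to-date toolkit node this should report version 4.2.2-1.el7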

Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • Next meeting will take place on the 5th of March