Week of 210222

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The passcode is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Alberto (Monitoring), Andrew P (NL-T1), Andrew (TRIUMF), Ankur (Computing), Borja (Chair, Monitoring), Christoph (CMS), Concezio (LHCb), Daniel (Security), Daniele (CNAF), David B (IN2P3-CC), David C (ATLAS), Dave (FNAL), Darren (RAL), Jens (NDGF), Julia (WLCG), Maarten (ALICE), Marian (Monitoring, Network), Pepe (PIC), Petr (ATLAS), Xavier (KIT), Xin (BNL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Stable production - a few sites with electrical / cooling issues
      • OSG YAML topology format gives sites too much freedom to make mistakes
      • Problems matching OSG downtimes with ATLAS CRIC resources
    • Network issues
      • INFN-T1 - AGLT2 high number of stuck transfers (GGUS:150642) together with BoL FTS 3.9.x issue FTS-1491 caused 100% transfer failures from TAPE - AGLT2 moved to fts3-pilot
      • CERN non-LHCONE subnets - Australia-ATLAS routing problems (GGUS:150596)
    • ARC-CE 6.10.1 is not compatible with HTCondor nordugrid job submission
      • HTCondor unable to obtain job status from ARC-CE: "The job's remote status is unknown"
      • ATLAS kills jobs in unknown state after 4 hours

    • Maarten suggested opening a ticket with OSG about the concerns raised by ATLAS
    • Petr mentioned the big issue is the flexibility of the format, which allows sites to submit wrong information
    • Julia requested a pointer to an open ticket about the issue, so that WLCG can coordinate an effort to contact OSG operations

  • CMS reports (raw view) -
    • Continued high activity
      • ~300k cores (250k production & 50k analysis)
    • No major problems

  • ALICE
    • NTR

  • LHCb reports (raw view) -
    • Activity:
      • MC and WG productions; User jobs.
    • Issues:
      • CERN: ticket (GGUS:150647) for checksum issues, in progress
      • Migration to CTA might start next week, and we might stop CASTOR this week
      • Submissions to HTCondorCEs are no longer problematic (after some tweaking on our side)

    • Petr asked whether LHCb uses TPC in push or pull mode, since ATLAS uses the pull mode with EOS without issues
    • Concezio will ask and take the comment from ATLAS into consideration

Sites / Services round table:

  • ASGC: NC
  • BNL: NTR
  • CNAF: NTR
  • EGI: NC
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NTR
  • KISTI:
  • KIT:
    • Network downtime on Thursday unfortunately was not as smooth as we had hoped. The downtime had to be extended into the evening and one border router was brought into an "unsafe state". We asked Cisco for support and on Saturday we got the advice to reboot the machine, which was done, but there were still some issues that could only be resolved today.
    • ATLAS transfer errors (GGUS:150665) explained by a Rucio misconfiguration: too many recall requests were submitted, which led to many timeouts.
  • NDGF: NTR
  • NL-T1: Nikhef: still working on the arc-ce issues reported last week.
  • NRC-KI: NC
  • OSG: NC
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NC
  • CERN databases: NC
  • GGUS: NTR
  • Monitoring:
    • Distributed final SiteMon availability/reliability reports for January 2021
    • Migration of SiteMon job to CRIC scheduled for Thursday OTG:0062075
      • Original target was last week (delayed due to some inconsistencies found in CRIC)
      • Will probably be postponed until 1st of March to start from a fresh month
    • SAM/ETF - issues with job submission to multiple sites running ARC-CEs via gsiftp/HTCondor-G (Prague, Nikhef, UAM, ECDF, DESY, Lunarc, KIAE) - mostly for ATLAS but also for CMS (for some sites only certain endpoints impacted).
      • Following a successful job submission, a "job evicted" event is reported in the HTCondor-G job log, so it looks like jobs are disappearing.
      • Looking into this further with the help from Nikhef has shown that the jobs are actually running fine, but the status is not propagated correctly.
      • All works fine if using ARC-CE's AREX interface. This issue is not impacting A/R numbers (it increases the unknown fraction; there are no errors reported).

    • Maarten asked how long this has been visible; Marian replied that he started looking into it because of issues with the January A/R report.
    • David Cameron reported that the issue is due to a change introduced in ARC 6.10 to comply with GDPR
      • The DNs of the jobs are now hashed, so queries from older clients return no jobs and the job status is reported incorrectly
      • The recommendation to sites is not to upgrade, or to apply a one-line patch if they already have
    • Maarten mentioned this is very important and needs to be addressed properly, with a target deadline of about 2 months
    • David said the big issue is coordinating the deployment across all of WLCG; Julia mentioned that WLCG ops will help with that
    • Maarten also mentioned we should not keep supporting CREAM for some sites (which would block using the latest HTCondor versions): if a site cannot comply with the requirements, it should stop being used.
    • Issue will be presented and discussed more in detail in the next Operations Coordination meeting (4th of March).
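The client-side mismatch described above can be illustrated with a small hypothetical sketch (this is not actual ARC code; the function names, DN value, and use of SHA-256 are all invented for illustration). If the service stores hashed DNs while an older client still compares plain DNs, the lookup silently matches nothing and the job status appears unknown:

```python
import hashlib

def store_job(jobs, job_id, dn):
    # Hypothetical server side: post-6.10 behaviour, only a hash of the DN is kept
    jobs[job_id] = hashlib.sha256(dn.encode()).hexdigest()

def query_jobs_old_client(jobs, dn):
    # Hypothetical pre-6.10 client: compares the plain DN against the stored
    # (now hashed) value, so no jobs match and the status looks "unknown"
    return [j for j, stored in jobs.items() if stored == dn]

def query_jobs_patched(jobs, dn):
    # Hypothetical patched client: hashes the DN before comparing
    h = hashlib.sha256(dn.encode()).hexdigest()
    return [j for j, stored in jobs.items() if stored == h]

jobs = {}
store_job(jobs, "job-1", "/DC=ch/DC=cern/CN=Some User")
assert query_jobs_old_client(jobs, "/DC=ch/DC=cern/CN=Some User") == []
assert query_jobs_patched(jobs, "/DC=ch/DC=cern/CN=Some User") == ["job-1"]
```

This is why no errors are reported and only the "unknown" fraction of the A/R numbers increases: the query succeeds, it just returns an empty result.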

  • Middleware:
    • ARC CE: issues with the recent release are being followed up by the ARC and HTCondor devs
  • Networks:
    • ASGC network impacted over the weekend by a voltage drop in Zafarana, Egypt (following the submarine cable system incident a few weeks ago, their backup link is still down; it will be available on 28 February)
  • Security: NTR

AOB:

Topic revision: r18 - 2021-02-23 - NikolayVoytishin