Week of 231211

WLCG Operations Call details

  • The connection details for remote participation are provided on this agenda page.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic that requires input from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch), so that the SCOD can make sure the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Maarten (ALICE + WLCG), Panos (WLCG)
  • remote: Andrew P (NL-T1), Andrew W (TRIUMF), Carmen (CNAF), Christoph (CMS), Daren (RAL), David B (IN2P3-CC), David M (FNAL), Henryk (LHCb + NCBJ), Jens (NDGF), Jose (PIC), Michal (ATLAS), Onno (NL-T1), Petr (ATLAS), Xavier (KIT)

Experiments round table:

  • ATLAS reports (raw view) -
    • Issues:
      • Transfers from FZK-LCG2 fail with SSL errors (GGUS:164597)
      • "Server Error" staging failures at pic (GGUS:164580)
        • there were some restore calls to tape, some of which took about 4-5 hours to process due to data being on the same cartridge
      • "Couldnt connect to server" transfer failures at NDGF-T1 (GGUS:164583)
        • SRM downtime not relevant for ATLAS
        • As this keeps happening, will the SRM entries eventually be cleaned from the GOCDB? (see the GOCDB query sketch after this report)
      • Related to the cleaning of SRM, this is an issue for tape
        • As tape endpoints transition from SRM to the tape REST API, a dedicated service type will be needed in the GOCDB, e.g. "TAPE REST".
          • Maarten pointed out that even though ATLAS is no longer using SRM, other VOs might be; it is up to each site to remove the services from the GOCDB. Whether a new service type should be created needs to be discussed in a DOMA BDT meeting. Michal will contact Petr to bring this to one of those meetings.
      • VOMS randomly returning an error ("User not known to VO") for crons required for ATLAS grid services (a sketch of a cron-side check follows this report)
        • timeline
          • Wed 6/12/2023 00h05 : Issue noticed with VOMS randomly returning errors for crons required for ATLAS grid services (atlas: User unknown to this VO).
          • Wed 6/12/2023 09h42 : Ticket opened with SNOW (INC:3696530).
          • Wed 6/12/2023 10h01 : Told to contact project-lcg-vo-atlas-admin@cern.ch
          • Wed 6/12/2023 10h12 : Shaojun contacted project-lcg-vo-atlas-admin@cern.ch
          • Wed 6/12/2023 18h08 : 14 emails later, Shaojun discovered the issue was appearing on one VOMS node (lcg-voms2.cern.ch) and notified the SNOW ticket again.
          • Thu 7/12/2023 08h41 : Experts responded.
          • Thu 7/12/2023 11h08 : Experts discovered that the DB upgrade on 4/12/2023 had changed some TLS settings and banned the IP address (presumably that of the VOMS server?).
        • feedback
          • The issue could have been solved much earlier if SNOW support had contacted the correct experts early on.
          • Nobody seemed to know who was responsible (or nobody wanted to take responsibility).
          • A lot of time was wasted by many people trying to debug.
          • Even though VOMS is going to be decommissioned: how should such issues be correctly reported and followed up for services that are critical to the running of the grid? It is not clear that GGUS would have been any quicker.
            • Maarten pointed out that for such incidents a GGUS ticket should be created to ensure quicker handling.
            • When to use GGUS and when to use SNOW is documented on this page.
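
As a follow-up to the SRM cleaning discussion in the ATLAS report, the sketch below shows one way to list the SRM endpoints a site still publishes in the GOCDB, using the public GOCDB Programmatic Interface. This is a minimal illustration only: the site name is just an example, and the exact query parameters and XML field names should be checked against the GOCDB PI documentation.

    # Minimal sketch: list SRM endpoints registered in GOCDB for one (example) site.
    import urllib.request
    import xml.etree.ElementTree as ET

    URL = ("https://goc.egi.eu/gocdbpi/public/"
           "?method=get_service_endpoint&service_type=SRM&sitename=FZK-LCG2")

    with urllib.request.urlopen(URL) as resp:
        tree = ET.parse(resp)

    # Print each registered SRM endpoint and whether it is flagged as in production.
    for ep in tree.getroot().findall("SERVICE_ENDPOINT"):
        print(ep.findtext("HOSTNAME"), "in_production =", ep.findtext("IN_PRODUCTION"))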

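Regarding the VOMS cron failures reported above, the sketch below shows how a cron job could detect the reported error explicitly and log the full client output, so that the VOMS endpoint at fault can be identified quickly. The error string is the one quoted in the timeline; the command options and the logging are assumptions, not the actual ATLAS setup.

    # Minimal sketch: run voms-proxy-init from a cron job and surface VOMS membership errors.
    import subprocess
    import sys

    CMD = ["voms-proxy-init", "-voms", "atlas"]

    result = subprocess.run(CMD, capture_output=True, text=True)
    output = result.stdout + result.stderr

    if result.returncode != 0:
        if "User unknown to this VO" in output:
            # The client output normally names the VOMS server that was contacted,
            # which is the key piece of information for reporting such an incident.
            print("VOMS membership error (likely server-side):", file=sys.stderr)
        print(output, file=sys.stderr)
        sys.exit(result.returncode)
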
  • CMS reports (raw view) -
    • Maintenance of CMS Oracle DBs at CERN on Tuesday
      • Will affect a number of central services; hopefully only a short interruption
      • Possible impact on central production, though likely small

  • ALICE
    • NTR

  • LHCb reports (raw view) -
    • New Tier-1 site (NCBJ)
      • The new Beijing site was also discussed during the meeting. Henryk pointed out that this site is still being tested and needs to pass an LHCb data challenge before becoming an official LHCb site.

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF: NTR
  • EGI:
  • FNAL:
  • IN2P3: Reminder about tomorrow's downtime.
    • update to dCache 9.2 -> SE in downtime in the morning (GOCDB #34782)
    • update to HTCondor 9.0.20 -> CE in downtime for the whole day (GOCDB #34793)
    • maintenance on HPSS -> tape system in downtime for the whole day (GOCDB #34795)
    • maintenance on a network switch: no impact expected if everything goes well, but a risky operation (GOCDB #34791)
  • JINR: NTR
  • KISTI:
  • KIT: Last week's downtime was very difficult, because 4 people fell ill and compensating for that was tough. We had finished most of the work by Thursday evening, though we added a follow-up downtime for the CMS SRM and all CEs (GOCDB #34832). As became clear this morning, thanks in part to the tickets GGUS:164594, GGUS:164595 and GGUS:164597, we had two more issues left:
    • The CRL cache was not updated, which is fixed now
    • HTCondor-CE-1 broke physically (GOCDB:34826) and we're working on a replacement.

  • NDGF: Last week's attempt to upgrade dCache to 9.2 was postponed due to a bug found in the ALICE xrootd door (the ATLAS xrootd door seems to be OK). Investigation is in progress and, if a solution is found, the dCache upgrade might be rescheduled to another date before Christmas. The scheduled PostgreSQL 12 -> 15 upgrade was completed, as well as the ZooKeeper cluster upgrade and an update of the dCache head nodes to 8.2.38.
  • NL-T1: The SARA-MATRIX dCache will be upgraded on January 8 and 9 (GOCDB:34839). A storage system at Nikhef had a RAID volume failure last week. The ATLAS local group disk at Nikhef was unavailable for about an hour, but no ATLAS data was lost.
  • NRC-KI:
  • OSG:
  • PIC: PIC will be in OUTAGE from 12-12-2023 08:00 (PIC local time) until 12-12-2023 15:00 (PIC local time). We will perform a dCache upgrade (to 8.2.38) and a migration of the virtualization platform. HTCondor will be stopped at the beginning of the scheduled downtime.
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services:
    • OTG0146692: upgrade of the HTCondor clients on lxplus9 on 18.12.2023
    • Remember that LXPLUS9 has been the default LXPLUS since 07.12.2023 (OTG0076556):
      • If you want to use CC7, you need to explicitly log in to lxplus7
      • Open a SNOW ticket if you find packages missing on LXPLUS9
      • Remember that you may need to recompile your code if you were building on CC7 and now plan to use EL9
  • CERN storage services:
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Distributed the draft SiteMon availability/reliability reports for November 2023.
  • Middleware:
    • New VOMS service for dteam available since Dec 7
      • It is the VOMS endpoint of the new IAM service for the VO
      • It should soon replace the current VOMS(-Admin) service of the VO
      • An EGI broadcast was sent asking sites to add the new LSC file (an illustrative LSC example follows this round table)
      • The VOMS client details will be broadcast in a few days
      • The new user registration details will be broadcast separately
  • Networks:
  • Security:
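
Regarding the new dteam VOMS endpoint mentioned under Middleware: an LSC file consists of two lines, the subject DN of the VOMS server's host certificate followed by the DN of its issuing CA, and it is typically placed under /etc/grid-security/vomsdir/dteam/<hostname>.lsc. The hostname and DNs below are purely illustrative; the actual values are those given in the EGI broadcast.

    /DC=ch/DC=cern/OU=computers/CN=voms-dteam.example.cern.ch
    /DC=ch/DC=cern/CN=CERN Grid Certification Authority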

AOB:
