Week of 150126


WLCG Operations Call details

At CERN the meeting room is 513 R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota
General information about the WLCG Service can be accessed from the Operations Web
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

local: Luca (SCOD+storage), Maarten (ALICE), Ben (grid services)
remote: Andrej (ATLAS), Mark (LHCb), Di (TRIUMF), Felix (ASGC), Alexander (NLT1), Lisa (FNAL), Rolf (IN2P3), Tiju (RAL), Antonio (CNAF), Sang Un Ahn (KISTI), Ulf (NDGF), Rob (OSG)

Experiments round table:

ATLAS (Andrej) -
- lack of running jobs due to small issues in PANDA and JEDI
- some issue with misconfigured jobs accessing ATLAS web instead of CVMFS

CMS -
- no report present

ALICE (Maarten) -
- continued high activity
  - new record today: 75k concurrent jobs!
- data loss at SARA (NLT1)
  - 108k ALICE files (~8 TB) were lost
  - catalog cleanup and re-replication have finished
  - 11804 files had a single replica there, none of which were very important anymore
    - mainly logs and intermediate QA results
  - SARA:
    - 1 week needed to recover the affected hardware (less storage available)
    - there is no possibility of recovering any of the files that were on the failed hardware

LHCb (Mark) -
- "Legacy Run1 Stripping" waiting on FTS transfers and recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
- T0: NTR
- T1: NTR

Sites / Services round table:

ASGC: NTR
BNL:
CNAF: NTR
FNAL: NTR
GridPP:
IN2P3: NTR
JINR:
KISTI: NTR
KIT: NTR
NDGF: NTR
NL-T1: NTR
NRC-KI:
OSG: NTR
PIC:
RAL: NTR
TRIUMF:
- 4 hours of scheduled downtime for power outage Thu 29th January from 1:00am to 5:00am CET

CERN batch and grid services: NTR
CERN storage services:
- Tue 27th January from 10:00am to 11:00am CET - CASTOR transparent update of the nameserver from 10
- Tue 27th January from 9:30am to 12:00am CET - EOSATLAS upgrade to EOS aquamarine
Databases:
GGUS:
Grid Monitoring:
MW Officer:

AOB:

Thursday

Attendance:

local: Luca (SCOD+storage), Maarten (ALICE), Manuel (grid services), Mark (LHCb), Pablo (GGUS)
remote: Andrej (ATLAS), Christoph (CMS), Felix (ASGC), Dennis (NLT1), Lisa (FNAL), Michael (BNL), Rolf (IN2P3), Thomas (KIT), Salvatore (CNAF), Jeremy (GridPP), Gareth (RAL), Rob (OSG)

Experiments round table:

ATLAS (Andrej) -
- issue with FTS3 callbacks not working properly, already in contact with FTS developers
- issue due to CONDOR update of the pilot factories (missing shared library)
- Reminder: next week is ATLAS Software and Computing week. Most probably, no one from ATLAS will be able to attend the daily meetings.

CMS (Christoph) -
- Main activity: Digi-Reco of upgrade (2020+) MC at Tier-1s
- Next round of tape staging tests at Tier-1s started today
  - PIC: GGUS:111478
  - RAL: GGUS:111477
- Consistency checks (Site inventory vs. CMS catalogue) at all Tier-1s ongoing

ALICE (Maarten) -
- continued high activity

LHCb (Mark) -
- "Legacy Run1 Stripping" waiting on FTS transfers and recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
- T0: NTR
- T1: Status of the incident report for the lost files at SARA

Sites / Services round table:

ASGC: NTR
BNL: NTR
CNAF: ongoing transparent upgrade of the Linux kernel and glibc
FNAL: found an issue in the latest GGUS release: a message was sent to a different mail address (see Pablo's explanation under GGUS)
GridPP: NTR
IN2P3: NTR
JINR:
KISTI:
KIT: NTR
NDGF:
NL-T1: NTR
NRC-KI:
OSG: NTR
PIC:
RAL:
- on Monday morning deploy hotfix for CASTOR
- on Tuesday switch to the network backup link for OPN
TRIUMF:

CERN batch and grid services:
- Reminder about VOMRS changes. A testing instance is available and experiments are welcome to test and provide feedback (see GGUS:110227)
- Reminder about closure of AFS UI on Monday 2nd Feb (more details here)
CERN storage services: NTR
Databases:
GGUS:
- New release done on the 28th of January.
  - All alarms have been acknowledged
  - Contact info for FNAL has changed in OIM from cms-t1-page@fnal.gov to joadiaz@fnal.gov. Is this the expected behaviour?
Grid Monitoring: NTR
MW Officer:

AOB:

-- AndreaSciaba - 2014-12-16

Topic revision: r15 - 2015-01-30 - LucaMascetti
