Week of 150803

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Christoph Wissing (CMS), Jiri Chudoba (ATLAS), Ben Jones (Grid&Batch), Mark Slater (LHCb), Belinda Chan Kwok Cheong (Storage)
  • remote: Dima Kovalskyi (CMS), Dimitri (KIT), Alexander Verkooijen (NL-T1), Michael Ernst (BNL), Gerard Bernabeu (FNAL), Gareth Smith (RAL), Antonio Falabella (CNAF), Sebastian Gadrat (IN2P3)

Experiments round table:

  • ATLAS reports (raw view) -
    • RAL network outage for 1 hour (affected mainly squid availability)
    • Monitoring shows the wrong status for Squid (the service is up, but monitoring reports it as down)

The discussion on how to improve the monitoring was taken offline between ATLAS and Ops Coordination.

  • CMS reports (raw view) -
    • Nothing to report.

  • ALICE -
    • high activity
    • CERN: much of the raw data currently being processed could not be accessed from CASTOR during the weekend.
      • The disk pool was under heavy load.
      • Some mitigation was applied by the CASTOR team.
      • A fix is scheduled for Tue Aug 4.

  • LHCb reports (raw view) -
    • Data Processing: data "stripping" productions ongoing, along with user and MC jobs.
    • T0
      • Discussion with the LSF team about wall and CPU time queue lengths is ongoing (GGUS:115027). Some progress has been made and bugs were found on both sides. Deployment of fixes is in progress.
    • T1
      • RAL: LAN network instabilities continue, CVMFS and data access problems over the weekend.

Sites / Services round table:

  • ASGC: NA
  • BNL: Very high tape-migration activity over the weekend: e.g. on Saturday a data volume of 80 TB (70k files) was moved to tape within 24 hours, the highest activity registered for tape storage so far.
  • CNAF: NTR
  • FNAL: There will be a downtime on Wednesday 5th August from 8h to 17h, during which all storage services will be rebooted.
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NA
  • KIT: A major downtime is planned from 30th September to 7th October, during which the whole site will be offline. GOCDB details will be given soon.
  • NDGF: Two upcoming warning downtimes for srm.ndgf.org, which affect the availability of some ALICE and ATLAS data.
  • NL-T1: Scheduled downtime tomorrow afternoon for the tape backend that will last 24h.
  • NRC-KI: NA
  • OSG: GGUS:115413 was opened against a site but is in fact related to the global Xrootd redirector. Maria will follow up after the meeting.
  • PIC: NA
  • RAL: An outage is scheduled tomorrow to check the routers that caused the network problems affecting RAL over the weekend. The T1 will be unavailable for 6h. Details in GOCDB.
  • TRIUMF: NA
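
As a quick sanity check on the BNL tape-migration figure above, the reported 80 TB / 70k files in 24 hours can be turned into average rates. This is a sketch only; it assumes decimal units (1 TB = 10^12 bytes), which the minutes do not specify.

```python
# Rough sanity check of the BNL tape-migration rate reported above
# (80 TB / 70k files in 24 hours).
# Assumption: decimal units, 1 TB = 1e12 bytes (not stated in the minutes).
total_bytes = 80e12
n_files = 70_000
seconds = 24 * 3600

throughput_gb_s = total_bytes / seconds / 1e9   # average GB/s to tape
avg_file_gb = total_bytes / n_files / 1e9       # average file size in GB

print(f"average throughput: {throughput_gb_s:.2f} GB/s")
print(f"average file size:  {avg_file_gb:.2f} GB")
```

This works out to roughly 0.9 GB/s sustained to tape with files averaging a little over 1 GB each.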

  • CERN batch and grid services: NTR
  • CERN storage services: A set of transparent CASTOR upgrades is taking place this week.
  • Databases: NA
  • GGUS: NA
  • Grid Monitoring:
    • Draft availability reports for July have been sent and are available in the SAM3 UI.
  • MW Officer: NA

AOB: None

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Christoph Wissing (CMS), Jiri Chudoba (ATLAS), Iain Bradford Steers (Grid&Batch), Mark Slater (LHCb), Belinda Chan Kwok Cheong (Storage), Asa and Cheng-Hsi (ASGC)
  • remote: Dennis van Dook (NL-T1), Michael Ernst (BNL), Gerard Bernabeu (FNAL), John Kelly (RAL), Sebastian Gadrat (IN2P3), Kyle Gross (OSG), Sang Uhn (KISTI), Peter Urkedal (NDGF), Di Qing (TRIUMF)

Experiments round table:

  • CMS reports (raw view) -
    • High activity due to a new wave of MC requests
      • More than 100k parallel jobs in the system
      • No major issues
    • Follow-up on the xrootd redirector issue from Monday (GGUS:115413, still needs to be updated)
    • Last weekend the US regional redirector(s) ran into trouble
    • On Monday the global CMS redirector at CERN ran into trouble (GGUS:115437)
      • One host (behind the alias) has already been upgraded to a more recent xrootd version
  • ALICE -
    • high activity
    • CERN: thousands of files are still persistently inaccessible from CASTOR
      • reported as 'STAGED', but inaccessible from the WN or interactively (via xrootd or rfcp)
      • many retries were done over several days, also after the interventions earlier this week
      • the CASTOR team is aware

  • LHCb reports (raw view) -
    • Data Processing: data "stripping" productions ongoing, along with user and MC jobs.
    • T0
      • Discussion with the LSF team about wall and CPU time queue lengths is ongoing (GGUS:115027). Some progress has been made and bugs were found on both sides. Deployment of fixes is in progress.
    • T1
      • RAL: All network problems appear to be solved.

Sites / Services round table:

  • ASGC: Scheduled downtime on Monday 10th August affecting several services. For more details, please check GOCDB
  • BNL: NTR
  • CNAF: NA
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NTR
  • KIT: NA
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: Follow-up on the ticket mentioned on Monday (GGUS:115413): things seem to be better.
  • PIC: NA
  • RAL: The network intervention finished successfully and there are now redundant routers.
  • TRIUMF: NTR

  • CERN batch and grid services: CEs were down today due to an intervention on the LSF filers that took longer than expected. Submission is back online and everything is working as expected. More details in the SSB.
  • CERN storage services:
    • Upgrade of the CMS AAA redirectors to XRootD 4.2.2: the 2nd node will be upgraded next Tuesday; so far we have not experienced any lockup of the cmsd on the upgraded node.
  • Databases: NA
  • GGUS: NA
  • Grid Monitoring:
  • MW Officer: NA

AOB: None

Topic revision: r12 - 2015-08-11 - unknown
 