Week of 130722
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The scod rota for the next few weeks is at
ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: AndreaS, Steve, Maarten, Ken/CMS, Eddie, Massimo
- remote: Alexander/NL-T1, Michael/BNL, Peter/ATLAS, Xavier/KIT, Saverio/CNAF, Boris/NDGF, Nadia/IN2P3-CC, Kyle/OSG, Tiju/RAL, Lisa/FNAL, Wei-Jen/ASGC, Sang-Un/KISTI
Experiments round table:
- ATLAS reports (raw view) -
- T0/Central services
- T1
- BNL file recovery progressing
- RAL disk server problem. All files successfully recovered.
- SARA transfer errors ("Failed to contact on remote SRM"), GGUS:95911. Fixed
- CMS reports (raw view) -
- Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
- From last week, RAL GGUS:95825 was declared a Castor bug.
- Castor T1TRANSFER service was degraded again over the weekend, but this is still not a real problem; it probably deserves some discussion with CMS as to what the shifter instructions should be. [Massimo: the problem is related to an increasing mismatch between the CASTOR tuning and the evolution of the CMS usage patterns; it would be good to have a meeting to discuss these issues.]
- GGUS:95887 to T1_FR_IN2P3 is about file read errors. The ticket came in on Friday; the site started working on it today.
- Wigner T-Systems 100 Gbit link down for several hours over the weekend. [Maarten: Wigner vs. the CERN computer centre should be transparent: do not worry too much about SLS errors unless they have a direct effect. Ken: will check whether there was any impact on CMS.]
- ALICE -
- CERN: all CERN CEs started refusing proxies of ALICE and other VOs around 18:20 Sunday evening; fixed around 09:15 Monday morning (TEAM ticket GGUS:95914). [Maarten: the problem was most likely related to a bad ARGUS configuration and was solved by Ulrich, but there are no details in the ticket.]
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3-CC: ntr
- KIT: ntr; just a reminder of the downtime starting tomorrow, when the queues will be closed to update the SE for LHCb
- KISTI: ntr
- NDGF: SRM downtime tomorrow from 7:00 to 9:00, to update dCache and install a new host certificate; some disk pools will also be updated.
- NL-T1: ntr
- RAL: tomorrow, an "at risk" intervention on the site firewall, followed from 10:00 to 16:00 by a CASTOR upgrade for LHCb and CMS
- OSG: ntr
- CERN batch and grid services: ntr
- CERN storage services: just upgraded SRM for all experiments (apart from LHCb, which was upgraded last week). The upgrade should have been transparent.
- Dashboards: ntr
AOB:
Thursday
Attendance:
- local: Eddie, Eva, Luca M, Maarten, Marcin, Nilo, Steve
- remote: Boris, Dennis, Federico, Gareth, Kyle, Lisa, Matteo, Michael, Nadia, Pavel, Peter, Sang-Un, Stefano, Wei-Jen
Experiments round table:
- ATLAS reports (raw view) -
- T0/Central services
- ATLR database outage yesterday evening; looks fine now (INCIDENT)
- Marcin:
- since ~19:00 there was a problem with the NFS filer serving several DBs including ATLR
- the fail-over mechanism was unable to move the load from one node to another
- the affected DBs thus remained inaccessible
- a priority-one case was opened with NetApp
- the problem was solved shortly after midnight
- there was no data loss
- a hardware problem was identified
- spare parts are expected this Friday
- Peter: are the DBs currently at risk?
- Marcin: yes, fail-over is not possible at the moment
- T1
- All PRODDISK endpoints except CERN and BNL have been removed from ATLAS data management
- FZK and DE T2 sites now configured to use FTS3
- CMS reports (raw view) -
- T1 disk/tape separation is getting done (RAL is the first site) and we are starting to nail down the details of how to send user analysis jobs to T1s using pilots
- Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
- GGUS:95887 to T1_FR_IN2P3 about file read errors was solved by reconfiguring dCache there (more time for dCache movers). The details are not understood.
- GGUS:96102: file read error at T1_UK_RAL. The site is now looking into this.
- nothing else worth mentioning
- LHCb reports (raw view) -
- Mostly MC productions ongoing, tail of reprocessing and restripping campaign
- T0:
- CERN: Issues with Oracle DBs
- T1:
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF
- ATLAS job failures due to lack of disk space are being investigated
- CMS file system maintenance: queues are being drained and HammerCloud tests will fail; the intervention should be finished tomorrow
- FNAL - ntr
- IN2P3 - ntr
- KISTI
- the second half of the 2013 CPU pledge will become available in Sep
- KIT
- still in scheduled downtime, no problems observed so far; expected to finish a bit earlier than announced
- main firewall upgraded OK
- dCache work ongoing
- GPFS, CVMFS, Xrootd, FTS, ... done
- NDGF
- yesterday there was a problem when ATLAS tried writing to disks in Sweden for which SRM reported the wrong size: the disks became full; the matter is being investigated
- NL-T1 - ntr
- OSG - ntr
- RAL
- CASTOR upgrade for CMS and LHCb went OK on Tue
- generic instance (also used by ALICE) will be done next Tue
- a batch server problem is being investigated
- dashboards - ntr
- databases
- grid services
- yesterday 4 of the 8 CREAM CEs were upgraded to EMI-3
- storage - ntr
AOB: