Week of 130624

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs: WLCG Service Incident Reports
Broadcasts: Broadcast archive
Operations Web: Operations Web

General Information

General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: (MariaD (SCOD), Maarten, Luc (ATLAS), Luca, Luca, Massimo, Gavin)
  • remote: (Torre, Onno, Daniela (LHCb), Xavier, Tiju, Paolo, Wei-Jen, Jose Hernandez (CMS), Lisa, Michael, Rob, Rolf)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Hammercloud unavailable, GGUS:95033 & INC:323943. The DB is unavailable (an increase in the number of established sessions, with no release of the connections). Ongoing; additional info on this private DB is now in the ticket. (A generic illustration of this kind of connection leak is sketched after this report.)
      • CERN-PROD: jobs failing with stage-out errors, GGUS:94978. Solved: an LDAP lookup got stuck; a bugfix was deployed. Luca said that if another crash is experienced, there will be an immediate update with the fix.
    • T1
      • FZK "put errors" GGUS:95021. jobs/transfers are forced onto the same disk cluster, because all other disks are full. Only 20TB left over the we. Space made up to 60TB.
      • TW-LCG2 transfer errors due to timeout from Taiwan back to BNL. Ongoing
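
A generic illustration of the connection-leak pattern behind GGUS:95033 (not taken from the actual HammerCloud code; the sqlite3 module and the queries are placeholders): sessions that are opened and never released accumulate until the database refuses new connections, while a context manager guarantees the release even when a query fails.

    import sqlite3
    from contextlib import closing

    def leaky_query(dsn, sql):
        # Anti-pattern: the connection is never closed, so each call
        # leaves one more established session behind.
        conn = sqlite3.connect(dsn)
        return conn.execute(sql).fetchall()

    def safe_query(dsn, sql):
        # closing() releases the connection even if execute() raises.
        with closing(sqlite3.connect(dsn)) as conn:
            return conn.execute(sql).fetchall()

    if __name__ == "__main__":
        print(safe_query(":memory:", "SELECT 1"))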

  • CMS reports (raw view) -
    • Quiet weekend
    • Intermittent CERN SRM CASTOR unavailability in the SLS monitoring (INC:323167), caused by heavy activity via SRM to the t1transfer CASTOR pool, which overloads the pool and the SRM endpoint, which then times out. CMS operations are looking at this. Luca said the intense FNAL-to-CERN traffic observed is most probably due to FTS transfers.

  • ALICE -
    • CERN: the use of an outdated table in the AliEn DB caused the vast majority of jobs to try reading their input data from SEs at Russian sites instead of from EOS; this led to very bad job efficiency from Saturday through Sunday morning, when the problem was fixed.
    • CERN: EOS-ALICE became unavailable on Sunday around 17:30, apparently due to reverse DNS lookups for a particular client machine taking a long time (GGUS:95039); the developers will look into making the relevant code more robust in that respect. Meanwhile the problematic client was blocked and the service was available again shortly after midnight, thanks!
      • Luca said CERN/IT/CS experts confirmed the DNS record expiration (TTL) is as low as 60 seconds instead of 3 hours; they are checking whether this can be changed. (A sketch of one way to bound the reverse-lookup time follows this list.)
    • CNAF: corrupted files were found when part of the 2010 raw data was staged in again to prepare for a reprocessing campaign; it is not yet clear if the data is already corrupted on tape.
      • A ticket was opened after the meeting: GGUS:95073
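
One way the "more robust" handling mentioned above could look, as a minimal sketch only (this is not the actual EOS code; the two-second budget and the fallback to the raw IP are illustrative choices): bound the time spent on a reverse DNS lookup so that a slow resolver cannot stall the service.

    import socket
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    _resolver_pool = ThreadPoolExecutor(max_workers=4)

    def reverse_lookup(ip, timeout=2.0):
        """Return the hostname for ip, or ip itself if the lookup
        does not finish within `timeout` seconds."""
        future = _resolver_pool.submit(socket.gethostbyaddr, ip)
        try:
            hostname, _aliases, _addrs = future.result(timeout=timeout)
            return hostname
        except (TimeoutError, socket.herror, socket.gaierror):
            return ip  # degrade gracefully instead of blocking the service

    if __name__ == "__main__":
        print(reverse_lookup("127.0.0.1"))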

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
    • T1:
      • GridKa: The server for the major staging pool is out of operation; we are still staging at a reasonable rate, but 30% of the jobs fail to access data from that pool (Input Data Resolution errors).
      • SARA: Jobs fail with "segmentation violation" (GGUS:95056). All of the failing jobs ran on worker node v33-39.gina.sara.nl.
      • CERN: SetupProject.sh timed out around midnight; it now seems to have recovered.

Sites / Services round table:

  • ASGC: Nothing other than the on-going investigation of the ATLAS problem.
  • BNL: Michael provided the following written report. BNL spoke about this issue to ZFS recovery experts, who would have been able to repair the corruption if recovery were feasible. They confirmed that when the only known location of the rootbp is corrupted, there is no chance of recovery: reconstructing the tree without a root block and without a known location of older instances of the root block is impossible. A list of the files on this pool was prepared and handed over to ATLAS DDM operations. Michael agreed to turn the report into a SIR and link it from WLCGServiceIncidents. MariaD regretted that we no longer have a forum for discussing SIRs; Maarten said the WLCG MB has such an agenda item. The details:
    1. A 64 TB storage pool stopped working on Thursday evening. As BNL is using Solaris and ZFS, the Oracle support service was asked to investigate the problem. They found there was only one valid uberblock across the labels of each LUN; furthermore, the areas of the labels where the other uberblocks should have been (the 128 KB sequential areas, each containing 128 uberblocks placed right next to one another) were seemingly overwritten by random data of unknown origin.
    2. The rootbp Data Virtual Addresses from this valid uberblock were pointing to a 512-byte range that contained all zeros, and not to a valid L0 DMU Objset structure, making it impossible to reconstruct the tree structure of ZFS. (A sketch of how the uberblock array can be inspected follows this list.)
    3. The corruption was most likely caused by the hardware controller of the backend SAN, but we cannot really tell which part: the write cache or an issue in the controller firmware are both possible causes. In essence, the failure resulted either in writes being lost (which would explain the lack of data at the root block pointer location) or in writes randomly going to different blocks than requested (which could explain the corruption of the uberblock structure).
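
As a minimal sketch of the on-disk structures described above (assuming the standard ZFS layout, where each of the four 256 KiB vdev labels ends with a 128 KiB array of 128 uberblocks; the device path is hypothetical and this is only an illustration, not the procedure the Oracle experts used): scan the first label of a LUN and report which uberblock slots still carry the uberblock magic number. On the damaged BNL LUNs, all but one slot had apparently been overwritten with unrelated data.

    import struct
    import sys

    UB_ARRAY_OFFSET = 128 * 1024   # the uberblock array is the last 128 KiB of the 256 KiB label
    UB_ARRAY_SIZE = 128 * 1024
    UB_SIZE = 1024                 # 128 uberblocks of 1 KiB each
    UB_MAGIC = 0x00BAB10C          # ZFS uberblock magic number

    def scan_uberblocks(device_path):
        """Return the slot numbers (0-127) whose first 8 bytes match the
        uberblock magic, in either byte order, for the first vdev label."""
        with open(device_path, "rb") as dev:
            dev.seek(UB_ARRAY_OFFSET)   # the first label starts at device offset 0
            array = dev.read(UB_ARRAY_SIZE)
        valid = []
        for slot in range(128):
            raw = array[slot * UB_SIZE: slot * UB_SIZE + 8]
            if len(raw) < 8:
                break
            little, = struct.unpack("<Q", raw)
            big, = struct.unpack(">Q", raw)
            if UB_MAGIC in (little, big):
                valid.append(slot)
        return valid

    if __name__ == "__main__":
        slots = scan_uberblocks(sys.argv[1])   # pass a raw LUN device path (hypothetical)
        print("slots with uberblock magic:", slots if slots else "none")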

  • CNAF: The problems experienced by ALICE (for which no GGUS ticket existed at the time of the meeting) and LHCb are being investigated.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: A tape management software package crashed yesterday; staging was impossible until this morning. This is now fixed.
  • NDGF: nobody connected
  • NL_T1: A problem with dCache pool nodes made files unavailable for less than one hour. Scheduled for July 1st: two unrelated downtimes will run simultaneously for the least disturbance, namely an SRM intervention between 7:30 and 8:00 UTC and an 'at risk' period between 7:30 and 11:30 UTC for access to files on tape.
  • OSG: ntr
  • PIC: not connected
  • RAL: ntr

  • GGUS: ntr

AOB: None

Thursday

Attendance:

  • local: (MariaD (SCOD), Ivan, Luca, Gavin, Maarten)
  • remote: (Torre (ATLAS), Michael Ernst (BNL), Xavier (KIT), Lisa (FNAL), Wei-jen (ASGC), David Mason (CMS), John Kelly (RAL), Thomas Bellman (NDGF), Daniela (LHCb), Ronald (NL_T1), Paolo Franchini (CNAF), Rob Quick (OSG), Rolf (IN2P3), Pepe (PIC))

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Monday: Hammercloud restored by DB restart. HC team following up on origin of the problem (spike in DB connections with no release of the connections). GGUS:95033 & INC:323943
    • T1
      • SARA-MATRIX: SRM problems with transfers to and (more recently) from SARA-MATRIX; the site has been working on it. GGUS:95071
      • FZK-LCG2_DATADISK: a missing file reported by the DDM team; the site was asked for info. GGUS:95092
      • IN2P3-CC file transfer failures, OK after SRM restart, origin of the problem not clear, site keeping an eye on it. GGUS:95093
      • RAL: short-lived problem with the SRM yesterday due to an unstable DB; fixed quickly before any GGUS ticket was needed.
      • Addressing losses from unrecoverably corrupted disk pool at BNL, ~618k files lost affecting production and analysis. BNL issued incident report.
Rolf said that IN2P3 established an automatic SRM recovery procedure to avoid a repeat of the situation in GGUS:95093. MariaD suggested removing GGUS:91266 from the "On-going issues" list on the ADCOperations twiki.

  • CMS reports (raw view) -
    • A less-than-happy day on the 26th: CMS site configurations were corrupted for a few hours due to the deletion of a needed symlink directory in the transition from CVS to git. As a result, new jobs submitted during that time could not find the trivial file catalog for our sites. The problem has now been fixed on the CMS side. Site readiness metrics were affected and will be corrected, as it was a CMS issue.

  • ALICE -
    • CERN: many thousands of "ghost" jobs were found keeping job slots occupied, because aria2c torrent processes were not exiting after their 15-minute lifetime, which is configured through a command-line parameter. We do not know why this misbehaviour suddenly appeared and are not aware of changes in that area; still looking into it. In the meantime the stale jobs have been killed directly with "bkill". (A sketch of how such stale processes can be spotted follows this list.)
    • IN2P3: 2 new disk servers were not reachable externally; fixed
    • KIT: the cap on concurrent jobs had to be lowered again a few times to avoid overloading the firewall; currently it is at 4k, and higher values will be tried when possible.
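
A minimal sketch of how such stale seeders could be spotted (this is not the actual AliEn/LSF machinery; it assumes the third-party psutil package is available, and it only reports the processes rather than killing them). On the batch farm the PIDs found this way could then be mapped to their jobs and cleaned up with bkill, as was done here.

    import time
    import psutil  # third-party: pip install psutil

    LIFETIME_SECONDS = 15 * 60  # the lifetime the aria2c seeders are configured with

    def find_stale_seeders(now=None):
        """Return aria2c processes that have been running longer than the
        configured lifetime."""
        now = now or time.time()
        stale = []
        for proc in psutil.process_iter(["pid", "name", "create_time", "username"]):
            try:
                if proc.info["name"] == "aria2c" and now - proc.info["create_time"] > LIFETIME_SECONDS:
                    stale.append(proc)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue  # the process vanished or belongs to someone else
        return stale

    if __name__ == "__main__":
        for proc in find_stale_seeders():
            age_min = (time.time() - proc.info["create_time"]) / 60.0
            print("stale aria2c pid={0} user={1} age={2:.0f} min".format(
                proc.info["pid"], proc.info["username"], age_min))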

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
    • T1:
      • GridKa: Ticket (GGUS:95135) opened for a slow staging rate. It is the only site lagging behind with the restripping, with rates of 5 TB/day since last weekend. Before that we had a good staging rate of ~250 MB/s for a couple of days.
      • SARA: No more jobs failing with "segmentation violation". (GGUS:95056)
      • RAL: CVMFS problem (the Stratum 1 is down at the moment); the site contacts are aware and working on a solution. No GGUS ticket has been submitted yet. MC jobs are affected.
      • CNAF: Ticket (GGUS:95059) closed.
    • T2:
      • IN2P3-CPPM (GGUS:94890): recurring problem of aborted pilots; the blparser gets stuck and is restarted almost every day.

Sites / Services round table:

  • ASGC: ntr
  • BNL: The SIR is now available at WLCGServiceIncidents#Q2_2013. Many database and hardware experts were consulted about this incident and could not explain how the RAID controller could corrupt the filesystem. Experts in the room reported that CASTOR and ALICE have experienced similar RAID controller failures with severe consequences; the only safe approach is systematic backup. Michael reported that BNL now take snapshots of critical metadata on disk every 30 seconds to be on the safe side. The number of lost files, if any, that have no copies elsewhere will be known in about a week.
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: nta
  • KIT: ntr
  • NDGF: ntr
  • NL_T1: The SRM instability was due to a dCache authentication server having been moved to a node on which an uninformed team member had started maintenance work. Reminder of the already announced scheduled interventions of July 1st: two unrelated downtimes will run simultaneously for the least disturbance, namely an SRM intervention between 7:30 and 8:00 UTC and an 'at risk' period between 7:30 and 11:30 UTC for access to files on tape.
  • OSG: ntr
  • PIC: Now migrating 20% of the farm to SL6; it will enter production next Monday, and the experiments will then be invited to use these resources. The plan is to migrate the whole infrastructure to SL6 by mid-July. The WNs will run EMI-2.
  • RAL: The CVMFS server died and the fail-over procedure didn't work. A manual reconfiguration took place today. There will be a scheduled CASTOR outage next Wednesday.

  • CERN: Dashboards, Grid and Storage services were represented but they had ntr.

AOB:
