Week of 130624

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs: WLCG Service Incident Reports
Broadcasts: Broadcast archive
Operations Web: Operations Web

General Information

General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: (MariaD (SCOD), Maarten, Luc (ATLAS), Luca, Luca, Massimo, Gavin)
  • remote: (Torre, Onno, Daniela (LHCb), Xavier, Tiju, Paolo, Wei-Jen, Jose Hernandez (CMS), Lisa, Michael, Rob, Rolf)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Hammercloud unavailable, GGUS:95033 & INC:323943. The DB is unavailable (an increase in the number of established sessions, with no release of the connections). Ongoing; additional info on this private DB is now in the ticket. (A generic illustration of this kind of connection leak is sketched after this report.)
      • CERN-PROD: jobs failing with stage-out errors, GGUS:94978. Solved: an LDAP lookup got stuck; a bugfix was deployed. Luca said that if another crash is experienced, there will be an immediate update with the fix.
    • T1
      • FZK "put errors" GGUS:95021. jobs/transfers are forced onto the same disk cluster, because all other disks are full. Only 20TB left over the we. Space made up to 60TB.
      • TW-LCG2 transfer errors due to timeout from Taiwan back to BNL. Ongoing
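
A generic illustration of the connection-leak pattern behind GGUS:95033 (not taken from the actual HammerCloud code; the sqlite3 module and the queries are placeholders): sessions that are opened and never released accumulate until the database refuses new connections, while a context manager guarantees the release even when a query fails.

    import sqlite3
    from contextlib import closing

    def leaky_query(dsn, sql):
        # Anti-pattern: the connection is never closed, so each call
        # leaves one more established session behind.
        conn = sqlite3.connect(dsn)
        return conn.execute(sql).fetchall()

    def safe_query(dsn, sql):
        # closing() releases the connection even if execute() raises.
        with closing(sqlite3.connect(dsn)) as conn:
            return conn.execute(sql).fetchall()

    if __name__ == "__main__":
        print(safe_query(":memory:", "SELECT 1"))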

  • CMS reports (raw view) -
    • Quiet weekend
    • Intermittent CERN SRM CASTOR unavailability in the SLS monitoring (INC:323167), caused by heavy activity via SRM to the t1transfer CASTOR pool, which overloads the pool and the SRM endpoint, which then times out. CMS operations are looking at this. Luca said the intense FNAL-to-CERN traffic observed is most probably due to FTS transfers.

  • ALICE -
    • CERN: the use of an outdated table in the AliEn DB caused the vast majority of jobs to try reading their input data from SEs at Russian sites instead of from EOS; this led to very bad job efficiency from Saturday through Sunday morning, when the problem was fixed.
    • CERN: EOS-ALICE became unavailable on Sunday around 17:30, apparently due to reverse DNS lookups for a particular client machine taking a long time (GGUS:95039); the developers will look into making the relevant code more robust in that respect. Meanwhile the problematic client was blocked and the service was available again shortly after midnight, thanks!
      • Luca said CERN/IT/CS experts confirmed the DNS record expiration (TTL) is as low as 60 seconds instead of 3 hours; they are checking whether this can be changed. (A sketch of one way to bound the reverse-lookup time follows this list.)
    • CNAF: corrupted files were found when part of the 2010 raw data was staged in again to prepare for a reprocessing campaign; it is not yet clear if the data is already corrupted on tape.
      • A ticket was opened after the meeting: GGUS:95073
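
One way the "more robust" handling mentioned above could look, as a minimal sketch only (this is not the actual EOS code; the two-second budget and the fallback to the raw IP are illustrative choices): bound the time spent on a reverse DNS lookup so that a slow resolver cannot stall the service.

    import socket
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    _resolver_pool = ThreadPoolExecutor(max_workers=4)

    def reverse_lookup(ip, timeout=2.0):
        """Return the hostname for ip, or ip itself if the lookup
        does not finish within `timeout` seconds."""
        future = _resolver_pool.submit(socket.gethostbyaddr, ip)
        try:
            hostname, _aliases, _addrs = future.result(timeout=timeout)
            return hostname
        except (TimeoutError, socket.herror, socket.gaierror):
            return ip  # degrade gracefully instead of blocking the service

    if __name__ == "__main__":
        print(reverse_lookup("127.0.0.1"))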

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
    • T1:
      • GridKa: The server for the major staging pool is out of operation; we are still staging at a reasonable rate, but 30% of the jobs fail to access data from that pool (Input Data Resolution errors).
      • SARA: Jobs fail with "segmentation violation" (GGUS:95056). All of the failing jobs ran on worker node v33-39.gina.sara.nl.
      • CERN: SetupProject.sh timed out around midnight; it now seems to have recovered.

Sites / Services round table:

  • ASGC: Nothing other than the on-going investigation of the ATLAS problem.
  • BNL: Michael provided the following written report. BNL spoke about this issue to ZFS recovery experts, who would have been able to repair the corruption if recovery were feasible. They confirmed that when the only known location of the rootbp is corrupted, there is no chance of recovery: reconstructing the tree without a root block and without a known location of older instances of the root block is impossible. A list of the files on this pool was prepared and handed over to ATLAS DDM operations. Michael agreed to turn the report into a SIR and link it from WLCGServiceIncidents. MariaD regretted that we no longer have a forum for discussing SIRs; Maarten said the WLCG MB has such an agenda item. The details:
    1. A 64 TB storage pool stopped working on Thursday evening. As BNL is using Solaris and ZFS, the Oracle support service was asked to investigate the problem. They found there was only one valid uberblock across the labels of each LUN; furthermore, the areas of the labels where the other uberblocks should have been (the 128 KB sequential areas, each containing 128 uberblocks placed right next to one another) were seemingly overwritten by random data of unknown origin.
    2. The rootbp Data Virtual Addresses from this valid uberblock were pointing to a 512-byte range that contained all zeros, and not to a valid L0 DMU Objset structure, making it impossible to reconstruct the tree structure of ZFS. (A sketch of how the uberblock array can be inspected follows this list.)
    3. The corruption was most likely caused by the hardware controller of the backend SAN, but we cannot really tell which part: the write cache or an issue in the controller firmware are both possible causes. In essence, the failure resulted either in writes being lost (which would explain the lack of data at the root block pointer location) or in writes randomly going to different blocks than requested (which could explain the corruption of the uberblock structure).
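
As a minimal sketch of the on-disk structures described above (assuming the standard ZFS layout, where each of the four 256 KiB vdev labels ends with a 128 KiB array of 128 uberblocks; the device path is hypothetical and this is only an illustration, not the procedure the Oracle experts used): scan the first label of a LUN and report which uberblock slots still carry the uberblock magic number. On the damaged BNL LUNs, all but one slot had apparently been overwritten with unrelated data.

    import struct
    import sys

    UB_ARRAY_OFFSET = 128 * 1024   # the uberblock array is the last 128 KiB of the 256 KiB label
    UB_ARRAY_SIZE = 128 * 1024
    UB_SIZE = 1024                 # 128 uberblocks of 1 KiB each
    UB_MAGIC = 0x00BAB10C          # ZFS uberblock magic number

    def scan_uberblocks(device_path):
        """Return the slot numbers (0-127) whose first 8 bytes match the
        uberblock magic, in either byte order, for the first vdev label."""
        with open(device_path, "rb") as dev:
            dev.seek(UB_ARRAY_OFFSET)   # the first label starts at device offset 0
            array = dev.read(UB_ARRAY_SIZE)
        valid = []
        for slot in range(128):
            raw = array[slot * UB_SIZE: slot * UB_SIZE + 8]
            if len(raw) < 8:
                break
            little, = struct.unpack("<Q", raw)
            big, = struct.unpack(">Q", raw)
            if UB_MAGIC in (little, big):
                valid.append(slot)
        return valid

    if __name__ == "__main__":
        slots = scan_uberblocks(sys.argv[1])   # pass a raw LUN device path (hypothetical)
        print("slots with uberblock magic:", slots if slots else "none")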

  • CNAF: The problems experienced by ALICE (for which no GGUS ticket existed at the time of the meeting) and LHCb are being investigated.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: A tape management software package crashed yesterday; staging was impossible until this morning. This is now fixed.
  • NDGF: nobody connected
  • NL_T1: A problem with dCache pool nodes made files unavailable for less than one hour. Scheduled for July 1st: two unrelated downtimes will run simultaneously for the least disturbance, namely an SRM intervention between 7:30 and 8:00 UTC and an 'at risk' period between 7:30 and 11:30 UTC for access to files on tape.
  • OSG: ntr
  • PIC: not connected
  • RAL: ntr

  • GGUS: ntr

AOB: None

Thursday

Attendance:

  • local: (MariaD (SCOD), Ivan, Luca, Gavin, Maarten)
  • remote: (Torre (ATLAS), Michael Ernst (BNL), Xavier (KIT), Lisa (FNAL), Wei-jen (ASGC), David Mason (CMS), John Kelly (RAL), Thomas Bellman (NDGF), Daniela (LHCb), Ronald (NL_T1), Paolo Franchini (CNAF), Rob Quick (OSG), Rolf (IN2P3), Pepe (PIC))

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Monday: Hammercloud restored by DB restart. HC team following up on origin of the problem (spike in DB connections with no release of the connections). GGUS:95033 & INC:323943
    • T1
      • SARA-MATRIX: SRM problems with transfers to and (more recently) from SARA-MATRIX; the site has been working on it. GGUS:95071
      • FZK-LCG2_DATADISK: a missing file reported by the DDM team; the site was asked for info. GGUS:95092
      • IN2P3-CC file transfer failures, OK after SRM restart, origin of the problem not clear, site keeping an eye on it. GGUS:95093
      • RAL: short-lived problem with the SRM yesterday due to an unstable DB; fixed quickly before any GGUS ticket was needed.
      • Addressing losses from unrecoverably corrupted disk pool at BNL, ~618k files lost affecting production and analysis. BNL issued incident report.
Rolf said that IN2P3 established an automatic SRM recovery procedure to avoid a repeat of the situation in GGUS:95093. MariaD suggested removing GGUS:91266 from the "On-going issues" list on the ADCOperations twiki.

  • CMS reports (raw view) -
    • A less-than-happy day on the 26th: CMS site configurations were corrupted for a few hours due to the deletion of a needed symlink directory in the transition from CVS to git. As a result, new jobs submitted during that time could not find the trivial file catalog for our sites. The problem has now been fixed on the CMS side. Site readiness metrics were affected and will be corrected, as it was a CMS issue.

  • ALICE -
    • CERN: many thousands of "ghost" jobs were found keeping job slots occupied, because aria2c torrent processes were not exiting after their 15-minute lifetime, which is configured through a command-line parameter. We do not know why this misbehaviour suddenly appeared and are not aware of changes in that area; still looking into it. In the meantime the stale jobs have been killed directly with "bkill". (A sketch of how such stale processes can be spotted follows this list.)
    • IN2P3: 2 new disk servers were not reachable externally; fixed
    • KIT: the cap on concurrent jobs had to be lowered again a few times to avoid overloading the firewall; currently it is at 4k, and higher values will be tried when possible.
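
A minimal sketch of how such stale seeders could be spotted (this is not the actual AliEn/LSF machinery; it assumes the third-party psutil package is available, and it only reports the processes rather than killing them). On the batch farm the PIDs found this way could then be mapped to their jobs and cleaned up with bkill, as was done here.

    import time
    import psutil  # third-party: pip install psutil

    LIFETIME_SECONDS = 15 * 60  # the lifetime the aria2c seeders are configured with

    def find_stale_seeders(now=None):
        """Return aria2c processes that have been running longer than the
        configured lifetime."""
        now = now or time.time()
        stale = []
        for proc in psutil.process_iter(["pid", "name", "create_time", "username"]):
            try:
                if proc.info["name"] == "aria2c" and now - proc.info["create_time"] > LIFETIME_SECONDS:
                    stale.append(proc)
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue  # the process vanished or belongs to someone else
        return stale

    if __name__ == "__main__":
        for proc in find_stale_seeders():
            age_min = (time.time() - proc.info["create_time"]) / 60.0
            print("stale aria2c pid={0} user={1} age={2:.0f} min".format(
                proc.info["pid"], proc.info["username"], age_min))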

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
    • T1:
      • GridKa: Ticket (GGUS:95135) opened for a slow staging rate. It is the only site lagging behind with the restripping, with rates of 5 TB/day since last weekend. Before that we had a good staging rate of ~250 MB/s for a couple of days.
      • SARA: No more jobs failing with "segmentation violation". (GGUS:95056)
      • RAL: CVMFS problem (the Stratum 1 is down at the moment); the site contacts are aware and working on a solution. No GGUS ticket has been submitted yet. MC jobs are affected.
      • CNAF: Ticket (GGUS:95059) closed.
    • T2:
      • IN2P3-CPPM (GGUS:94890): recurring problem of aborted pilots; the blparser gets stuck and is restarted almost every day.

Sites / Services round table:

  • ASGC: ntr
  • BNL: The SIR is now available at WLCGServiceIncidents#Q2_2013. Many database and hardware experts were consulted about this incident and could not explain how the RAID controller could corrupt the filesystem. Experts in the room reported that CASTOR and ALICE have experienced similar RAID controller failures with severe consequences; the only safe approach is systematic backup. Michael reported that BNL now take snapshots of critical metadata on disk every 30 seconds to be on the safe side. The number of lost files, if any, that have no copies elsewhere will be known in about a week.
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: nta
  • KIT: ntr
  • NDGF: ntr
  • NL_T1: The SRM instability was due to a dCache authentication server having been moved to a node on which an uninformed team member had started maintenance work. Reminder of the already announced scheduled interventions of July 1st: two unrelated downtimes will run simultaneously for the least disturbance, namely an SRM intervention between 7:30 and 8:00 UTC and an 'at risk' period between 7:30 and 11:30 UTC for access to files on tape.
  • OSG: ntr
  • PIC: Now migrating 20% of the farm to SL6; it will enter production next Monday, and the experiments will then be invited to use these resources. The plan is to migrate the whole infrastructure to SL6 by mid-July. The WNs will run EMI-2.
  • RAL: The CVMFS server died and the fail-over procedure didn't work. A manual reconfiguration took place today. There will be a scheduled CASTOR outage next Wednesday.

  • CERN: Dashboards, Grid and Storage services were represented but they had ntr.

AOB:
