Week of 131104

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO summaries of site usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web

General Information

  • General information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS information: GgusInformation
  • LHC machine information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Simone (SCOD), Maarten (Alice), Raja (LHCb), Maria (GGUS ex-officio), Xavi (CERN-DSS), Jerome (CERN-PES), Ignacio (CERN-PES), Zbyszek (CERN-DB), Alessandro (ATLAS)
  • remote: Felix (ASGC), Xavier (KIT), Sang-Un (KISTI), Ulf (NDGF), Michael (BNL), Rolf (IN2P3-CC), Alexey (ATLAS), Lisa (FNAL), Tiju (RAL), Pepe (PIC), Onno (NL-T1), Rob (OSG), Ian (CMS), Salvatore (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • FTS error with proxy handling (fts305.cern.ch). A known bug seen previously. INC:423729 and new GGUS:98586 (more info in the older GGUS:97960 and the Thursday 17 Oct WLCG report)
    • T1
      • TAIWAN-LCG2: job (Savannah:140513) and transfer failures continue (GGUS:98482)
      • SARA-MATRIX: transfer failures ("Unable to connect to s35-07.grid.sara.nl:2811") (GGUS:98370); a basic connectivity check is sketched after this report
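
A minimal sketch (assumption: run from any host with outbound network access) of how one might check whether the GridFTP door named in GGUS:98370 accepts TCP connections at all, to help separate firewall/network problems from storage-level failures; this is an illustration, not an official debugging procedure:

    import socket

    def can_connect(host, port, timeout=10.0):
        """Return True if a TCP connection to host:port succeeds within timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print("%s:%d unreachable: %s" % (host, port, exc))
            return False

    if __name__ == "__main__":
        # Door and port taken from the error message quoted in GGUS:98370.
        print(can_connect("s35-07.grid.sara.nl", 2811))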

  • CMS reports (raw view) -
    • CMS is performing the Global Run in November (GRIN) this week. This is the first time since the start of LS1 that the detector will be read out as completely as possible. The Tier-0 will perform the core functions, but data export to a remote Tier-1 is not requested. The expected rate is 2 Hz of cosmic events.

  • ALICE -
    • CERN: SLC6 jobs typically have lower efficiency and higher failure rates than SLC5 jobs; to be investigated further
    • CVMFS "Input/output errors" seen during the weekend at least at CNAF, KIT and Torino
      • CNAF had a problem with a switch on Sat, which may be related

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/T1) and simulation (T2)
    • T1:
      • SARA: unstable after the end of the downtime
      • IN2P3: slow staging over the weekend (GGUS:98605)
    • Others:

Sites / Services round table:

  • ASGC: one disk server lost in DPM (140 TB of data lost). Contacting ATLAS with the list of files.
  • BNL: brief outage of the SRM on Saturday, promptly fixed.
  • FNAL: the T1 started the migration to SL6; it will be completed by the end of November.
  • RAL: tomorrow there will be a major intervention on the power system. All details are in GOCDB.
  • PIC: last Saturday the SRM service was unstable for 3 hours due to a high number of requests.
  • NL-T1: recovering from a file system problem behind the storage.
    • The data (approx. 3M files) is being moved off the unstable file system; 30 files look corrupted (a list will be provided).
    • In the process, some new pool nodes were installed; some of them did not have the firewall rules set properly and were therefore not accepting connections (fixed now).
    • Some tape pools have been removed but not recreated yet; consequently, some ATLAS pool groups were left "out of pools" (also fixed). More pools will be created to get back to the previous pool size.
    • As a consequence of the pool reshuffling, some staged data will probably need to be staged again, but preferably not now, as the system is still recovering; tomorrow would be better.
  • OSG: a failure in RSV-to-SAM reporting was observed on Saturday from 10:00 until 13:00. The issue recovered by itself and diagnostics are being done now.

  • CERN DSS: the CMS EOS instance will be upgraded on Nov 12 to the newest version. The CASTOR update is almost done: a transparent DB update job is running in the background until early next week. There is a plan to phase out the "root" protocol in CASTOR; the date has been fixed as Feb 1st 2014. The remaining users accessing data through the root protocol are now being identified and contacted.

Thursday

Attendance:

  • local: Simone (SCOD), Maarten (Alice), Raja (LHCb), Xavier (IT-DSS), Luca (IT-DSS), Ignacio (IT-PES), Jerome (CERN-PES), Zbyszek (IT-DB), Massimo (IT-DSS)
  • remote: Stefano (CMS), Ulf (NDGF), Felix (ASGC), Sang-Un (KISTI), Michael (BNL), Dennis (NL-T1), Alexei (ATLAS), Sonia (CNAF), Jeremy (GridPP), Lisa (FNAL), Gareth (RAL), Rolf (IN2P3-CC), Rob (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • Questions from Monday about the FTS error with proxy handling (fts305.cern.ch):
        • Would it be possible for the CERN FTS server managers to set up some check?
        • Why is it always fts305? This machine never seems to be in the DNS alias, yet the error message always comes with the same fts305 hostname. We are not sure this is correct; can someone check? (A possible check is sketched after this report.)
    • T1
      • TAIWAN-LCG2: job (Savannah:140513) and transfer failures continue (GGUS:98482)
      • SARA-MATRIX: transfer failures due to unavailable files, (GGUS:98370)
      • NDGF-T1: transfer failures to the site with error "GetSpaceMetaData for tokens [95226] failed: [SRM_FAILURE] The operation failed for all spaces" (GGUS:98216)
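
Regarding the fts305 question above, a minimal sketch of a possible DNS check: it resolves the load-balanced FTS alias and compares the returned addresses with those of fts305.cern.ch. The alias name "fts3.cern.ch" used below is only an assumption for illustration; substitute the alias actually used by ATLAS.

    import socket

    def resolve(name):
        """Return the set of IPv4 addresses a DNS name currently resolves to."""
        try:
            return set(info[4][0] for info in socket.getaddrinfo(name, None, socket.AF_INET))
        except socket.gaierror as exc:
            print("cannot resolve %s: %s" % (name, exc))
            return set()

    alias_ips = resolve("fts3.cern.ch")    # assumed load-balanced alias name
    host_ips = resolve("fts305.cern.ch")   # host reported in the error messages

    if host_ips and host_ips.issubset(alias_ips):
        print("fts305 is currently behind the alias")
    else:
        print("fts305 is NOT currently behind the alias")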

  • CMS reports (raw view) -
    • CMS is performing the Global Run in November (GRIN) this week. Nothing special happening.
    • Central Production and Analysis are running normally. The new problem of the month is the CVMFS cache getting stale/corrupted on some WNs; all sorts of odd things happen in that case (a basic cache check is sketched after this report).
    • From GGUS land: issues with CMSSW installations at CC-IN2P3, some releases missing but announced (bad), some present but not announced (not a big deal). GGUS:98581, GGUS:98575
      • NOTE: we do not use the announced release lists for job matching in glideinWMS, we will soon stop maintaining those lists where we install the software ourselves, and we hope every site will use CVMFS ASAP
      • IN2P3 for CMS: the CMS problem will be corrected by the new version of CVMFS. Installing the new client would imply a downtime also for the other, non-affected experiments, so as a temporary solution the CMS branch of CVMFS was remounted and this worked. Maarten: there are strong recommendations to install version 2.1.15 ASAP (the upgrade should be transparent, but unfortunately there is a 5% chance that the WN will need a reboot afterwards). IN2P3-CC: the new version will be installed later during a more general scheduled outage (like next week for the dCache upgrade).
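
A minimal sketch of the kind of worker-node check alluded to above (assumption: run as root on a WN with the standard CVMFS client installed). "cvmfs_config probe" and "cvmfs_config wipecache" are standard CVMFS client commands, but this is an illustration, not the CMS recovery procedure:

    import subprocess

    REPO = "cms.cern.ch"   # CMS software repository

    def probe(repo):
        """Return True if the repository mounts and answers the probe."""
        return subprocess.call(["cvmfs_config", "probe", repo]) == 0

    if not probe(REPO):
        # A stale/corrupted local cache is one possible cause of a failing probe;
        # wiping it forces the client to repopulate the cache from the servers.
        print("%s probe failed, wiping the local cache" % REPO)
        subprocess.call(["cvmfs_config", "wipecache"])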

  • ALICE -
    • CERN: CVMFS failures on at least 1 SLC5 WN (GGUS:98713)

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/1) and Simulation(T2)
    • T1:
      • SARA: Improved stability.
      • IN2P3: staging speed has improved (GGUS:98605)
        • IN2P3 for LHCb: the staging problem should be gone completely. The issue came from two different problems in the tape robotics that introduced timeouts in the storage system.
    • Others :
      • Problem with some Brazilian certificates at various sites. (e.g. GGUS:98570, GGUS:98571)
        • Effort was put into debugging the issue, but it is not fully understood yet. Work in progress.
      • Problem with new LFC nodes not accessible outside CERN (GGUS:98629). Fixed.
      • Various problems with Tier-2 sites: uploading outputs (GGUS:98594), redundant jobs (GGUS:98627), ...

Sites / Services round table:

  • NDGF: two issues have been reported concerning data transfer/access at NDGF; the causes were different:
    • On Wednesday the dCache head node was updated, but the intervention was registered in GOCDB for the wrong day, so the storage was down when not expected; the intervention also introduced a problem that was corrected on Thursday.
    • There is currently a transfer issue with SARA which started, suspiciously, at the time SARA switched its connectivity with NDGF from the OPN to LHCONE, so the problem should be investigated on the SARA side. Simone + Maarten: can we understand the reasons for this switch, and can we know whether it is supposed to be temporary or permanent? NL-T1 people will gather information and report at the next meeting.
  • NL-T1: the already reported problem with a file system was due to a disk array controller that failed and is out of warranty; it will take time to get the server replaced completely. Temporarily, the disks have been routed to a different controller already in use, so the system is currently running in degraded mode and the affected file systems have been set read-only as a precaution. DPM has been upgraded to version 1.8.7 (enabling WebDAV access).
  • RAL: there was an outage on Tuesday, as planned, for work on the UPS. Most services were brought back online by Tuesday evening (batch was back on Wednesday) and have been running OK since then.

  • CERN PES: on Monday a problem was found with access to the LFCs from outside CERN, thanks to the LHCb report. The alias definition in LANDB also needs modifications for the ATLAS LFC, but that is in the hands of the ATLAS DDM people; they have been contacted, no reply yet.
  • CERN DSS: degradation of the CASTOR namespace on Tuesday after the upgrade (solved after a few hours). The database job running since Monday has completed half of the work, as planned, so people are confident it will finish by Monday as expected.
  • CERN DSS: a CMS user script accidentally deleted a large number of EOS files owned by 28 CMS users. DSS and CMS are in contact to provide "best practices" for users to protect the namespace. A service incident report has been prepared.
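
As a purely illustrative sketch of one such "best practice" (this is not the CMS user script nor an agreed recommendation): a bulk-delete helper that only prints what would be removed unless explicitly confirmed. "eos rm" is the standard EOS CLI command; the example path is hypothetical.

    import subprocess
    import sys

    def bulk_delete(paths, really=False):
        """Print the files that would be deleted; only delete when really=True."""
        for path in paths:
            if really:
                subprocess.call(["eos", "rm", path])
            else:
                print("[dry run] would delete %s" % path)

    if __name__ == "__main__":
        targets = ["/eos/cms/store/user/someuser/old_output.root"]  # hypothetical path
        bulk_delete(targets, really=("--really" in sys.argv))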

AOB:
