Week of 130128

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance: local(Belinda, Eddie, Jerome, Luc, Luca M, Maarten, Maria D);remote(Ian, Lisa, Matteo, Michael, Onno, Pepe, Rob, Tiju, Ulf, Vladimir, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • Acrontab service not working. Alarm GGUS:90928
      • T0 reconstruction tasks crashing due to low memory. Elog:42411. To be rerun in 64-bit mode
    • Tier1s
      • FZK & CNAF DATADISKS full. Taken out of T0 export
      • TRIUMF backlog of functional test transfers from T0. Very high data activity (300 MB/s). Link saturation?
        • Maarten: open a ticket?
      • FTS3 deployed in all UK (except QMUL)
        • Maarten: correction - all UK sites except QMUL are participating in functional tests through FTS3; RAL is the only UK site with FTS3 deployed; QMUL cannot participate yet due to a bug in StoRM
    • Tier2s
      • NTR

  • CMS reports -
    • LHC / CMS
      • Running
    • CERN / central services and T0
      • Reasonably quiet weekend
    • Tier-1:
    • Tier-2:
      • NTR
    • Luca: green light for the transparent part of the EOS upgrade?
    • Ian: OK, the non-transparent part is to be done on Feb 14

  • LHCb reports -
    • Activity continues as last week. Very intense activity: reprocessing, prompt processing, MC and user jobs. We are hitting our limit on the number of running jobs.
    • T0:
    • T1:

Sites / Services round table:

  • ASGC
    • network maintenance on the 10 Gbit link tomorrow, 2.5 Gbit backup link will be used during that time
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF
    • one tape system has become full for ALICE; a second tape system at that site is expected to come online in 2 weeks; furthermore a new site supporting ALICE should join around the same time
  • NLT1
    • on Sat there was a power surge in one of the SARA data centers, affecting one of the two tape libraries: 9 drives broke and have meanwhile been replaced; during the incident some tape accesses may have been slow
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW
    • Maria will check if the SNOW team already applied the patch to fix the problem seen with last week's alarm tests, viz. the creation of real instead of test incidents in SNOW, which upsets the alarm ticket statistics calculations
  • grid services - ntr
  • storage - ntr

AOB:

Tuesday

Attendance: local(Eddie, Ian, Ignacio, Luc, Luca M, Maarten, Maria D, Ulrich);remote(Federico, Jeremy, Lisa, Matteo, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Tiju, Ulf, Wei-Jen, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • FZK & CNAF DATADISKS full. Still out of T0 export
      • TRIUMF backlog of functional test transfers from T0. A priori no saturation of the 5 Gb/s link
      • All UK sites (except QMUL) participating in FTS3 tests
      • TW FTS hw failure. Elog:42514
    • Tier2s
      • T2 starvation in the DE cloud. Big dataset (5k files) stuck due to a Site Services machine bug. Fix applied to all Site Services (SS)

  • CMS reports -
    • LHC / CMS
      • Running. The total data volume expected from the HI run is in line with our estimates: Raw + Reco should be under 500 TB collected and distributed. Event size and reco time are both a little better than initially expected.
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • A black-hole node was reported at KIT. GGUS ticket issued.
    • Tier-2:
      • NTR

  • LHCb reports -
    • Activity continues as last week. NTR

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • at-risk downtime for tape maintenance had to be prolonged by 4h but completed OK, hopefully bringing some performance/stability improvements
  • NDGF - ntr
  • NLT1 - ntr
  • OSG
    • as of tomorrow, OSG users will be recommended to request new certificates from the new DigiCert CA instead of the DOEGrids CA, which is being phased out for new certificates in the course of this year
      • Maria: I have adjusted the LCG catch-all CA procedures and updated the docs, will let Rob proofread them before dissemination
    • Ian: GGUS currently seems to be having a problem with the DOEGrids CRL; will follow up offline
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr

  • GGUS/SNOW
    • Tests will start now as agreed yesterday to make sure the SNOW patch on "Ticket Category: Test" is working properly.
    • There will be a meeting on 2013/02/05 at 2:30pm CET about alternatives to personal-certificate authentication for GGUS and possibly also GOCDB. The WLCG position is summarised in Savannah:132872#comment6 . Draft agenda: https://indico.cern.ch/conferenceDisplay.py?confId=229577 . The room is 28-1-025 (~12 people) but there is an audioconf. If we need to change the room, please tell MariaD a.s.a.p.
  • grid services
    • GSI problems holding up the LFC migration to EMI-2 have been fixed: the Globus GSI libraries need to be preloaded so that their symbols, rather than the corresponding Oracle symbols, take precedence (see the sketch after this list). One EMI-2 front-end node has already been included for ATLAS; the other nodes will be replaced similarly one by one over the coming days, also for LHCb and the central instance (e.g. used by OPS). ATLAS and LHCb will be informed of the progress.
  • storage - ntr
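
A minimal sketch of the preload fix mentioned under grid services, assuming a wrapper script is used to set LD_PRELOAD before starting the daemon; the library and daemon paths are hypothetical illustrations, not the actual CERN configuration:

    #!/usr/bin/env python
    # Hedged sketch: start a daemon with selected libraries preloaded so that
    # their symbols take precedence over same-named symbols pulled in from
    # other shared libraries (here: Globus GSI vs. the Oracle client).
    # All paths below are illustrative only.
    import os

    GSI_LIBS = [
        "/usr/lib64/libglobus_gssapi_gsi.so.4",    # hypothetical path
        "/usr/lib64/libglobus_gsi_callback.so.0",  # hypothetical path
    ]

    env = dict(os.environ)
    preload = ":".join(GSI_LIBS)
    if env.get("LD_PRELOAD"):
        preload += ":" + env["LD_PRELOAD"]  # keep any existing preloads
    env["LD_PRELOAD"] = preload

    # Replace this process with the daemon, inheriting the adjusted environment.
    os.execvpe("/usr/sbin/lfcdaemon", ["/usr/sbin/lfcdaemon"], env)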

AOB:

Wednesday

Attendance: local(Alex, Belinda, Luc, Luca C, Luca M, Maarten);remote(Alexander, Federico, Ian, Lisa, Pavel, Rob, Rolf, Saverio, Ulf, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • FZK & CNAF DATADISKS full. Cleaning (unmerged HITS) ongoing. Still out of T0 export
    • Tier2s
      • NTR

  • ALICE reports -
    • KIT: ALICE will increase by 500 the number of concurrent (analysis) jobs, currently throttled at 2k; KIT will check if the load on their firewall remains sustainable

  • LHCb reports -
    • Nothing new to report

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • At SARA we had a Warning downtime this morning because we had to reboot our tape management servers. We did this in order to solve a problem with our tape backend (for some reason tape access has become slow over the last few days), but unfortunately it didn't solve the problem.
  • OSG
    • switching from DOEGrids to DigiCert CA for new certificates as of today

  • databases - ntr
  • grid services - ntr
  • storage - ntr

AOB:

  • many remote people got their Audioconf connections cut multiple times and some gave up eventually; ticket INC:230369 was opened for the experts:
    • "There was a general problem on Cern phone lines. The technicians fixed it now."

Thursday

Attendance: local(Alex B, Alex L, Eva, Luc, Maarten);remote(Gareth, Kyle, Matteo, Pepe, Rolf, Ronald, Saverio, Ulf, Wei-Jen, WooJin).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • FZK & CNAF DATADISKS full. Cleaning ongoing. Still out of T0 export
      • PIC & SARA DATADISK getting full.
      • RAL Castor instance problem fixed (disk server failure)
      • QMUL to RAL transfers failing GGUS:91029. Suspect RAL FTS agent.
        • Gareth: this appears to have started before the CASTOR issue, so would be unrelated
    • Tier2s
      • NTR

  • ALICE reports -
    • KIT: the number of concurrent analysis jobs was further increased to 2750.

  • LHCb reports -
    • Nothing new to report

Sites / Services round table:

  • ASGC
    • CMS are affected by a transfer problem from IN2P3 to CASTOR at ASGC; looking into it
  • CNAF
    • CREAM ce08 will be down from tomorrow noon to Feb 7 for an upgrade to EMI-2 on SL6
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF
    • last night there was a break in the OPN connection to Finland: a planned maintenance took 3h15m instead of the expected 20 min; only ALICE may have been affected, but no inquiries were received
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - nta

  • dashboards - ntr
  • databases
    • during the next 2 weeks the security and recommended Oracle patches will be deployed on the integration databases
  • grid services - ntr

AOB:

Friday

Attendance: local(Alessandro, Alex L, Eddie, Eva, Ian, Ignacio, Luca M, Maarten);remote(Federico, John, Kyle, Lisa, Onno, Roger, Rolf, Saverio, Wei-Jen, Xavier).

Experiments round table:

  • ATLAS reports -
    • Central services
      • Panda saw an increase in "holding" jobs (by a factor of ~5) starting around 15:00 CET. The Panda SLS plots suggest the problem was related to the LFC+DQ2 registration time, which could be due to LFC or DQ2 slowness. Another Panda SLS plot reports the DQ2 registration time alone, which seems to be the culprit. Experts are investigating. Since a new EMI-2 LFC node was added to the ATLAS production LFC only yesterday afternoon, to avoid issues we asked the LFC service manager to remove the new EMI node from the alias for now, even though the LFC performance looks OK in the SLS monitoring (ATLAS LFC); a small sketch of checking the alias membership follows this report.
        • Alessandro: the DBAs did not see any issues in the DB usage by the LFC service
        • Ignacio: the ATLAS LFC service now has 3 EMI-2 and 4 gLite 3.2 nodes; the first EMI-2 node was introduced on Tue; next week the upgrade will be resumed if no issues are observed
    • Tier0/Tier1s
      • QMUL to RAL transfers failing (GGUS:91029, reported yesterday): RAL answered that they had a CASTOR problem, which has since been fixed.
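
A hedged sketch of checking which nodes a DNS alias currently resolves to (e.g. before and after a node is taken out), using only standard DNS lookups; the alias name below is an illustrative placeholder, not necessarily the real ATLAS LFC alias:

    # Sketch: list the hosts currently behind a DNS round-robin alias.
    # "prod-lfc-atlas.cern.ch" is used only as an illustrative alias name.
    import socket

    alias = "prod-lfc-atlas.cern.ch"
    name, _aliases, addresses = socket.gethostbyname_ex(alias)
    print("canonical name:", name)
    for ip in addresses:
        # Reverse-resolve each address to see the individual node names.
        try:
            print(ip, "->", socket.gethostbyaddr(ip)[0])
        except socket.herror:
            print(ip, "-> (no reverse DNS entry)")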

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • GGUS ticket issued for a glexec issue at a subset of WNs at RAL (GGUS:91060)
    • Tier-2:
      • Suspected File Corruption at T2_US_Purdue (SAV:135604)

  • ALICE reports -
    • KIT: another 500 slots were just added to the number of concurrent analysis jobs. Experts at KIT will keep monitoring their firewall load and alert ALICE when the traffic for ALICE jobs is getting unsustainable.
      • Xavier: the WN routing configuration was wrong in 2 racks, causing their Xrootd traffic to go through the firewall unnecessarily; fixed
      • Maarten: very good news indeed, thanks! We will see if this allows us to add yet more job slots before the weekend
        • this is indeed happening

  • LHCb reports -
    • Nothing new to report, just a few tickets for pilots aborting at Tier-2s. The new LFC is OK.

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - nta
  • NDGF - ntr
  • NLT1
    • early in the night 2 dCache pool nodes at SARA failed; they were restarted at 07:30 CET, their files having been unavailable for ~7h
  • OSG - ntr
  • RAL
    • will follow up on the open tickets

  • dashboards - ntr
  • databases - ntr
  • grid services
    • all CERN LFC services (ATLAS, LHCb, shared incl. OPS) currently have a mix of EMI-2 and gLite 3.2 nodes; the upgrades will continue next week if no issues are observed in the meantime
  • storage - ntr

AOB:
