Week of 140414

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Felix (ASGC), Stefan (SCOD), Maarten (ALICE), Robert (Grid Monitoring), Luca (Storage), MariaA (WLCG), Zbyszek (Databases)
  • remote: Rob (OSG), Sang-Un (KISTI), Onno (NL-T1), Lucia (CNAF), Lisa (FNAL), Tiju (RAL), Rolf (IN2P3), Raja (LHCb), Michael (BNL)

Experiments round table:

  • ATLAS
    • Central services
      • Frontier servers at RAL and IN2P3-CC are in trouble; BNL also shows high load. Most probably this is due to Overlay jobs, which require 230MB instead of the standard 70MB, and fetching this takes up to one hour. That may then lead to an avalanche effect. We decided to hold the task r5475 and kill all the running jobs.
      • AGIS servers in trouble. It turns out that the underlying HV is in trouble. Not clear who or what is causing the problems (HW issues?).
    • T0/1s
      • RRC-KI-T1 logs project mc09, mc09_10TeV and mc09_14TeV. It is ~95TB and ~1.4M files. As of today, Monday morning: 453 finished, 428 still ongoing.

  • ALICE -
    • KIT: continued investigations of job behavior
      • many jobs found ignoring the local SE for reading!
        • causing overload of firewall and OPN at times
        • being debugged
      • job cap is being kept low (1k) for the time being

  • LHCb
    • Stripping, MC simulation, Working Group Production and User jobs.
    • T1:
      • IN2P3 : Downtime this morning for two CEs to upgrade OpenSSL. Not much effect on LHCb.
      • GridKa : Issue of aborted pilots (GGUS:103319) seems to have gone away. Problem of uploading by jobs (GGUS:102920) seems to continue.
    • Others : LFC very slow now. Lost x2 in speed since Oracle upgrade. Now response time ~400ms (was 100ms previously) (INC:0534625)

Sites / Services round table:

  • ASGC: instabilities over the week-end on the tape system, currently under investigation with the vendor
  • OSG: NTR
  • KISTI: NTR
  • NL-T1: NTR
  • CNAF: NTR
  • FNAL: NTR
  • RAL: NTR
  • IN2P3: NTR
  • BNL: NTR
  • Grid Monitoring: Problem at beginning of last week, ce205 test failing for LHCb with “database error”
  • Storage: NTR
  • Databases: Two databases, COMPASS ONLINE and PDBR (ALICE potential user), moved to new HW in the barn and to Oracle 12c. This evening (18.00-22.00) there is an intervention on the central DBs CERNDB1/EDMSDB/ACCDB, which store e.g. information for access to buildings.

AOB:

Thursday

Attendance:

  • local: Felix (ASGC), Marcin (DB), Maarten (ALICE), Stefan (SCOD)
  • remote: Eygene (RRC-KI-T1), Xavier (KIT), Sang-Un (KISTI), Rolf (IN2P3), Sonia (CNAF), Raja (LHCb), Rob (OSG), Lisa (FNAL), Michael (BNL), John (RAL), Ale (ATLAS)
  • apologies: Oli (CMS)

Experiments round table:

  • ATLAS
    • Central services
      • CERN-PROD alarm ticket GGUS:104757 atlas.web.cern.ch down.
        • Ale: the GGUS ticket was routed through SNOW, but the entries in GGUS and SNOW differ quite a lot, which could hint at a problem with the handling of the ticket. Will be followed up by Maarten.
    • T0/1s

  • CMS
    • NTR, overview in today's WLCG operations planning meeting
    • Apologies, I cannot participate in today's call because of meeting conflicts.

  • ALICE -
    • KIT job behavior:
      • The cause of the overload of the KIT firewall and the OPN link to CERN has been found:
        • for analysis jobs the location of the WN was not propagated to the central services
        • the client then ended up using not only the local replica, but close replicas as well
      • Over the weekend a patch was developed for TAlienFile in ROOT
        • it now sends the location information explicitly
      • The patch first became available in Tuesday's analysis tag and started getting used by a few trains and users
        • the results were very good
      • The expectation is for the vast majority of analysis jobs to be using the patched code after a few days
        • this will be monitored
      • The job cap should then get increased again gradually over the coming days
      • Our thanks to the teams at KIT for their efforts, and to the other experiments for their patience!
        • Raja: For LHCb this was only observed in windows of 2-3 hours every morning. Maarten: Maybe not all problems are solved then yet.
        • Ale: question to ALICE: the monitoring issue is still not clear; the WLCG transfer dashboard should display the numbers of all experiments correctly. Maarten: are all transfers included? Ale: for ATLAS yes; we should take this offline to another meeting.
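
As a rough illustration of the KIT replica issue reported by ALICE above, the sketch below shows why a missing client location prevents the local replica from being preferred. It is not the actual AliEn or TAlienFile logic: the `Replica` class, the `order_replicas` function and the site and SE names are all hypothetical, chosen only to show the effect of sending the location explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    se_name: str  # hypothetical storage element name
    site: str     # site hosting that storage element


def order_replicas(replicas: List[Replica], client_site: Optional[str]) -> List[Replica]:
    """Return the replicas with those at the client's site first.

    If client_site is None (location not propagated), no replica can be
    recognised as local, so the original order is returned unchanged.
    """
    if client_site is None:
        return list(replicas)
    # sorted() is stable: non-local replicas keep their relative order.
    return sorted(replicas, key=lambda r: r.site != client_site)


# Hypothetical catalogue entry with three copies of one file.
replicas = [
    Replica("SE_A", "SiteA"),
    Replica("SE_KIT", "KIT"),
    Replica("SE_B", "SiteB"),
]

without_location = order_replicas(replicas, None)   # location not sent
with_location = order_replicas(replicas, "KIT")     # location sent explicitly

print([r.se_name for r in without_location])
print([r.se_name for r in with_location])
```

In this toy model a job at KIT that does not report its location may be handed a remote copy first, which is the kind of behavior that would put load on the firewall and OPN link; with the location supplied, the local SE is tried first.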

  • LHCb
    • Stripping, MC simulation, Working Group Production and User jobs.
    • T1:
      • IN2P3 : Last few files still staging for incremental re-stripping.
      • GridKa : Problem of uploading by jobs (GGUS:102920) possibly understood to be caused by the ALICE bug overloading the site SE/network. Will wait until next week to see if the problem is resolved.
    • Others : LFC very slow now. Lost x2 in speed since Oracle upgrade. Now response time ~400ms (was 100ms previously) (INC:0534625). Not clear from the ticket who is following this up.
      • Marcin: will assign LFC ticket back to the LFC operations team

Sites / Services round table:

  • KIT: Firewall update on the 28th at 5:00 UTC for one hour. GGUS will be in downtime.
  • RRC-KI-T1: new peering with ESnet over LHCONE has been active since April 15th. Going to activate LHCOPN peering (transit) with CERN this week or at the beginning of next week.
  • ASGC: Tape library recovered yesterday after vendor intervention.
  • KIT: Broken tape library, re-directed to other libraries, data is 3-4 years old. All VOs affected. - Fixed at the end of the meeting
  • KISTI: NTR
  • IN2P3: NTR
  • CNAF: NTR
  • OSG: NTR
  • FNAL: NTR
  • BNL: NTR
  • RAL: NTR
  • NL-T1: NTR
  • Databases: the CERNDB1 intervention on Monday (upgrade to 11.2.0.4) was successful

AOB:

  • ATTENTION: next meeting on Tuesday April 22 !
  • Have a good Easter break !

-- SimoneCampana - 20 Feb 2014

Topic attachments

  • MB-Apr.pptx (r1, 2864.9 K, 2014-04-15 09:27, PabloSaiz)
Topic revision: r14 - 2014-04-17 - MaartenLitmaath