Week of 150126

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Luca (SCOD+storage), Maarten (ALICE), Ben (grid services)
  • remote: Andrej (ATLAS), Mark (LHCb), Di (TRIUMF), Felix (ASGC), Alexander (NLT1), Lisa (FNAL), Rolf (IN2P3), Tiju (RAL), Antonio (CNAF), Sang Un Ahn (KISTI), Ulf (NDGF), Rob (OSG)

Experiments round table:

  • ATLAS (Andrej) reports (raw view) -
    • lack of running jobs due to small issues in PANDA and JEDI
    • some issue with misconfigured jobs accessing ATLAS web instead of CVMFS

  • ALICE (Maarten) -
    • continued high activity
      • new record today: 75k concurrent jobs!
    • data loss at SARA (NLT1)
      • 108k ALICE files (~8 TB) were lost
      • catalog cleanup and re-replication have finished
      • 11804 files had a single replica there, none of which were very important anymore
        • mainly logs and intermediate QA results
      • SARA:
        • 1 week needed for recovering the hardware affected (less available storage)
        • there is no possibility of recovering any of the files that were present on the failed hardware

  • LHCb (Mark) reports (raw view) -
    • "Legacy Run1 Stripping" waiting on FTS transfers and recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: NTR
  • BNL:
  • CNAF: NTR
  • FNAL: NTR
  • GridPP:
  • IN2P3: NTR
  • JINR:
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI:
  • OSG: NTR
  • PIC:
  • RAL: NTR
  • TRIUMF:
    • 4 hours of scheduled downtime for power outage Thu 29th January from 1:00am to 5:00am CET

  • CERN batch and grid services: NTR
  • CERN storage services:
    • Tue 27th January, 10:00am to 11:00am CET - CASTOR transparent update of the nameserver from 10
    • Tue 27th January, 9:30am to 12:00am CET - EOSATLAS upgrade to EOS aquamarine
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Luca (SCOD+storage), Maarten (ALICE), Manuel (grid services), Mark (LHCb), Pablo (GGUS)
  • remote: Andrej (ATLAS), Christoph (CMS), Felix (ASGC), Dennis (NLT1), Lisa (FNAL), Michael (BNL), Rolf (IN2P3), Thomas (KIT), Salvatore (CNAF), Jeremy (GridPP), Gareth (RAL), Rob (OSG)

Experiments round table:

  • ATLAS (Andrej) reports (raw view) -
    • issue with FTS3 callbacks not working properly, already in contact with FTS developers
    • issue due to CONDOR update of the pilot factories (missing shared library)
    • Reminder: next week is ATLAS Software and Computing week. Most probably, no one from ATLAS will be able to attend the daily meetings.

  • CMS (Christoph) -
    • Main activity: Digi-Reco of upgrade (2020+) MC at Tier-1s
    • Next round of tape staging tests at Tier-1s started today
    • Consistency checks (Site inventory vs. CMS catalogue) at all Tier-1s ongoing

  • ALICE (Maarten) -
    • continued high activity

  • LHCb (Mark) reports (raw view) -
    • "Legacy Run1 Stripping" waiting on FTS transfers and recovery from lost files. Large MC campaign upcoming + User jobs but not much happening right now.
    • T0: NTR
    • T1: Status of the incident report for the lost files at SARA

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: ongoing transparent upgrade of the Linux kernel and glibc
  • FNAL: found an issue in the latest GGUS release: a message was sent to a different mail address (see Pablo's explanation)
  • GridPP: NTR
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT: NTR
  • NDGF:
  • NL-T1: NTR
  • NRC-KI:
  • OSG: NTR
  • PIC:
  • RAL:
    • on Monday morning deploy hotfix for CASTOR
    • on Tuesday switch to the network backup link for OPN
  • TRIUMF:

  • CERN batch and grid services:
    • Reminder about VOMRS changes. A testing instance is available and experiments are welcome to test and provide feedback (see GGUS:110227)
    • Reminder about closure of AFS UI on Monday 2nd Feb (more details here)
  • CERN storage services: NTR
  • Databases:
  • GGUS:
  • Grid Monitoring: NTR
  • MW Officer:

AOB:

-- AndreaSciaba - 2014-12-16

Topic revision: r15 - 2015-01-30 - LucaMascetti
 