Week of 150803

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Christoph Wissing (CMS), Jiri Chudoba (ATLAS), Ben Jones (Grid&Batch), Mark Slater (LHCb), Belinda Chan Kwok Cheong (Storage)
  • remote: Dima Kovalskyi (CMS), Dimitri (KIT), Alexander Verkooijen (NL-T1), Michael Ernst (BNL), Gerard Bernabeu (FNAL), Gareth Smith (RAL), Antonio Falabella (CNAF), Sebastian Gadrat (IN2P3)

Experiments round table:

  • ATLAS reports (raw view) -
    • RAL network outage for 1 hour (affected mainly squid availability)
    • Monitoring shows the wrong status for Squid (the service is up, but monitoring reports it as down)

The discussion on how to improve the monitoring was taken offline between ATLAS and Ops Coordination.

  • CMS reports (raw view) -
    • Nothing to report.

  • ALICE -
    • high activity
    • CERN: much of the raw data currently being processed could not be accessed from CASTOR during the weekend.
      • The disk pool was under heavy load.
      • Some mitigation was applied by the CASTOR team.
      • A fix is scheduled for Tue Aug 4.

  • LHCb reports (raw view) -
    • Data Processing: data "stripping" productions ongoing, along with user and MC jobs.
    • T0
      • Discussion with the LSF team about wall and CPU time queue lengths is ongoing (GGUS:115027). Some progress has been made and bugs were found on both sides. Deployment of fixes is in progress.
    • T1
      • RAL: LAN network instabilities continue, CVMFS and data access problems over the weekend.

Sites / Services round table:

  • ASGC: NA
  • BNL: Very high tape-migration activity over the weekend: e.g. on Saturday a data volume of 80 TB (70k files) was moved to tape within 24 hours, the highest activity registered for tape storage so far.
  • CNAF: NTR
  • FNAL: There will be a downtime on Wednesday 5th August from 8h to 17h, during which all storage services will be rebooted.
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NA
  • KIT: A major downtime is planned from 30th September to 7th October, during which the whole site will be offline. GOCDB details will be given soon.
  • NDGF: Two upcoming warning downtimes for srm.ndgf.org, which affect the availability of some ALICE and ATLAS data.
  • NL-T1: Scheduled downtime tomorrow afternoon for the tape backend that will last 24h.
  • NRC-KI: NA
  • OSG: GGUS:115413 was opened against a site but is in fact related to the global Xrootd redirector. Maria will follow up after the meeting.
  • PIC: NA
  • RAL: An outage is scheduled tomorrow to check the routers that caused the network problems affecting RAL over the weekend. The T1 will be unavailable for 6h. Details in GOCDB.
  • TRIUMF: NA
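
As a quick sanity check on the BNL tape-migration figure above, the reported 80 TB / 70k files in 24 hours can be turned into average rates. This is a sketch only; it assumes decimal units (1 TB = 10^12 bytes), which the minutes do not specify.

```python
# Rough sanity check of the BNL tape-migration rate reported above
# (80 TB / 70k files in 24 hours).
# Assumption: decimal units, 1 TB = 1e12 bytes (not stated in the minutes).
total_bytes = 80e12
n_files = 70_000
seconds = 24 * 3600

throughput_gb_s = total_bytes / seconds / 1e9   # average GB/s to tape
avg_file_gb = total_bytes / n_files / 1e9       # average file size in GB

print(f"average throughput: {throughput_gb_s:.2f} GB/s")
print(f"average file size:  {avg_file_gb:.2f} GB")
```

This works out to roughly 0.9 GB/s sustained to tape with files averaging a little over 1 GB each.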

  • CERN batch and grid services: NTR
  • CERN storage services: A set of transparent CASTOR upgrades is taking place this week.
  • Databases: NA
  • GGUS: NA
  • Grid Monitoring:
    • Draft availability reports for July have been sent and are available in the SAM3 UI.
  • MW Officer: NA

AOB: None

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Christoph Wissing (CMS), Jiri Chudoba (ATLAS), Iain Bradford Steers (Grid&Batch), Mark Slater (LHCb), Belinda Chan Kwok Cheong (Storage), Asa and Cheng-Hsi (ASGC)
  • remote: Dennis van Dook (NL-T1), Michael Ernst (BNL), Gerard Bernabeu (FNAL), John Kelly (RAL), Sebastian Gadrat (IN2P3), Kyle Gross (OSG), Sang Uhn (KISTI), Peter Urkedal (NDGF), Di Qing (TRIUMF)

Experiments round table:

  • CMS reports (raw view) -
    • High activity due to a new wave of MC requests
      • More than 100k parallel jobs in the system
      • No major issues
    • Follow-up on the xrootd redirector issue from Monday (GGUS:115413, still needs to be updated)
    • Last weekend the US regional redirector(s) ran into trouble
    • On Monday the global CMS redirector at CERN ran into trouble (GGUS:115437)
      • One host (behind the alias) has already been upgraded to a more recent xrootd version
  • ALICE -
    • high activity
    • CERN: thousands of files are still persistently inaccessible from CASTOR
      • reported as 'STAGED', but inaccessible from the WN or interactively (via xrootd or rfcp)
      • many retries were done over several days, also after the interventions earlier this week
      • the CASTOR team is aware

  • LHCb reports (raw view) -
    • Data Processing: data "stripping" productions ongoing, along with user and MC jobs.
    • T0
      • Discussion with the LSF team about wall and CPU time queue lengths is ongoing (GGUS:115027). Some progress has been made and bugs were found on both sides. Deployment of fixes is in progress.
    • T1
      • RAL: All network problems appear to be solved.

Sites / Services round table:

  • ASGC: Scheduled downtime on Monday 10th August affecting several services. For more details, please check GOCDB
  • BNL: NTR
  • CNAF: NA
  • FNAL: NTR
  • GridPP: NA
  • IN2P3: NTR
  • JINR: NA
  • KISTI: NTR
  • KIT: NA
  • NDGF: NTR
  • NL-T1: NTR
  • NRC-KI: NA
  • OSG: Follow-up on the ticket mentioned on Monday (GGUS:115413): things seem to be better.
  • PIC: NA
  • RAL: The network intervention finished successfully and there are now redundant routers.
  • TRIUMF: NTR

  • CERN batch and grid services: CEs were down today due to an intervention on the LSF filers that took longer than expected. Submission is back online and everything is working as expected. More details in the SSB.
  • CERN storage services:
    • Upgrade of the CMS AAA redirectors to XRootD 4.2.2: the 2nd node will be upgraded next Tuesday; so far we have not experienced any lockup of the cmsd on the upgraded node.
  • Databases: NA
  • GGUS: NA
  • Grid Monitoring:
  • MW Officer: NA

AOB: None

Topic revision: r12 - 2015-08-11 - unknown
 