Week of 150817

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: AndreaV/SCOD, Michal/ATLAS, LucaM/Storage, Jesus/Storage, Alberto/Grid, Andrei/Databases
  • remote: LucaT/LHCb, Michael/BNL, Lisa/FNAL, Jeremy/GridPP, Sang-Un/KISTI, Christian/NDGF, Onno/NLT1, Pepe/PIC, John/RAL

Experiments round table:

  • ATLAS reports -
    • BNL - transfers to LOCALGROUPDISK were failing because of insufficient space. BNL added space
    • RAL - T0 export to DATATAPE is failing with timeouts
    • CERN - T0 export from TZDISK is failing with "could not open connection to srm-atlas.cern.ch"

  • ALICE -
    • High activity.
    • CERN: reco jobs at times still unable to read raw data files from CASTOR.
      • The CASTOR experts are aware and applied a temporary cure on Sun, thanks!
      • To be followed up.

  • LHCb reports -
    • Data processing: validation of 25ns data has started, ~70% already processed
    • No issues from Tier0/1

Sites / Services round table:

  • ASGC: -
  • BNL/Michael: ntr
  • CNAF: -
  • FNAL/Lisa: ntr
  • GridPP/Jeremy: ntr
  • IN2P3: ntr
  • JINR: -
  • KISTI/Sang-Un: ntr
  • KIT: -
  • NDGF/Christian: there will be a downtime on Wednesday, data will be unavailable in the evening
  • NL-T1/Onno:
    • RAID issue on a dcache pool node, fixed this morning but still need to investigate (had the same problem 5 weeks ago), will send it back to the vendor after moving the data (migration should be complete on Wednesday)
    • following up another issue with dcache that was disabling pools: fixed today one of the two reasons for this and greatly reduced the issue (fixed a bug in the mass storage script); the other reason is a timeout in the Berkeley DB, for which a workaround exists (moving the DB to a different block device)
  • NRC-KI: -
  • OSG: ntr
  • PIC/Pepe: ntr
  • RAL/John: at risk tomorrow for one of the site routers
  • TRIUMF: -

  • CERN batch and grid services / Alberto:
    • New CE based on HTCondor CE (ce503) is now broadcasting in the BDII.
  • CERN storage services / LucaM:
    • following up the issues reported by ALICE and by ATLAS
    • will do an emergency change in CASTOR tomorrow for CMS to allow third party copy (had an issue last week)
  • Databases / Andrei : ntr
  • GGUS: -
  • Grid Monitoring:
    • Availability and Reliability reports for July sent to the WLCG office
  • MW Officer: -

AOB: ntr

Thursday

Attendance:

  • local: AndreaV/SCOD, Michal/ATLAS, LucaT/LHCb, Christoph/CMS, Alberto/Grid, LucaC/Databases, Jesus/Storage
  • remote: Dennis/NLT1, Christian/NDGF, John/RAL, Sang-Un/KISTI, DiQing/TRIUMF, Jeremy/GridPP, Pepe/PIC, Rolf/IN2P3, Thomas/KIT, Rob/OSG

Experiments round table:

  • ATLAS reports -
    • CERN - failures due to expired credentials - host certificates on the two machines were updated
    • CERN - "Unable to issue PrepareToGet request to Castor"
      • ATLAS is replicating files which are in Taiwan and CERN but transfers from Taiwan are failing
      • to improve the situation, smaller BringOnLine requests should be used - can be done in next FTS release
    • SARA - no pilots because of DNS issue
    • PIC - no pilots at HIMEM

  • CMS reports -
    • Holiday times - not much computing activity
    • Transfer problem from CASTOR to (CERN) EOS
      • Issues with third-party transfers
      • Needed a fix on CASTOR
      • Details: GGUS:115598
    • Outages of DAS (Data Aggregation Service) last week
      • The service got overloaded and caused further problems for other CMS web services
      • Some tools were changed to query the underlying services directly (bypassing the Aggregation Service)

  • ALICE -
    • High activity.
    • NIKHEF: CVMFS was stuck after an update from CentOS 6.6 to 6.7 (GGUS:115783)

  • LHCb reports -
    • Data Processing: second run of validation for 25ns data started with new calibration, ~90% already processed. New stripping production will start later today.
    • T0
      • Nothing to report
    • T1
      • Nothing to report
      • CVMFS problem at NIKHEF-ELPROD (GGUS:115767) due to an upgrade of their WNs from CentOS 6.6 to CentOS 6.7. Promptly solved.
        • [LucaT: is the CVMFS issue understood? Is there a risk that other sites see the same issue if they upgrade? Dennis: this was supposed to be a standard upgrade on WNs and VO boxes, but cvmfs got stuck and we had to do an autofs restart; it's not the first time that this issue has happened. AndreaV: please report to the cvmfs team if not already done. Dennis: ok, will do.]
      • PBS issues at PIC (GGUS:115748) due to high load on a few WNs: the PBS server had problems refreshing information and denied some connections. Promptly solved.

Sites / Services round table:

  • ASGC: -
  • BNL: -
  • CNAF: -
  • FNAL: -
  • GridPP/Jeremy: ntr
  • IN2P3/Rolf: ntr
  • JINR: -
  • KISTI/Sang-Un: ntr
  • KIT/Thomas: ntr
  • NDGF/Christian: ntr
  • NL-T1/Dennis: nta
  • NRC-KI: -
  • OSG/Rob: there was a mail problem at FNAL yesterday, if anyone sent tickets to OSG these have not reached their destination so please resend them
  • PIC: -
  • RAL/John: ntr
  • TRIUMF/DiQing: ntr

  • CERN batch and grid services/Alberto: ntr
  • CERN storage services/Jesus: ntr
  • Databases/LucaC: ntr
  • GGUS: -
  • Grid Monitoring: -
  • MW Officer: -

AOB: -

Topic attachments
  • MB-Aug-15.pptx (r1, 2868.7 K), attached 2015-08-17 14:10 by PabloSaiz
Topic revision: r17 - 2015-08-20 - AndreaValassi
 