Week of 150413

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Alessandro (ATLAS), Belinda (storage), Lorena (databases), Maarten (SCOD + ALICE), Raja (LHCb)
  • remote: Alexey (PIC), Di (TRIUMF), Dimitri (KIT), Felix (ASGC), John (RAL), Lisa (FNAL), Onno (NLT1), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Nothing urgent to report: a couple of Squids were down at 2 Tier-1s; GGUS tickets were opened and the services were restarted.
      • the DDM load has not yet been moved back onto the FTS at RAL; this will be done in the next hours/days.
        • John: various experts will be occupied at CHEP
        • Alessandro: we are in communication via e-mail and this week would be OK for them; we will follow up offline

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • User and MC jobs are running in the system. The failure rate of the jobs is very low.
    • CERN & Tier-1s running fine
      • Problem of failing LHCb jobs solved promptly by RAL admins last Friday evening
    • FTS transfer problems to two Tier-2 sites - Grif (GGUS:112985), Nipne (GGUS:112044)
      • Alessandro: are the issues due to the FTS? if so, we would be interested
      • Raja: one looks due to the network, the other due to the local SE
    • VOMS-Admin issues waiting for a solution from the VOMS admins: GGUS:112282 and GGUS:112279
      • Maarten: the first ticket is waiting for a new release, I will "ping" the second

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • batch system outage for upgrade: draining starting on Sun Apr 26, remaining jobs killed Mon morning Apr 27, service back at noon
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
  • NL-T1:
    • last week the disk storage was expanded to the new pledge level
    • the computing cluster was expanded with 100 nodes of 24 cores each and 8 GB RAM per core
    • a Squid service went down over the weekend and was restarted OK
  • NRC-KI:
  • OSG:
    • issues with the latest monthly WLCG accounting report are being followed up; a few sites have been ticketed
  • PIC:
    • running with 70% of the CPU capacity; the remainder is waiting for full recovery of the power and cooling infrastructure
  • RAL:
    • the CASTOR upgrade to 2.1.14-15 that was canceled last week will happen this Wed
  • TRIUMF: ntr

  • CERN batch and grid services:
  • CERN storage services:
    • CASTOR-ATLAS upgrade to 2.1.15-3 tomorrow 09:00-12:00 (the other 3 experiments already have it)
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Alessandro (ATLAS), Andrea (MW Officer), Lorena (databases), Maarten (SCOD + ALICE), Raja (LHCb)
  • remote: Alexey (PIC), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), Jens (NDGF), Kyle (OSG), Lisa (FNAL), Matteo (CNAF), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Tommaso (CMS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • The CASTOR intervention yesterday went OK; the EOS namespace was also restarted (quota setting issue, now solved)
      • The RAL FTS3 (https://fts3-test.gridpp.rl.ac.uk:8446) was put back into preproduction; it serves 237 DDM endpoints, fts3.cern.ch serves 244 and fts.usatlas.bnl.gov serves 138.

  • CMS reports (raw view) -
    • all quiet, not much activity at the sites (waiting to launch RECO for the 2015 MC samples)
    • The EU redirector Xrootd.ba.infn.it had problems twice during the week: a daemon crashed due to "too many threads" (though apparently still below the configured limit). This is being investigated with the developers. It should have no effect on site readiness (it only produces a "Warning"), but for 2 sites (T2_HU_Budapest and T2_IN_TIFR) it became "Critical"; it is not clear why, investigating.
    • The quota limit on the EOS unmerged area was reached, since now also the T0 writes there. The situation is being understood and new quotas are being set.
    • Some sites are complaining about "hot files" being requested much more often than average. This is understood at the level of our software: the first pile-up file is always opened in order to read the metadata. We are trying to find solutions.
    • some small issues reported:
      • GGUS:113049 (CERN) the CASTOR T0CMS instance was in bad shape for ~24 hours, then recovered. It is not clear that we understood why.
      • GGUS:113036 T0->T1 tests were failing for a few days, which turned all the T1s red in the readiness; mostly corrected, apart from 2 days that are still red and will need another correction.
      • GGUS:112942 (CERN) ARGUS was not answering correctly for some DNs. This is being followed up; for the moment it was cured by adding servers, but the real issue is not understood (as far as I understand).

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Mostly user and MC jobs in the system.
    • FTS transfer problems to Nipne continuing (GGUS:112044)
    • VOMS-Admin issues waiting for a solution from the VOMS admins / developers: GGUS:112282 and GGUS:112279
    • Raja: thanks to RAL for the smooth CASTOR upgrade on Wed

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC:
    • CMS opened a ticket against PIC because of slow response times experienced by jobs:
      • there was too much activity by CMS, running at 1.5 times their pledge plus doing many transfers etc.
      • the ticket was therefore bounced back to CMS support
  • RAL:
    • the CASTOR upgrade on Wed went OK
  • TRIUMF:
    • next Wed from 17:00 to 21:00 UTC dCache will be upgraded to 2.10.24 and other services will be upgraded too

  • CERN batch and grid services:
  • CERN storage services:
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

-- AndreaSciaba - 2015-02-27
