Week of 140519

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria D (SCOD), Alessandro (ATLAS), Maarten (ALICE), Kate (CERN Databases), Xavier (CERN Storage), Manuel G (CERN Grid Services), Ken (CMS).
  • remote: Michael E (BNL), Sang-Un (KISTI), Lisa (FNAL), Vladimir (LHCb), Jeremy (GridPP), Kyle (OSG), Onno (NL_T1), Ulf (NDGF), Salvatore (INFN-T1), Rolf (IN2P3), Tiju (RAL).

Experiments round table:

  • CMS reports (raw view) -
    • It has been very quiet. Usual processing continues.
    • GGUS:105494 is about failing SAM tests at T1_FR_CCIN2P3. I have this from Sebastian: "The SAM tests are in critical state, but the only errors are on the 2 CEs used for the multi-core test: we need to add the 'pilot' role so as to install what is needed for glexec to work. As ATLAS requests that these CEs be kept in production, we plan to fully install them like the other production CEs at the next downtime, foreseen on June 17th. In the meantime, can CMS please remove these CEs from the critical services of CCIN2P3 (at least to avoid receiving tickets for this)?" Some slightly puzzling things here. The first is a question for CMS: why do we consider this a test failure if at least some of the CEs are working correctly? The others are why the GGUS ticket did not get a response for 48 hours, and why the GGUS ticket links to a "related issue" that requires a CNRS certificate to view. Rolf answered:
      • on the 1st point: the ticket was not an ALARM; the on-site engineer would have handled it immediately only IF it had concerned a problem with the correct functioning of the site itself, whereas USER and TEAM tickets are handled during working hours.
      • opening an ALARM just to get a quick reply is not allowed. ALARMs are only possible for problem types covered by the MoU, e.g. severe data transfer problems.
      • on the 2nd issue, a GGUS development item should be created (see the GGUS section below).
    • GGUS:105493 is about transfer problems from T1_UK_RAL; it is an FTS problem and should be on the way to resolution.

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC: The network intervention did not go smoothly and took more time than expected; it was finished at 11:00 UTC.
  • NDGF: NDGF-T1 had a quiet weekend; no downtimes are planned. We received a strange and confusing ATLAS ticket (GGUS:101063) about ATLAS not being able to read a file that was never written to our storage; there might be a catalogue problem. Alessandro said this was also seen elsewhere, so the matter will be followed up offline and updates will be recorded in the ticket.
  • NL_T1: SARA will be in downtime this Thursday to replace memory modules on many dCache nodes. Writing will be possible, but reading may fail. GOCDB has been updated.
  • IN2P3: nta
  • GridPP: ntr
  • INFN-T1: ntr
  • OSG: GGUS attachment issue being investigated with Guenter.
  • PIC: ntr
  • CERN:
    • Databases: The ATLAS ARC db was affected last Friday, 16 May, when the CERN public network was down between 15:30 and 17:30 CEST.
    • Storage: CASTOR was affected the same way last Friday. A new release is expected this week.
    • Grid Services: ntr

  • GGUS: Due to a public holiday in Germany on Thursday 29 May, the release has been moved from Wednesday 28 May to Monday 26 May (see https://wiki.egi.eu/wiki/GGUS:Release_Schedule#2014 and https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=14254). The following issues should be recorded in JIRA tickets of the GGUS tracker:
    1. The ALARM tests this month must be done on Tuesday 27 May and NOT right after the release as usual, due to public holidays in the USA and the UK on Monday.
    2. Some CMS tickets are marked as 'Incident' although they should be classified under the ticket category "Documentation", for example GGUS:105098. Could these be tickets that were so far handled in the Savannah tracker 'CMS computing'? To be discussed in the GGUS dev. team and with Christoph Wissing.
    3. The GGUS "Related Issue" field sometimes contains links to web sites that GGUS users do not have permission to access.
    4. GGUS developers should also record the OSG attachment issue in a JIRA ticket, with diagnostics etc.

AOB:

Thursday

Attendance:

  • local: Kate (databases), Maarten (SCOD), Manuel (CERN grid services), Pablo (grid monitoring + GGUS), Xavi (CERN storage)
  • remote: Antonio (CNAF), Jeremy (GridPP), John (RAL), Kai (ATLAS), Kyle (OSG), Lisa (FNAL), Michael (BNL), Rolf (IN2P3), Sang-Un (KISTI), Stefano (CMS), Ulf (NDGF), Vladimir (LHCb)

Experiments round table:

  • CMS reports (raw view) -
    • Running on all resources we can get hold of and no major problem
    • we have started treating the SAM glexec test as critical. Some sites have problems and tickets were opened on Monday, May 19; a few are still open:
      • GGUS:105538 T1_DE_KIT : acknowledged by the site and being investigated
      • GGUS:105543 T2_FR_IPHC : no detail in the ticket but seems to work now
      • GGUS:105543 T2_IT_Rome : no reply from site yet and failure persists
      • we are following up with sites on those

  • ALICE -
    • NTR

Sites / Services round table:

  • BNL - ntr
  • CNAF
    • the Audioconf web site was very slow!
      • Maarten: we have a ticket open with the experts and will update it with this observation
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • on June 17 there will be an outage affecting most services; details will be provided 1 week before
  • KISTI - ntr
  • NDGF
    • a second ticket was received from ATLAS about another file that could not be accessed from our storage, because it was never written there; the ATLAS catalog may have some issues
  • OSG - ntr
  • RAL
    • the Audioconf web site was also very slow for RAL
      • see the CNAF report above

  • CERN grid services
    • Update on SAM WMS servers next Monday (ITSSB)
  • CERN storage
    • EOS-ATLAS update Tue May 27 10:00-10:30 CEST
    • CASTOR update plans:
      • all interventions should see an actual outage of ~10 min
      • Mon May 26 10:00-10:30 CEST for CMS
        • will fix a bug concerning 3rd party copy checksums
      • Mon Jun 02 08:30-09:30 CEST for ATLAS
      • Tue Jun 03 08:30-09:30 CEST for LHCb
      • Wed Jun 04 08:30-09:30 CEST for ALICE
  • databases - ntr
  • GGUS
    • GGUS release on the 26th. The ALARM tests for the UK and the USA will be done on the 27th; the rest on the 26th.
    • The Savannah-GGUS bridge for CMS will be decommissioned in this release
    • More information is needed about the "cannot add attachment" issue
      • Kyle: will follow up

AOB:

