Week of 140811

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Dimou (chair, notes), Maria Alandes (WLCG Operations), Alessandro di Girolamo (ATLAS), Andrej Filipcic (ATLAS), Lorena Lobato (Databases), Daniel Pek (Grid services), Tsung-Hsun Wu (ASGC), Xavier Espinal (Storage).
  • remote: Sang-Un Ahn (KISTI), Thomas Roeblitz (NDGF), Jeremy Coles (GridPP), Michael Ernst (BNL), Rob Quick (OSG), Tiju Idiculla (RAL), Dimitri Nilsen (KIT), David Bouvet (IN2P3), Onno Zweers (NL-T1), Burt Holzman (FNAL), Vladimir Romanovski (LHCb).

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Rucio prod server: 2 nodes out of 5 unavailable
    • Tier0/1
      • FZK GGUS:107583: many lost heartbeats. Problems between the ATLAS pilot and the SGE batch system: ATLAS developers are working on it. In the meantime, the maxmemory parameter in AGIS has been modified to 16G.
      • CERN-PROD issues, in particular with Sim@P1 (GGUS:107504). CERN experts are working on the issue.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC and User jobs mostly
      • T0: Alarm ticket (GGUS:107587) EOSLHCB unavailable. SRM-EOSLHCB restarted.
      • T1:

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: not connected
  • FNAL: ntr
  • GridPP: ntr
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1:
    • To resolve LHCb ticket GGUS:107422 (problem with a Polish site) SARA switched on MTU probing. LHCb is happy that the problem is gone.
    • On a more general note, based on experience from the above ticket, Alessandro asked whether PerfSONAR is used globally to monitor network issues proactively. Maria A. said this is being addressed in WLCG Ops Coord and the relevant Task Force will report at the next meeting.
    • Alessandro also said he emailed the NL-T1 management list last Friday with a request for a high memory queue. Onno hadn't seen that mail yet. He would check with Jeff Templon and reply. In general, it is safer to open GGUS tickets and link them from the twiki pages of such meetings.
  • OSG: CERN BDII problem reported in GGUS:107621. Maria A. observed the inconsistencies and degradation of the service during the weekend and is now following up with CERN Grid Services.
  • PIC: not connected
  • RAL: ntr
  • RRC-KI: not connected
  • TRIUMF: not connected

  • CERN batch and grid services:
    • Tomorrow: migration of the /cvmfs/atlas.cern.ch and /cvmfs/alice.cern.ch stratum 0 to cvmfs 2.1. Transparent for all other users of the repository.
    • INCIDENT: An unknown stratum 1 has been blocked with iptables at the stratum 0 for causing excessive load. Being followed up. It has caused some slowness in writing to the stratum 0.

  • CERN storage services: In addition to the LHCb ALARM, an EOS CMS namespace problem was observed and the service was restarted.
  • Databases: ntr
  • GGUS: The email problems at KIT were caused by a mixture of issues. The main reason was an overload of both the incoming and the outgoing mail server. Additionally, some connectivity problems led to time-outs for connections to the DNS service. Report by Guenter Grein.
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Maria Dimou (chair, notes), Maria Alandes (WLCG Operations), Alessandro di Girolamo (ATLAS), Lorena Lobato (Databases), Daniel Pek (Grid services), Andrea Manzi (MW Officer), Felix Lee (ASGC).
  • remote: Sang-Un Ahn & You jin Cha (KISTI), Thomas Roeblitz (NDGF), Michael Ernst (BNL), Elisabeth Prout (OSG), John Kelly (RAL), Thomas Hartmann (KIT), Emmanuel Vamvakopoulos (IN2P3), Onno Zweers (NL-T1), Burt Holzman (FNAL), Vladimir Romanovski (LHCb).

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
    • Tier0/1
      • Tier1s without high memory queues have been kindly requested to provide some resources and a queue: GGUS:107623 INFN-T1, GGUS:107624 NL-T1, GGUS:107625 pic, GGUS:107626 TAIWAN-LCG2.
      • FZK "lost heartbeat" issues reported in the last weeks are now in JIRA https://its.cern.ch/jira/browse/ADCSUPPORT-3861 and are being discussed internally in ATLAS.
      • TRIUMF (TRIUMF-LCG2-MWTEST SCRATCHDISK) GGUS:107680 file transfer issue: the problem is on the ATLAS side. Some transfers are hopping via a test scratchdisk endpoint, which is then not able to handle the load. The right hop should be the official TRIUMF-LCG2_SCRATCHDISK endpoint. ATLAS is now trying to solve the issue.
      • FZK GGUS:107672 storage issues. From the site: the issues are present under high load; the vendor is not available now; following up.
      • IN2P3-CC GGUS:107650 transfer issue due to high load; it should be OK now.
      • TAIWAN-LCG2 GGUS:107653 transfer issues

  • CMS reports (raw view) -
    • Sorry I lost sense of time. Forgot the meeting.
    • GGUS:107621 We had an interesting problem with BDII overload caused by some CMS users who were running their own grid tools. Not sure if, how or when it is worth following up on how to prevent or react to this.
    • GGUS:107578 odd CERN-EOS failures from Tier0. Why no reply yet?

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC and User jobs
    • T0: Pilot submission problem (GGUS:107663); fixed.
    • T1: SARA, GRIDKA: Impossible to replicate a user's file (GGUS:107655). Brazilian certificate? Thomas H. will discuss with dCache experts at KIT past problems with certificates issued by the Brazilian CA. Andrea M. said problems with some SSL library are observed in some cases but not always. We should collect these correlations and discuss the Brazilian CA issues offline.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: not connected
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: We had a network issue on the OPN link due to a power cut in the US that made the link between Amsterdam and New York unavailable. The outage lasted 24 hours, starting on 11 August at 18:14 Central European Time. We observed that it did not affect job submission but reduced job efficiency to 10%, presumably due to time-outs when accessing remote files or databases before and/or after job execution. We also observed that retrieving registered proxy information and getting delegations failed due to time-outs when connecting to myproxy.cern.ch. We will discuss with the KISTI KREONET service team to find a good solution to avoid this kind of issue. Sang-Un will be absent for 3 weeks. If there are issues with the VOs and the site, he will send KISTI's report by mail.
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • OSG: sent by email due to a bad telephone line: "We've been monitoring the performance of the BDII around this issue in this OSG ticket. The performance issues have resolved."
  • PIC: not connected
  • RAL: ntr
  • RRC-KI: not connected
  • TRIUMF: not connected

  • CERN batch and grid services:
    • BDII incident: lcg-bdii was subject to what was effectively a DoS from grid worker nodes in Wisconsin. The root cause was a misconfiguration of some CMS jobs, which queried the BDII "too much or too badly". We banned BDII access from Wisconsin yesterday, but since then the jobs have been remediated. We're following up with the BDII developers to configure the BDII to cope better with this kind of load, mainly by increasing the LDAP thread pool. Maria A. arranged with the service manager to apply 64 threads on the top CERN BDII, a measure that solves the problem. The ban on Wisconsin can be lifted.
    • CVMFS repositories /cvmfs/atlas.cern.ch and /cvmfs/alice.cern.ch: stratum 0 migrated to 2.1. All okay, though ATLAS took a full day rather than the hoped-for half day.
  • CERN storage services: no report
  • Databases: ntr
  • GGUS: Report by Guenter: The GGUS mail parsing has been working again since yesterday morning. Missing mails have been processed and added to the relevant tickets where possible. However, they may not appear in the correct order in the ticket history. This morning the reason was understood: the mail parsing problem was caused by a wrong configuration of our exim mailer. However, we do not completely understand why the problem occurred right now, as the part of the exim configuration that handles domains has not been changed in recent months. We assume that changing the outgoing mail server for exim may have had some impact.
  • Grid Monitoring: no report
  • MW Officer: The Stratum One update at CERN went well. All CVMFS clients are now upgraded.

AOB: Emmanuel said that the GGUS mail parsing may still be missing updates. Two GGUS tickets are missing yesterday's updates in their diary. Maria D. suggested that Emmanuel open a GGUS ticket for the developers, including the exact numbers of the tickets with the missing updates.

Topic revision: r15 - 2014-08-14 - MaartenLitmaath
 