Week of 140623

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Prasanth Kothuri (IT-DB), Felix Lee (ASGC), Maarten Litmaath (ALICE), Raja Nandakumar (LHCb)
  • remote: Michael Ernst (BNL), Oliver Gutsche (CMS), Tiju Idiculla (RAL), Lisa Giacchetti (FNAL), Dmitry Nilsen (KIT), Rolf Rumler (IN2P3-CC), Boris Wagner (NDGF), Onno Zweers (NL-T1)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • DDM voboxes failing from time to time: ss-ft (atlas-SS01) and ss-staging (atlas-SS09)
    • T0/T1s
      • NIKHEF-ELPROD: failing transfers to SARA-MATRIX (GGUS:106397). The site is investigating
      • FZK-LCG2: 25% of transfers to CERN-PROD failing with "Connection timed out" (GGUS:106398)

  • CMS reports (raw view) -
    • CERN
      • intermittent xrootd read errors from EOS, tickets: GGUS:105678, GGUS:105836, GGUS:106113
      • problem is not reproducible, but causes a significant file read failure rate
      • all tickets have been moved from the EOS 3rd Line Support group to the EOS Functional Manager group
    • T1 sites

Regarding the EOS problems, Oliver adds that CMS has given all possible feedback to help debug the problem. All the details are now in the relevant tickets.

  • ALICE -
    • CERN: SLC6 CEs published bad job numbers for a short while on Sunday afternoon (GGUS:106399)

Maria asks whether this refers to the numbers published by the BDII. Maarten confirms that these are indeed the numbers of running and waiting jobs published by the CREAM CE resource BDIIs. Maarten adds that this affects the ALICE job submission engine and that it seems to come from a misconfiguration in the LSF information providers. The problem went away without any intervention, but it has been reported for the record.

  • LHCb reports (raw view) -
    • Main activity: user jobs continuing; test reprocessing done and preparing for more; larger MC productions running again
    • T0: CASTORLHCB currently seems to be down (INC0583679)
    • T1:
    • Various sites:
      • BDII and SRM publish inconsistent storage information (RAL - GGUS:105571, CBPF - GGUS:105572)
      • Multiple sites (SARA, RRC-KI) provided extra storage last Friday - thanks!

Maria mentions that she will follow up on the file protocol issue with the dCache developers. Regarding the BDII and SRM inconsistencies at RAL, Raja adds that these will presumably be fixed with the next CASTOR upgrade, scheduled for this week.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: no report
  • FNAL: ntr
  • GridPP: no report
  • IN2P3: ntr
  • JINR: no report
  • KISTI: no report
  • KIT: ntr
  • NDGF: Some dCache ATLAS files were corrupted. These have now been deleted. ATLAS is aware of the issue.
  • NL-T1: Warning downtime of 1h tomorrow afternoon.
  • OSG: no report
  • PIC: no report
  • RAL: A full-day CASTOR upgrade will take place on Tuesday for ALICE and on Thursday for LHCb.
  • RRC-KI: no report
  • TRIUMF: no report

  • CERN batch and grid services: no report
  • CERN storage services: no report
  • Databases: A series of interventions is scheduled over the coming days:
    • ATONR: Monday 30th June 14:00-16:00 (database unavailable: 14:00-14:20)
    • ADCR: Tuesday 1st July 09:00-12:00 (database unavailable: 09:00-09:20)
    • ATLR: Tuesday 1st July 09:00-12:00 (database unavailable: 09:00-09:20)
    • ATLARC: Tuesday 1st July 14:30-17:30 (database unavailable: 14:30-14:50)
    • CMSONR & CMSONR ADG: Wednesday 25th June 14:00-17:00
    • CMSARC: Tuesday 24th June 14:00-16:00
    • CMSR: Wednesday 25th June 10:00-12:00
  • GGUS: no report
  • Grid Monitoring: no report
  • MW Officer: no report

AOB:

None

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Xavier Espinal (Storage), Marcin Blaszczyk (IT-DB), Felix Lee (ASGC), Maarten Litmaath (ALICE), Raja Nandakumar (LHCb), Ignacio Reguero (Batch and Grid services)
  • remote: Jeremy Coles (GridPP), Dennis van Dok (NL-T1), Michael Ernst (BNL), Antonio Falabella (CNAF), Pepe Flix (PIC), Lisa Giacchetti (FNAL), Kyle Gross (OSG), Thomas Hartmann (KIT), Rolf Rumler (IN2P3-CC), Gareth Smith (RAL), Boris Wagner (NDGF), Rob Quick (OSG)

Experiments round table:

  • CMS reports (raw view) -
    • No major issues; processing and production are continuing at high scale
    • Regarding the GGUS tickets about CERN EOS mentioned in Monday's report: CMS is trying to assess whether the problems are still present and to what extent.
    • The GGUS tickets with KIT mentioned on Monday have been solved. Thanks.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity: user jobs continuing; test reprocessing done and preparing for more; larger MC productions running again
    • T0:
      • CASTORLHCB was down on Monday due to a single user overloading it. Fine now after the user was banned.
      • Brazilian proxy problem on EOS (GGUS:106434)
    • T1:
      • GridKa: CVMFS problem on WNs (GGUS:106405), ongoing
      • RAL: GGUS:105571 - the inconsistent storage information publishing may actually be a sensor problem (mix-up of tape and disk info)?

Regarding the CASTORLHCB incident, Raja explains that the user concerned is being trained to avoid a similar incident in the future. Xavier explains that on the CASTOR side a limit is currently set explicitly for this user. Raja mentions that this could be removed next week; LHCb will get back to the CASTOR team.

Maria confirms that the inconsistency between the BDII and SRM numbers published by RAL in the dashboard is now understood (see the Information Systems report at the bottom of the minutes).

Maarten suggests pinging the EOS people again about the Brazilian proxy ticket, since this seems to be a real bug that could affect everyone. Maria suggests waiting for the EOS team's answer and following up next week if necessary.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: ntr
  • IN2P3: There will be an outage of the batch system between 09:00 and 12:00 on Tuesday 1st July.
  • JINR: no report
  • KISTI: no report
  • KIT: ntr
  • NDGF: no report
  • NL-T1: ntr
  • OSG: As already discussed with the SAM team in a phone conference, a test CE is being identified to check whether the HTCondor CE works with SAM. It will probably be at Nebraska, still to be confirmed.
  • PIC: Export of CMS data via xrootd is now enabled. Work is ongoing to also enable xrootd monitoring.
  • RAL: The LHCb CASTOR upgrade went fine. The final one is the ATLAS CASTOR upgrade, scheduled for Tuesday 8th July.
  • RRC-KI: no report
  • TRIUMF: no report

  • CERN batch and grid services: ntr
  • CERN storage services: The CASTOR DBs will undergo a set of rolling interventions for security patches that need to be applied by the DB team (this should be more or less transparent for users):
    • CMS and ATLAS: Monday 30th June at 2pm
    • ALICE and LHCb: Tuesday 1st July at 9 am
    • Nameserver: Wednesday 2nd July at 10 am
  • Databases: Rolling interventions due to the same security patches will also affect:
    • LCGR: Wednesday 2nd July at 10:30 am
    • LHCb Online: Thursday 3rd July at 1:30 pm
    • ATLAS: Apart from the rolling intervention for the security patches, there will be a 20-minute downtime of the DB at the very beginning of the intervention to apply some parameter changes in the DBs:
      • ATLAS online: Monday 30th June at 2 pm
      • ATLAS offline: Tuesday 1st July at 9 am
      • ATLAS archive: Tuesday 1st July at 2:30 pm
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
    • The workaround for dCache publishing the file/nfs protocols is documented in the MW Issues table. A fix will be released in version 2.6.31 and included in the next EMI update, scheduled for the 7th of July. Note that this issue affects dCache versions >= 2.2.
  • Information Systems:
    • The RAL BDII storage capacities for LHCb are now in sync with the numbers reported by SRM (see Dashboard). There is just one remaining mismatch for tape, which is being sorted out with RAL: the numbers are published in both the online and nearline attributes, whereas for tape the online attributes should normally be 0 (see the sketch after this list). See GGUS:105571.
    • New T1 storage deployment metrics collected from the BDII have been created in the Dashboard. Note that the MW Officer still plans to maintain the Storage Deployment table for the time being, since planned changes cannot be collected automatically for the dashboard. The dashboard offers an up-to-date status of the storage deployment as well as historic information as versions get updated. T1s have been contacted in the past days when some of the information published in the BDII was not coherent. The GGUS tickets opened are tracked here.
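
As an aside, here is a minimal sketch of how such a consistency check can be made by hand: the online and nearline GLUE 1.3 storage-area sizes can be read directly from a top-level BDII, e.g. with the Python ldap3 library. The BDII endpoint, base DN and VO filter used below are assumptions that may differ per deployment; this is an illustrative sketch, not part of the official tooling.

from ldap3 import ALL, Connection, Server

# Illustrative sketch (assumptions): a top-level BDII endpoint and the standard
# GLUE 1.3 base DN; real deployments may use a different host or port.
BDII_URL = "ldap://lcg-bdii.cern.ch:2170"
BASE_DN = "Mds-Vo-name=local,o=grid"

# GLUE 1.3 storage-area size attributes (values are published in GB).
ATTRS = [
    "GlueSALocalID",
    "GlueSATotalOnlineSize", "GlueSAUsedOnlineSize",
    "GlueSATotalNearlineSize", "GlueSAUsedNearlineSize",
]

def dump_storage_areas(vo="lhcb"):
    """Print the online/nearline sizes of every GlueSA entry visible for a VO."""
    server = Server(BDII_URL, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous, read-only bind
    # Sites publish the access rule either as "lhcb" or "VO:lhcb", hence the OR.
    conn.search(
        BASE_DN,
        f"(&(objectClass=GlueSA)"
        f"(|(GlueSAAccessControlBaseRule={vo})(GlueSAAccessControlBaseRule=VO:{vo})))",
        attributes=ATTRS,
    )
    for entry in conn.entries:
        values = entry.entry_attributes_as_dict
        print(entry.entry_dn)
        for attr in ATTRS:
            print(f"  {attr}: {values.get(attr, ['(not published)'])}")

if __name__ == "__main__":
    dump_storage_areas()

For a tape-backed storage area one would expect the online size attributes to be 0, which is precisely the condition questioned in GGUS:105571.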

AOB:
