Week of 130114

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance: local (AndreaV, Alessandro, Jerome, Eva, Xavi, Eddie, MariaD, Maarten); remote (Michael/BNL, Gonzalo/PIC, Saverio/CNAF, Matteo/CNAF, Wei-Jen/ASGC, Lisa/FNAL, Christian/NDGF, Tiju/RAL, Renaud/IN2P3, Onno/NLT1, Rob/OSG, Dimitri/KIT; Ian/CMS, Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • BNL-ATLAS GGUS:90352 - not a site issue, as discussed in the ticket: the resources behind that PandaResource are opportunistic, so shifters simply see errors.
    • TAIWAN-LCG2 GGUS:90349 - a problematic disk server on MCTAPE caused file access problems. [Wei-Jen: the disk server issue has been fixed.]
    • Network issue between FZK and 2 Polish sites (CYFRONET and PSNC) GGUS:90341 - "Checked on Polish side PIONIER network, no problem found. Notified DFN-NOC." Transfers worked again after a few hours.
    • CERN-PROD EOS SRM/GRIDFTP issue GGUS:90338
    • FTS channels not defined for 2 UK sites on RAL-LCG2 FTS GGUS:90291, now solved, thanks.
    • [Alessandro: last week we also observed many problems at Tier-2s, mainly with storage, reported in GGUS.]

  • CMS reports -
    • LHC / CMS
      • Getting ready for HI
    • CERN / central services and T0
      • There was an issue with Castor transfers from P5 this morning. It is not clear whether it was a problem with the transfer system or the storage system. [Xavi: this was an issue with a specific server. Are you using the new workflow, and are you observing any issues for lead-proton? Ian: we expect that the rate will not be as high as for heavy ions in 2010; we are watching it but do not expect any issues.]
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
      • VOMS not reachable from outside CERN (GGUS:90295)
    • T1:

Sites / Services round table:

  • Michael/BNL: ntr
  • Gonzalo/PIC: ntr
  • Saverio/CNAF: ntr
  • Wei-Jen/ASGC: an issue with the Castor database is being followed up
  • Lisa/FNAL: ntr
  • Christian/NDGF: we are coming back up after the downtime; all is going smoothly except for some problems with VOMS
  • Tiju/RAL: ntr
  • Renaud/IN2P3: we are still having problems with SRM due to the proxies generated by CRAB. Because of this issue we had around 30 SRM reboots last week, which was quite difficult to handle. We have no news from the dCache people yet. For this reason we want to ask CMS if they can stop these proxies in CRAB. [Ian: will follow up with user education, but technically there is nothing we can do. Maarten: the site can ban the users. Renaud: not sure this would work, as some of the proxies come from outside sites. Ian: will follow up offline because we need to analyse this in more detail. Maarten: IN2P3, please open a GGUS ticket for CMS.] (A sketch of extracting the end-entity DN from such a proxy chain is included after this round table.)
  • Onno/NLT1:
    • last Friday the maintenance of the compute cluster file server lasted longer than expected, but all was OK by Saturday
    • last Friday we also had a problem with a proxy that was rebooted but got stuck, affecting CVMFS because the failover did not work; this was fixed by this morning. [Maarten: please follow up with CVMFS people why the failover did not work.]
    • [Alessandro: are you upgrading WNs to SLC6 in January? Onno: will ask my colleagues.]
    • [Alessandro: are last week's problems with CEs fixed? Onno: will also follow up with my colleagues.]
  • Rob/OSG: ntr
    • [MariaD: there is a ticket assigned to Rob for the migration, can you please follow it up]

  • Eva/Databases: we had an issue with storage during the Active Data Guard upgrade for ALICE on Friday; everything was fixed on Saturday
  • Jerome/Grid: ntr
  • Eddie/Dashboard: ntr
  • Xavi/Storage: nta
  • MariaD/GGUS: The January release will take place on 2013/01/23 (earlier than the usual "last Wednesday of the month"). NB: the GGUS-VOMS synchronisation mechanism will change, in order to rely on GGUS's own server rather than the GridKa one used so far. Details in Savannah:133063. The WLCG request to allow all VO members to view their TEAMers and ALARMers from within GGUS (not only via VOMS) will also enter production with this release. Details in Savannah:134672 and Savannah:134697.
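
A note on the IN2P3/CMS proxy discussion above: before a site can ban the users behind problematic proxies, it has to identify the end-entity DN in the delegated proxy chain. The sketch below is a minimal, hypothetical illustration using the Python cryptography package (a recent version is assumed); the proxy file path is a placeholder, this is not a tool that IN2P3 or CMS actually use, and an actual ban would be applied through the site's authorization framework rather than by a script like this.

<verbatim>
# Minimal sketch (hypothetical): list the certificate subjects in a delegated
# proxy chain. The non-proxy subject is the end-entity DN a ban would target.
from cryptography import x509

PROXY_FILE = "/tmp/x509up_example"  # placeholder path, not a real proxy file

def chain_subjects(path):
    """Return the subject DN of every PEM certificate found in the proxy file."""
    data = open(path, "rb").read()
    begin = b"-----BEGIN CERTIFICATE-----"
    end = b"-----END CERTIFICATE-----"
    subjects = []
    # A proxy file concatenates the proxy certificate(s), the private key and
    # the user certificate; parse each PEM certificate block in turn (the key
    # block is skipped automatically because it has a different PEM header).
    for chunk in data.split(begin)[1:]:
        pem = begin + chunk.split(end)[0] + end + b"\n"
        cert = x509.load_pem_x509_certificate(pem)
        subjects.append(cert.subject.rfc4514_string())
    return subjects

if __name__ == "__main__":
    for dn in chain_subjects(PROXY_FILE):
        print(dn)
</verbatim>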

AOB: none

Tuesday

Attendance: local (AndreaV, Eddie, Jerome, Eva, Xavi); remote (Saverio/CNAF, Michael/BNL, Xavier/KIT, Ronald/NLT1, Wei-Jen/ASGC, Tiju/RAL, Christian/NDGF, Lisa/FNAL; Vladimir/LHCb, MariaD/GGUS).

Experiments round table:

  • ATLAS reports -
    • No data taking till Wednesday
    • Nothing else to report.

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
      • VOMS not reachable from outside CERN (GGUS:90295): Solved
    • T1:
      • GRIDKA: timeouts of transfers to GridKa disk storage are under investigation (GGUS:88425, GGUS:88906). A downtime for the separation of LHCb from the gridka-dcache.fzk.de SE will hopefully solve this problem

Sites / Services round table:

  • Saverio/CNAF: ntr
  • Michael/BNL: ntr
  • Xavier/KIT:
    • there was a downtime for LHCb as mentioned in their report
    • there was also another downtime for upgrading Frontier Squids
  • Ronald/NLT1: there were two open questions yesterday
    • yes, there is the intention to migrate SARA to SL6, but this is more likely to happen in the first half of February than in January
    • yes, the issue with the CEs and CVMFS is fixed
  • Wei-Jen/ASGC: ntr
  • Tiju/RAL: ntr
  • Christian/NDGF:
    • VOMS issue mentioned yesterday is now fixed
    • there will be a short downtime on Thursday morning
  • Lisa/FNAL: ntr

  • Xavi/Storage: added 6 new disk servers for CMS to address the issue mentioned yesterday
  • Jerome/Grid: ntr
  • Eddie/Dashboard: ntr
  • Eva/Databases: another issue since 6 am this morning with the Active Data Guard copy of ADCR (the DB is up but not being synchronised)
  • MariaD/GGUS: ntr

AOB: none

Wednesday

Attendance: local (AndreaV, Xavi, Jerome, Eddie, LucaC); remote (Saverio/CNAF, Matteo/CNAF, John/RAL, Lisa/FNAL, Pavel/KIT, Rob/OSG, Rolf/IN2P3, Ron/NLT1; Vladimir/LHCb, MariaD/GGUS).

Experiments round table:

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
      • CASTOR problem: possible unavailability detected in c2lhcb/lhcbdisk (Wed Jan 16 08:50:58 2013); 2 nodes seem to be unavailable. [Xavi: the CASTOR issue is now fixed.]
    • T1:
      • GRIDKA: timeouts of transfers to GridKa disk storage are under investigation (GGUS:88425, GGUS:88906). A downtime for the separation of LHCb from the gridka-dcache.fzk.de SE will hopefully solve this problem

Sites / Services round table:

  • Christian/NDGF [via email before the meeting]: there will be another downtime tomorrow for updates of the dCache headnodes. This will affect all data. The downtime is only expected to last a few minutes as the servers reboot.
  • Saverio/CNAF: ntr
  • John/RAL: ntr
  • Lisa/FNAL: ntr
  • Pavel/KIT: ntr
  • Rob/OSG: ntr
  • Rolf/IN2P3: ntr
  • Ron/NLT1: ntr

  • Xavi/Storage: nta
  • Eddie/Dashboard: ntr
  • Jerome/Grid: ntr
  • LucaC/Databases: the issue reported yesterday with the ADCR Active Data Guard is due to a lost write and is being investigated with Oracle and NetApp. Active Data Guard for ADCR was restored yesterday.
  • MariaD/GGUS: ntr

AOB: none

Thursday

Attendance: local (AndreaV, Eddie, Jerome, Xavi, MariaD); remote (Saverio/CNAF, Matteo/CNAF, Michael/BNL, Wei-Jen/ASGC, Ronald/NLT1, WooJin/KIT, Rolf/IN2P3, Rob/OSG, Jeremy/GridPP, Christian/NDGF, Lisa/FNAL, Dimitrios/RAL; Vladimir/LHCb).

Experiments round table:

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
    • T1:
      • GRIDKA: timeouts of transfers to GridKa disk storage are under investigation (GGUS:88425, GGUS:88906). A downtime for the separation of LHCb from the gridka-dcache.fzk.de SE will hopefully solve this problem. The downtime has finished and we are trying to use the new SRM endpoint.

Sites / Services round table:

  • Saverio/CNAF: ntr
  • Michael/BNL: ntr
  • Wei-Jen/ASGC: ntr
  • Ronald/NLT1: ntr
  • WooJin/KIT:
    • as reported by LHCb, the downtime is finished
    • a tape library is still down and the vendor is working on it
    • the Squid upgrade is also complete
  • Rolf/IN2P3: ntr
  • Rob/OSG: ntr
  • Jeremy/GridPP: ntr
  • Christian/NDGF: someone has cut our main network fiber; we are up but service quality is degraded
  • Lisa/FNAL: ntr
  • Dimitrios/RAL: ntr

  • Jerome/Grid: ntr
  • Xavi/Storage: ntr
  • Eddie/Dashboard: ntr
  • MariaD/GGUS: Reminder: the January release will take place next Wednesday, 2013/01/23. There will be the usual series of test ALARMs.

AOB: none

Friday

Attendance: local (AndreaV, Xavi, Ignacio, Jerome, Eddie, Eva); remote (Michael/BNL, Lisa/FNAL, Jeremy/GridPP, Wei-Jen/ASGC, Xavier/KIT, Onno/NLT1, Kyle/OSG, Rolf/IN2P3, Gareth/RAL, Saverio/CNAF, Matteo/CNAF; Vladimir/LHCb, Ian/CMS).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • Waiting for beam
    • CERN / central services and T0
      • The use of 2 Tier-0 systems (the new one as primary and the old one as backup), combined with writing streamers to tape, has the potential to put a high load on Castor. We are watching this and will try to turn off elements as soon as we are convinced that the new system's results are fully reliable. [Xavi: this is in line with what we reported on Monday. We are watching this too and will keep in contact with CMS.]
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • CERN: expired CRL being served for the Polish CA, team ticket GGUS:90539 in progress. [Jerome: this was solved half an hour ago.]
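
A note on the expired-CRL ticket above: the symptom can be verified independently of the middleware by fetching the CA's CRL and comparing its nextUpdate field with the current time. The sketch below is a minimal, hypothetical check using the Python cryptography package (a recent version is assumed); the CRL URL is a placeholder, not the actual Polish CA distribution point.

<verbatim>
# Minimal sketch (hypothetical): check whether a published CRL has expired.
from datetime import datetime, timezone
from urllib.request import urlopen

from cryptography import x509

CRL_URL = "http://example-ca.example.org/ca.crl"  # placeholder, not the real CA endpoint

def crl_is_expired(url):
    """Download a CRL and report whether its nextUpdate time has passed."""
    data = urlopen(url, timeout=30).read()
    try:
        crl = x509.load_der_x509_crl(data)   # most CAs publish DER-encoded CRLs
    except ValueError:
        crl = x509.load_pem_x509_crl(data)   # fall back to PEM
    # nextUpdate is the deadline by which the CA must publish a fresh CRL.
    next_update = crl.next_update.replace(tzinfo=timezone.utc)
    return next_update < datetime.now(timezone.utc)

if __name__ == "__main__":
    print("CRL expired:", crl_is_expired(CRL_URL))
</verbatim>

If nextUpdate is in the past, the file being served is stale and certificate validation starts to fail, which matches the symptom reported in GGUS:90539.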

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
    • T1:
      • GRIDKA: after a few fixes LHCb started transfers. The error level is lower than before, and we have no SRM errors anymore

Sites / Services round table:

  • Michael/BNL: ntr
  • Lisa/FNAL: ntr
  • Jeremy/GridPP: ntr
  • Wei-Jen/ASGC: one Castor issue last night due to high load, now fixed
  • Xavier/KIT: ntr
  • Onno/NLT1: ntr
  • Kyle/OSG: ntr
  • Rolf/IN2P3: ntr
  • Gareth/RAL: ntr
  • Saverio/CNAF: ntr

  • Eddie/Dashboard: ntr
  • Xavi/Storage: nta
  • Jerome/Grid: nta
  • Ignacio/Grid: the LFC upgrade is on hold waiting for ATLAS; can they confirm whether we can go ahead? [AndreaV: will follow up with ATLAS.]
  • Eva/Databases: ntr

AOB: none
