Week of 150223

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Stefan (SCOD/LHCb), Belinda (Storage), Alessandro (ATLAS), Steve (Grid Services)
  • remote: Onno (NL-T1), Lisa (FNAL), Tiju (RAL), Sang-Un (KISTI), Dimitri (KIT), Ulf (NDGF), Christoph (CMS), Di Qing (TRIUMF), Rob (OSG)
  • apologies: Maarten (ALICE)

Experiments round table:

  • ATLAS
    • Central Services/T0/T1
      • some open tickets with CERN-PROD to be checked by ATLAS
      • The issue of FTS3 with Rucio not respecting transfer shares is under investigation

  • CMS reports ( raw view) -
    • Tier-0 processing was stuck due to problems with a VM hosting a critical service. Requested a move to a different hypervisor, GGUS:111930 (the ticket was even escalated to an alarm)
    • Migrated Global Xrootd Redirector to version 4.1
      • Regional redirectors (at FNAL, Nebraska, Pisa, Bari, GRIF) need to be restarted
    • Not much work in the distributed system

  • ALICE -
    • T1-T2 workshop ongoing in Torino, ALICE therefore cannot join this meeting
    • NTR at least until Mon

  • LHCb
    • LHCb distributed computing activities have been drained because of the migration to the new file catalog system. We expect to come back from this downtime this afternoon.
    • T0:
      • Migration of DBOD DBs together with the above migration (other than FC)
    • T1:

Sites / Services round table:

  • ASGC: NR
  • BNL: NR
  • CNAF: NR
  • FNAL: NTR
  • GridPP: NR
  • IN2P3: NR
  • JINR: NR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: Head node reboot and dCache upgrade on Wednesday, 7-9 UTC; the actual outage is 10-15 minutes per node
  • NL-T1: NTR
  • NRC-KI: NR
  • OSG: Fixing the emergency contact at FNAL in the OIM ticketing system. Close to publishing multicore accounting (some last discussions ongoing with APEL).
  • PIC: NR
  • RAL: Storage outage tomorrow for ATLAS, ALICE and LHCb, 8:30 - 18:00 local time
  • TRIUMF: NTR

  • CERN batch and grid services:
    • VOMRS to be decommissioned and replaced by VOMS-admin on Monday 2nd of March.
      • The intervention will last the whole day. During that period of time, VOMS will be unaffected (e.g. voms-proxy-init) but registrations won't work.
      • More info: OTG0018459
  • CERN storage services: NTR
  • Databases: NR
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer: NR

AOB:

Thursday

Attendance:

  • local: Andrea Sciaba', Joel Closier (LHCb), Lorena Lobato (IT-DB), Andrei Dumitru (IT-DB), Herve Rousseau (IT-DSS), Maarten Litmaath (ALICE), Andrea Manzi (IT-SDC), Steve Traylen (IT-PES), Pablo Saiz (IT-SDC)
  • remote: Dennis van Dok (NL-T1), Matteo (CNAF), Sang Un Ahn (KISTI), Lisa Giacchetti (FNAL), Michael Ernst (BNL), Andrej Filipcic (ATLAS), Rob Quick (OSG), Thomas Hartmann (KIT), Ulf Tigerstedt (NDGF), Hung-Te Lee (ASGC), Christoph Wissing (CMS), Rolf Rumler (IN2P3-CC), Di Qing (TRIUMF), Alessandro Di Girolamo (ATLAS)

Experiments round table:

  • ATLAS reports ( raw view) -
    • Had a meeting yesterday with the FTS3 developers about the fairshare bug; the issue should be solved in about two weeks, after which the fix will be tested in the pilot instance and then moved to production
    • Getting ready for the MC15 production, still many issues to solve, including some Tier-1 disk cleaning

  • CMS reports ( raw view) -
    • Learning about authentication of EOS for our Tier-0 applications, GGUS:111949
    • Dealt with the CMS tickets mentioned in the GGUS section

  • ALICE -
    • NTR

  • LHCb reports ( raw view) -
    • LHCb migrated to a new catalog architecture (Dirac File Catalog) on Monday. The migration was fully successful. Many thanks to IT/DB and IT/PES colleagues for their support.
    • Distributed computing dominated by Monte Carlo and user activities.
    • T0:
      • Q: What is the date of the VOMRS move for LHCb? Some questions have not been answered yet. Maarten noted that comments on the ticket already answer some of the points raised. By tomorrow it should be clear what the timescale is for addressing the issues seen by LHCb, and accordingly whether to delay the VOMRS decommissioning.
    • T1:
      • Multiple T1 (and T2D) sites scheduled storage DTs on Tuesday. LHCb would like to discuss the possibility of better sequencing such DTs. Rolf pointed out that the IN2P3-CC downtime was announced 3-4 weeks ago and again one week ago, that each experiment was contacted, and that LHCb agreed to it. Alessandro's opinion was that it is up to the experiments to verify that proposed downtime dates do not create problems for them before agreeing to them. Given that everybody agrees that simultaneous downtimes at many Tier-1s should be avoided, we should 1) have a tool that makes it easy to see when downtimes are scheduled and 2) clarify a policy on who is responsible for preventing this kind of problem.
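
As an illustration of point 1) above, a downtime-overlap check could be sketched roughly as follows. This is only a minimal sketch: the `Downtime` record, the `overlapping` helper and the site names and times in the usage note are hypothetical and are not taken from any existing WLCG tool or from real downtime declarations.

```python
from datetime import datetime
from itertools import combinations
from typing import NamedTuple


class Downtime(NamedTuple):
    """Hypothetical record for one scheduled downtime."""
    site: str
    start: datetime
    end: datetime


def overlapping(downtimes):
    """Return pairs of downtimes at different sites whose intervals intersect."""
    pairs = []
    for a, b in combinations(downtimes, 2):
        # Two half-open intervals intersect when each starts before the other ends.
        if a.site != b.site and a.start < b.end and b.start < a.end:
            pairs.append((a, b))
    return pairs


# Made-up example entries, for illustration only.
EXAMPLE = [
    Downtime("SITE-A", datetime(2015, 3, 3, 8, 0), datetime(2015, 3, 3, 18, 0)),
    Downtime("SITE-B", datetime(2015, 3, 3, 9, 0), datetime(2015, 3, 4, 12, 0)),
    Downtime("SITE-C", datetime(2015, 3, 5, 9, 0), datetime(2015, 3, 5, 10, 0)),
]
```

Calling `overlapping(EXAMPLE)` would flag the SITE-A / SITE-B pair only; the downtime list itself would have to come from wherever the sites declare their downtimes.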

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: a downtime is planned for next Tuesday and Wednesday for the MSS and the batch system: dCache disks will be available but writing to and reading from tape will not be possible, running jobs will be killed and queued jobs removed. Alessandro asked for the garbage collector configuration to be changed accordingly, given that it will be impossible to migrate to tape for more than 24 hours. Rolf will contact ATLAS offline.
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: dCache was updated. It was noticed that yesterday voms2.cern.ch was flapping, refusing connections for grid-mapfile generation. The problem was not noticed by others and it had a negligible impact, but in case it happens again, a ticket should be opened.
  • NL-T1: a dCache upgrade is planned for March 9-10. The period is long because it is difficult to estimate how much time will be needed by the database upgrade.
  • NRC-KI:
  • OSG: The GGUS alarm testing to FNAL failed and it will be retried today.
  • PIC:
  • RAL: ntr
  • TRIUMF: from next Monday the Tier-1 traffic between TRIUMF and BNL will be routed via the LHCONE.

  • CERN batch and grid services:
    • VOMRS to be decommissioned and replaced by VOMS-admin on Monday 2nd of March.
      • The intervention will last the whole day. During that period of time, VOMS will be unaffected (e.g. voms-proxy-init) but registrations won't work.
      • More info: OTG0018459
  • CERN storage services:
    • CASTORALICE upgrade to version 2.1.15 on Monday 9th March at 9:00 CET
    • CASTORLHCB upgrade to version 2.1.15 on Wednesday 4th March at 9:00 CET
  • Databases:
    • this week some integration databases have been patched
    • next week there will be another such intervention (OTG0019044) as well as a transparent intervention on back-end storage (OTG0016010)
  • GGUS:
    • New release done on Wed 25th February. Among the changes, GGUS can change FE in SNOW, and help windows with the 'Type of Problem' definitions
      • Issue with test alarms: the European and Asian alarms were repeated; the BNL and FNAL alarms still have to be repeated
    • Some old tickets require VO attention: CMS: 111376 111367, ATLAS: 108488
  • Grid Monitoring:
  • MW Officer: Reminder of two EGI CSIRT broadcasts sent recently:
    • FTS versions < 3.2.31 are vulnerable and expose logging info, so updating to 3.2.31 is suggested if not yet done (version 3.2.31 is also the baseline)
    • dCache versions < 2.11.9, 2.10.18, 2.9.21, 2.8.25 and 2.7.29 are vulnerable if plain FTP doors are configured (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-8183)
    • Joel asked when the next version of gfal2 will be released: it is now in EPEL testing, so it should be out in about two weeks. It was suggested that LHCb could copy it to CVMFS even before the official release; further discussions offline.
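
For sites that want to check their installed versions against the two broadcasts above, a minimal sketch could look like the following. The helper names are hypothetical, and the code assumes plain `major.minor.patch` version strings; dCache branches not listed in the reminder are reported as unknown rather than guessed.

```python
# Fixed patch level per dCache branch, as listed in the reminder above.
FIXED_DCACHE = {(2, 11): 9, (2, 10): 18, (2, 9): 21, (2, 8): 25, (2, 7): 29}
# First FTS version that is not affected, as listed in the reminder above.
FIXED_FTS = (3, 2, 31)


def parse(version):
    """Turn a dotted numeric version string such as '2.10.18' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def dcache_needs_update(version):
    """True/False for the listed branches, None if the branch is not listed.

    The dCache issue only matters if plain FTP doors are configured.
    """
    major, minor, patch = parse(version)
    fixed_patch = FIXED_DCACHE.get((major, minor))
    if fixed_patch is None:
        return None  # branch not mentioned in the reminder: consult the advisory
    return patch < fixed_patch


def fts_needs_update(version):
    """True if the FTS version is older than 3.2.31."""
    return parse(version) < FIXED_FTS
```

For example, `dcache_needs_update("2.10.17")` is true while `dcache_needs_update("2.10.18")` is false. Comparing per branch matters because a plain "less than 2.11.9" test would wrongly flag patched releases of the older branches.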

AOB: -- AndreaSciaba - 2014-12-16

Topic revision: r22 - 2015-02-26 - MaartenLitmaath
 