Week of 140505

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria D (SCOD), Tomas (ATLAS), Maarten (ALICE), Felix (ASGC), Kate (CERN Databases), Daniel (CERN Grid Services), Luca (Storage), Ken (CMS), Eddie (Grid Monitoring).
  • remote: Rolf (IN2P3), Dimitri (KIT), Vladimir (LHCb), Salvatore (CNAF), Kyle (OSG), Michael (BNL), Pepe (PIC), Christian (NDGF).
Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • deletion errors at CERN-PROD_DATATAPE (GGUS:103691), SURLs turned into question marks. Luca was asked to re-assign this ticket in GGUS to the Support Unit VOSupport. ATLAS will follow up on this. The RAL problem reported below is unrelated.
      • Jobs at CERN-P1 are failing with "cannot obtain credentials for protocol: Seckrb5" (ELOG:49147)
      • ss-ral degraded (ELOG:49217)
      • SAM-based automatic blacklisting affected by expired CRLs on SAM nodes (already fixed)
    • T1
      • staging errors at FZK-LCG2_MCTAPE (GGUS:104987)
      • deletion errors at RAL-LCG2 (ELOG:49039)
      • staging errors at SARA-MATRIX tapes (GGUS:105102)
      • CVMFS stratum 1 problems at TAIWAN-LCG2 (ELOG:49215)

  • CMS reports (raw view) -
    • The only issue I have been following is the fallout from the DigiCert CRL expiration date issue that OSG mentioned at the Friday meeting. While very few users have DigiCert certificates (we may have found one affected user), US sites do use these certificates on their servers. This led to SAM test failures at nearly all US sites over the weekend. Things are fine now, as Nagios has picked up the latest CRL with the correct expiration date. What is not understood is why this did not happen until Monday morning, when CRLs are supposedly updated every six hours and DigiCert supposedly had a fix on Friday. Andrea Sciaba is looking into it. Maarten reported that the duration of the problem had a cause unrelated to the OSG DigiCert issue: a new version of the fetch-crl rpm had been given a wrong name, so the cron job did not run successfully when OSG fixed their CRL problem, and the propagation took until this morning. Eddie will open a GGUS ticket to make sure the Grid Monitoring scripts are checked, so that this issue does not come up again when another CA changes its CRL.

  • ALICE - ntr

Discussion during the meeting included a dCache problem at GridKa and PIC with Brazilian certificates. Vladimir will open a GGUS ticket for each of these sites, although this might be a dCache bug and not a site problem. GridKa has already confirmed that it is reproducible.

Sites / Services round table:

  • ASGC: ntr
  • BNL: The CRL problem was still affecting the site on Saturday for transfers from the CERN FTS servers. Many thanks to Michail Salichos for the investigation and quick repair of the problem during the weekend.
  • FNAL: no report
  • OSG: Checking the DigiCert policy to avoid a recurrence of last week's issue.
  • NL_T1: no report
  • IN2P3: ntr
  • KISTI: no report
  • NDGF: A cooling system failure at one of their sites obliged them to remove the computing nodes to rescue the storage. They now support Rucio; the last site will be configured on Wednesday.
  • RAL: no report
  • KIT: An update on the SRM problem reported by LHCb is in the GGUS ticket. The issue will be over soon. There was a tape library problem last Friday. The reason is not understood; the symptom is that tapes remain mounted after use.
  • CNAF: ntr
  • PIC: The LHCONE link connecting to their T2s was activated on 23 April. Everything has worked well since, so it can now be announced as 'in production'.

  • GGUS: no report

  • CERN:
    • Grid monitoring: ntr
    • Storage: ntr
    • Grid services:
    • Databases: The CMS Online DB will be moved to Oracle 11.2.0.4 this Wednesday, starting at 8am CEST and, hopefully, ending at noon. The service will be at risk.
AOB: CERN Power test tomorrow at 6:30am CEST for 10 mins. Details HERE.

Thursday

Attendance:

  • local: Maria D (SCOD), Maarten (ALICE), Felix (ASGC), Daniel (CERN Grid Services), Luca (Storage), Pablo (Grid Monitoring & GGUS).
  • remote: Sang-Un (KISTI), Alexander (NL_T1), Ian (CMS), Vladimir (LHCb), Rob (OSG), Christian (NDGF), Pepe (PIC), Gareth (RAL), Antonio (CNAF).
Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • site services still unstable, waiting for the DQ2 fix (voboxes degraded, ELOG:49262)
      • Jobs at CERN-P1 are failing with "cannot obtain credentials for protocol: Seckrb5" (GGUS:105227)
    • T1
      • NTR
    • Nobody present or connected.

  • CMS reports (raw view) -
    • CMS continues with simulation and analysis tool testing. Several small but time-critical samples are in production at Tier-1s and Tier-2s. So far running smoothly. The only reported issue was a misunderstanding about whether the SHA-2 versions of the VOMS server should be used for anything.

  • ALICE - ntr

Sites / Services round table:

  • ASGC: ntr
  • BNL: Last Tuesday's intervention finished successfully and ahead of time! Now enjoying 400Gbit links between switches.
  • FNAL: no report
  • OSG: The analysis of last week's DigiCert problem is ongoing. Luckily, no operational disturbance has appeared since.
  • NL_T1: ntr
  • IN2P3: no report
  • KISTI: During a firewall maintenance intervention yesterday, a problem occurred with the update of the configuration rules, which is now being investigated.
  • NDGF: ntr
  • RAL: ntr
  • KIT: no report
  • CNAF: ntr
  • PIC: ntr

  • CERN:
    • Grid Services: SHA-2 certificates were automatically added to the ALICE VOMS users this morning (http://cern.ch/go/b8k9). For LHCb and ATLAS this will happen next week (12th and 14th of May respectively).
    • Grid monitoring: ntr
    • Storage: busy investigating the ATLAS ticket mentioned above.

  • GGUS: Proposal to stop ticket creation through email. More info. At the moment, this feature can cause the creation of a lot of fake tickets, and not many real tickets are opened through email. Discussion showed that stopping GGUS ticket submission by email would be welcomed by WLCG, where most tickets are TEAM tickets, which can already be submitted only via the web.

AOB:

-- SimoneCampana - 20 Feb 2014

Topic revision: r17 - 2014-05-08 - MaartenLitmaath
 