Week of 140106

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Felix, Jan, Ken, Maarten, Maria D, Stefan
  • remote: Jeremy, John, Michael, Onno, Rob, Rolf, Sang-Un, Ulf

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • "no FTS server had updated the transfer status" errors still persist (GGUS:99940)
        • No FT transfers during 03:00-09:00, 17:00-19:00, 22:00-10:00 on Jan. 4th
      • Transfer failures: 'Couldn't do ssl handshake OpenSSL Error:... digest too big for rsa key'
        • DE, ES, FR, IT, ND, US clouds. Elog:47458
        • see OpenSSL item in the AOB below
    • T1
      • IN2P3-CC: transfer failures. Unable to connect to ccdcatli002.in2p3.fr:2811 (GGUS:100031) fixed.
      • PIC: transfer failures. "Marking Space as Being Used failed => Already have 1 record" (GGUS:100056) fixed.
      • FZK-LCG2: the pool f01-032-101-e_3D_atlas "is not operable" (Elog:47562)
      • INFN-T1: job failures due to "Staging input file failed" happened again on Jan. 6th (GGUS:99982)

  • CMS reports (raw view) -
    • Happy new year! As one might imagine, it's been very quiet.
    • Just one open GGUS ticket right now (and not even sure why): GGUS:100014 -- T1_UK_RAL failed HammerCloud tests on 31/12. Statement today that it was due to CASTOR load from 5000 running CMS jobs, well above the pledge. Presumably it is now resolved?

  • ALICE -
    • happy new year!
    • besides a reprocessing campaign on Dec 27-30, not a lot of activity during the break

  • LHCb reports (raw view) -
    • reprocessing of Proton-Ion collisions almost finished (GRIDKA & CERN)
    • At other sites the main activities are simulation & user jobs
    • T0:
    • T1:
      • GRIDKA: problems with staging of files, issue resolved after vendor intervention (GGUS:99972)

Sites / Services round table:

  • ASGC
    • there were transient high-temperature alarms during the break, leading to a few service instabilities; all OK now
  • BNL - ntr
  • GridPP
    • there were a few transient problems during the break, otherwise things were quiet
    • the UK CA intends to switch to SHA-2 by mid-March; a quick way to check a certificate's signature algorithm is sketched at the end of this round table
  • IN2P3
    • there have been difficulties with the SRM service for a few weeks, affecting ATLAS and CMS
      • the SRM occasionally stops working, then gets restarted automatically
      • the restart sometimes fails, requiring manual intervention
      • the issue does not look related to the length of the proxies
      • a ticket has been opened for the dCache developers
  • KISTI
    • the WNs were migrated to new HW during the break
    • they were first affected by the OpenSSL issue, but have been working OK since downgrading OpenSSL
      • see OpenSSL item in the AOB below
  • NDGF
    • there were small issues affecting some pools during the break, otherwise things were quiet
  • NLT1
    • the DDN equipment trouble reported in Dec has been resolved with the move to new HW
      • a SIR will be provided in the coming days
    • SARA will soon connect to LHCONE; more details will be provided
      • an earlier connection plan had in fact been canceled a few months ago
  • OSG
    • at the request of CMS, the OSG CA has delayed its switch to SHA-2 certificates until Feb 11
  • RAL
    • RAL systems ran well over the break

  • GGUS
    • ticket 100k was reached during the break: congratulations!
    • there was 1 alarm and it was handled OK
  • storage
    • srm-lhcb instabilities not understood; workaround available; to be investigated by the developers
    • EOS SRM interruptions to be scheduled for upgrade to support SHA-2 and move to new HW
      • this quiet week should be a good time
    • EOS-ALICE head node needs to be switched to new HW
      • ditto
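  • Several items above concern the migration to SHA-2 certificates. As a quick check, "openssl x509 -in <cert> -noout -text" shows the Signature Algorithm; the hedged Python sketch below does the same programmatically (it assumes the third-party "cryptography" package is available, and the file path is a placeholder).

      import sys
      from cryptography import x509  # third-party package, assumed to be installed

      def signature_algorithm(pem_path):
          # Load a PEM-encoded certificate and return the name of its signature
          # hash algorithm, e.g. 'sha1' (old-style) or 'sha256' (SHA-2 family).
          with open(pem_path, "rb") as f:
              cert = x509.load_pem_x509_certificate(f.read())
          return cert.signature_hash_algorithm.name

      if __name__ == "__main__":
          path = sys.argv[1] if len(sys.argv) > 1 else "usercert.pem"  # placeholder path
          print(path, "->", signature_algorithm(path))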

AOB:

  • OpenSSL issue:
    • failures for 512-bit key proxies will continue until such proxies are no longer generated by intermediary services (see the ATLAS report)
    • so far the common component has been gridsite; its latest versions have the default key length increased to 1024 bits
    • at least the EMI WMS (SL5 and SL6) and CREAM (SL6) are affected
      • the FTS-3 instances were dealt with ad-hoc; a proper update is expected in a few days
    • Simon Fayer of Imperial College concluded:
      • If both endpoints are running the newest OpenSSL, they negotiate TLS 1.2, which doesn't support 512-bit keys...
      • This is why only pairs of fully upgraded machines fail (if either end is older, TLS 1.1 is used instead).
    • we will probably keep seeing fallout during the next few weeks, as more "broken" service instances get identified and hopefully upgraded ASAP...
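    • To make the behaviour above easier to verify, here is a minimal Python sketch (not an official WLCG tool; host and port are placeholders, and GSI-enabled grid services may additionally require client credentials, which this sketch does not load). It simply reports which TLS protocol a given endpoint negotiates with this client. The key length of a local proxy can be checked with "openssl x509 -in <proxy file> -noout -text" (look for the "Public-Key" size).

        import socket
        import ssl
        import sys

        def negotiated_tls_version(host, port, timeout=10):
            # We only care about the negotiated protocol version, so skip
            # certificate verification (grid CAs are usually not in the
            # default trust store anyway).
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            with socket.create_connection((host, port), timeout=timeout) as sock:
                with ctx.wrap_socket(sock, server_hostname=host) as tls:
                    return tls.version()  # e.g. 'TLSv1.2' or 'TLSv1.1'

        if __name__ == "__main__":
            host = sys.argv[1] if len(sys.argv) > 1 else "example.org"  # placeholder endpoint
            port = int(sys.argv[2]) if len(sys.argv) > 2 else 443
            print("%s:%d negotiated %s" % (host, port, negotiated_tls_version(host, port)))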

Thursday

Attendance:

  • local: Felix, Jan, Maarten, Manuel, Stefan, Zbyszek
  • remote: David, Dennis, Gareth, Jeremy, Kyle, Michael, Rolf, Sang-Un, Saverio, Thomas, Xavier

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • OpenSSL issue (elog:47369)
      • ATLAS_CC degraded: voatlas66 suffered from a disk problem (elog:47579)
      • PilotFactories degraded: poor monitoring on the new factory aipanda017. The factory itself is fine and monitoring has been fixed. (elog:47578)
      • ATLAS_DDM_VOBOXES degraded: ss-pic-ndgf(atlas-SS03) slowed down by DaTRI requests (elog:47587)
    • T1
      • RAL-LCG2: transfer failures with "source file doesn't exist" (GGUS:99768); waiting for a reply
      • Staging errors at FZK-LCG2 (GGUS:99959)
      • Staging input file failed at INFN-T1 (GGUS:99982)

  • CMS reports (raw view) -
    • Still seemingly in the post-holiday calm -- running 2011 legacy MC rereco and 13 TeV MC at T1 sites
    • GGUS:100082 regarding disk/tape separation at IN2P3 -- seemingly bridged from Savannah when it was closed? In any case, the ticket is closed.
    • GGUS:100088: SAM CE tests went critical during a temporary reconfiguration to the DISK PhEDEx endpoint in the Trivial File Catalog

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activities are Monte Carlo and user jobs
    • T0:
      • The move to the new SRM (SHA-2 enabled) was not successful; the previous version was restored.
        • Jan: a patch is again needed for the SRM to support the root protocol that LHCb jobs expect to use; this will be followed up
        • Maarten: this may become urgent, depending on the uptake of SHA-2 user certificates within LHCb; if a patch is not available "soon", LHCb may have to look into a change in their code instead
    • T1:

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • GridPP - ntr
  • IN2P3
    • our dCache service has been in an unscheduled downtime since ~1:30 pm, expected to last until 4 pm, because of SRM instabilities; the developers have been contacted; as the issue looks strongly related to the load, we may need to limit the load until a fix has been applied
    • the French CA will switch to SHA-2 as of Jan 14, after which only SHA-2 will be available for new certs
  • KISTI - ntr
  • KIT - ntr
  • NDGF
    • minor downtime on Mon to apply SW updates on dCache pool nodes
  • NLT1 - ntr
  • OSG
    • reminder: the switch to SHA-2 has been delayed from Jan 14 to Feb 11, after which only SHA-2 will be available for new certs
  • PIC
    • a ticket was opened for ATLAS to look into the overload caused by SRM requests from the CERN FTS-3 pilot used by the functional tests; other sites may have been affected similarly; those tests have been moved back to FTS-2 for now
      • the matter appears to have been due to the OpenSSL issue: the FTS-3 pilot service is not yet fixed (GGUS:100125)
  • RAL - ntr

  • databases
    • during the end-of-year break the ADCR DB suffered an Active Data Guard failure due to file system corruption caused by 3 silent disk failures:
      • bad disks have been removed
      • no new corruptions seen
      • ongoing checks of all data will take until some time next week
      • meanwhile clients are redirected as needed to another standby DB
  • grid services
    • network intervention between CERN and Wigner for ~10 min on Tue may cause job failures during that period (link)
    • FTS-3 pilot instance probably will get fixed next week (GGUS:100125)
  • storage
    • SRM upgrade OK for EOS-ATLAS
    • SRM upgrade for EOS-CMS required a Java config adjustment to support proxies with 512-bit keys
      • Maarten: that change could eventually be needed also for ATLAS and LHCb
      • see AOB below
    • SRM upgrade for EOS-LHCb: see LHCb report
    • EOS-ATLAS to be restarted on Mon for a config change to increase the Rucio renaming speed
    • network intervention between CERN and Wigner for ~10 min on Tue may cause data access failures during that period (link)

AOB:

  • OpenSSL issue:
    • another broadcast will be sent in the coming days
    • Java-based services may also need config adjustments to support proxies with 512-bit keys, e.g.:
      • jdk.certpath.disabledAlgorithms=MD2, RSA keySize < 512
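    • As a rough aid (a sketch only, not an official procedure; the java.security path below is an assumed, typical location and varies per JVM), the Python snippet below prints the jdk.certpath.disabledAlgorithms value of a given JVM installation so it can be compared with the adjusted setting quoted above:

        import sys

        def disabled_algorithms(java_security_path):
            # Return the value of jdk.certpath.disabledAlgorithms, or None if the
            # property is not set; in stock java.security files it is a single line.
            with open(java_security_path) as f:
                for line in f:
                    line = line.strip()
                    if line.startswith("jdk.certpath.disabledAlgorithms"):
                        return line.split("=", 1)[1].strip()
            return None

        if __name__ == "__main__":
            # Assumed default path for an OpenJDK 7 JRE; pass the path of the JVM
            # actually used by the Java-based service as an argument instead.
            default_path = "/usr/lib/jvm/java-1.7.0/jre/lib/security/java.security"
            path = sys.argv[1] if len(sys.argv) > 1 else default_path
            value = disabled_algorithms(path)
            print(value if value is not None else "property not set (JVM built-in default applies)")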

-- SimoneCampana - 16 Dec 2013
