Week of 150420

WLCG Operations Call details

At CERN the meeting room is 513 R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota.
General information about the WLCG Service can be accessed from the Operations Web.
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

local: Maria Dimou (SCOD), Alberto Peon (CERN Grid services), Hervé Rousseau (CERN Storage), Stefan Roiser (LHCb), Maarten Litmaath (ALICE), Alessandro di Girolamo (ATLAS).
remote: Sang Un Ahn (KISTI), Dmytro Karpenko (NDGF), Hung-Te Lee (ASGC), Rolf Rumler (IN2P3), Matteo (CNAF), Michael Ernst (BNL), Onno Zweers (NL_T1), Pavel (KIT), Tiju Idiculla (RAL), Tommaso Boccali (CMS), Rob Quick (OSG), Di Qing (Triumf).

Experiments round table:

ATLAS reports -
- Jobs at BNL_PROD are failing with "Job killed by signal 15" (GGUS:113114) -- already followed up by BNL experts
- Transfers from SARA-MATRIX were failing with "The available CRL has expired" (GGUS:113116) -- solved
- GGUS:113085 CERN slow prestage -- already followed up by CERN experts, who are increasing the time-out from 6 hours to 2 days
- GGUS:112836 Transfer failures from Taiwan-LCG2_MCTAPE -- simply too many transfers from Taiwan tape. Taiwan experts and ADC experts are in contact

CMS reports -
- all quiet, not much activity at sites (waiting to launch RECO for 2015 MC samples), mostly analysis activity
- reached the quota limit on the EOS unmerged area, since the T0 now also writes there. Understanding and setting new quotas.
- some sites are complaining about "hot files" being requested much more often than average. This is understood at our SW level: the first pile-up file is always opened in order to read metadata. We are trying to find solutions.
- some small issues reported (not much activity on them - CHEP bias?)
  - GGUS:113049 (CERN) Castor-T0CMS instance in bad shape for ~24 hours, then recovered. It seems we saturated the number of possible concurrent transfers (the limit would be 700; we were using 300, but on only a fraction of the servers).
  - GGUS:113036 T0->T1 tests were failing for a few days, which turned all the T1s red for readiness. Mostly corrected, apart from 2 days that are still red and will need another correction.
  - GGUS:112942 (CERN) ARGUS not answering correctly for some DNs. Being followed up; for the moment cured by adding servers, but the real issue is not understood (as far as I understand).

On the last point Maria asked Alberto (T0) to convey a reminder of the relevant action 20150318-02 from the last MW Readiness WG, specifically for testing ARGUS under load.

ALICE -
- new job records on Sunday: 82k maximum, 80k+ for a few hours!
- CERN: on Fri some CASTOR files could not be accessed from some of the WNs
  - GGUS:113106, not yet resolved
- GRIF-IRFU: ALICE jobs filled /tmp with ROOT merge files on a WN
  - GGUS:113127, to be investigated

LHCb reports -
- Operations dominated by MC jobs and user analysis
- VOMS-Admin ticket GGUS:112282 currently on hold

Sites / Services round table:

ASGC: ntr
BNL: The new HTCondor deployment last week caused a problem during the weekend. After close follow-up and fixing, the job error rates are now back to acceptable levels.
CNAF: ntr
FNAL: not connected
GridPP: not connected
IN2P3: ntr
JINR: not connected
KISTI: ntr
KIT: ntr
NDGF: ntr
NL-T1: Three nodes running the new dCache suffered from a broken cron job that runs fetch-crl. This was fixed this morning. 5% of the transfers were affected by this issue.
NRC-KI: not connected
OSG: Rob asked Michael to update the BNL ticket GGUS:112737 on checksum errors. Alessandro asked what technology is used; Michael explained that this is a Classic SE on Amazon S3 storage with SRM or gridftp.
PIC: not connected
RAL: An event at the site will cause some slow response during Wednesday and Thursday this week. There will be answers in GGUS tickets but no participation on this Thursday's 3pm call.
RRC-KI: not connected
TRIUMF: The new TRIUMF-CERN link is being put in place, most probably today, and the old link is being decommissioned.

CERN batch and grid services:
- NTR
CERN storage services:
- EOS-to-CASTOR intervention for ATLAS will take place tomorrow, as agreed.
- A new behaviour in FTS (requesting an MD5 checksum) triggers a corner-case bug in CASTOR that prevents further transfers from the affected diskserver. A workaround is being implemented, but why does FTS request MD5 checksums when we agreed on ADLER32 checksums?
Databases: not present
GGUS: not present
Grid Monitoring: not present

AOB:

Thursday

Attendance:

local: Maria Dimou (SCOD), Alberto Peon (CERN Grid services), Hervé Rousseau (CERN Storage), Stefan Roiser (LHCb), Maarten Litmaath (ALICE), Kuo-Hao Ho (ASGC), Andrea Manzi (MW Officer), Pablo Saiz (GGUS & Grid Monitoring).
remote: Sang Un Ahn (KISTI), Dmytro Karpenko (NDGF), Rolf Rumler (IN2P3), Matteo (CNAF), Michael Ernst (BNL), Dennis van Dok (NL_T1), Thomas Hartmann (KIT), Christoph Wissing (CMS), Rob Quick (OSG), Di Qing (Triumf).

Experiments round table:

ATLAS reports -
- Nothing to report.

CMS -
- Nothing to report

ALICE -
- continued high activity
- CERN:
  - CASTOR: file access instabilities for re-reco jobs (GGUS:113106)
    - ad-hoc cures applied by CASTOR operations team
    - code fix being implemented
  - px402.cern.ch is failing proxy renewals for NIKHEF (GGUS:113240)

Dennis commented that they observe job failures at NIKHEF that are not VO-specific and not yet understood. Maarten's debugging for ALICE specifically showed a VOBox-CERN communication issue related to one particular myproxy server. The other two servers were working fine. The problematic server also worked fine in a test just before the meeting, although nothing was changed on that host. The matter is not yet understood.

LHCb reports -
- Operations dominated by MC jobs and user analysis
- VOMS-Admin ticket GGUS:112282 currently on hold
- a few issues with T2 sites: NIPNE FTS transfer failures are being looked into (GGUS:112044); pilots to the new DESY-ZN CEs were at first not being served, fixed now (GGUS:112700)

Stefan added at the meeting that instabilities were observed with Storage at SARA in the night of Tuesday to Wednesday. They are gone now. Dennis didn't know the cause but he would check.

Sites / Services round table:

ASGC: ntr
BNL: ntr
CNAF: ntr
FNAL: not connected
GridPP: not connected
IN2P3: The batch systems' downtime announced for Sun-to-Mon is now moved to next Tuesday evening to Wednesday noon. All jobs found running in that period will be lost.
JINR: not connected
KISTI: From Sang-Un by email: KISTI will not be accessible for 1-3 hours between 02:00 and 05:00 (CET) on 24 April 2015, in order to increase the MTU of the firewall for the large-bandwidth network. At the meeting he added that their MTU value has now been changed and that tests on the new link will start tomorrow and continue during the weekend. Maria asked if it is fine to migrate the link on a Friday. Sang-Un replied that the ALICE VO manager, Latchezar Betev, has confirmed his availability for mutual testing of the new KISTI-Chicago-Amsterdam-CERN link. The current line will be kept as a backup connection.
KIT: ntr
NDGF: Tomorrow an update of Java and of the dCache pools will take place; access to data will be unstable.
NL-T1: Help is needed from BNL to understand the reason for a BNL-SARA transfer problem, as per GGUS:113207. There might be a CRL issue. Michael promised to ask specialists to check.
OSG: ntr
PIC: (added after the meeting)
- Tuesday 28th of April, PIC will be in complete scheduled downtime due to the following reasons:
  - New firewall installation
  - dCache 2.10 minor upgrade and pool + xrootd door configuration changes
  - Upgrade of the RedHat Virtualization platform
  - Computing system upgrade to Scientific Linux 6.6
  - PIC will be in OUTAGE from 28-04-2015 07:00 [UTC] (28-04-2015 09:00 PIC local time) until 28-04-2015 14:00 [UTC] (28-04-2015 16:00 PIC local time).
RAL: not connected as announced last Monday.
TRIUMF: the dCache update to v2.10.24 took too long, but it has now been completed successfully.

CERN batch and grid services: ntr
CERN storage services: ntr
Databases: not present
GGUS:
- Release scheduled for Wednesday 29th April from 6:00 to 8:00 UTC (announced in GOCDB). The new release includes bug fixes in the GGUS interface and linkable URLs for the master-slave relations and cross-references. Security patches will also be installed, as per JIRA:1420. There will be ALARM tests as usual on the release date at the agreed times.
Grid Monitoring: ntr
MW Officer: ntr

AOB:

Today's WLCG Ops Coordination meeting is virtual. Please enter content in the minutes, which are indispensable material for the WLCG MB and GDB Service reports.
Participation by all Tier-1s is valuable and indispensable for operations. We look forward to having more of our sites connected.

-- MariaDimou - 2015-04-23

Topic revision: r17 - 2015-04-24 - MaartenLitmaath
