WLCG Operations Coordination Minutes - VIRTUAL - January 16, 2014

Agenda

Attendance

  • Local:
  • Remote:

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • No change in baselines since last meeting.
  • OpenSSL issue update

Tier-1 Grid services

Storage deployment

Site | Status | Recent changes | Planned changes
CERN | CASTOR: v2.1.14-5 and SRM-2.11-2 on all instances. EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4), ATLAS (EOS 0.3.4 / xrootd 3.3.4 / BeStMan2-2.2.2), CMS (EOS 0.3.2 / xrootd 3.3.4 / BeStMan2-2.2.2), LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.2.2) |  | 
ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1 | none | none
BNL | dCache 2.6.18 (Chimera, Postgres 10 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None
CNAF | StoRM 1.11.2 emi3 (ATLAS, CMS, LHCb) |  | Upgrade to StoRM 1.11.3 (ATLAS) today, Jan 16
FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd=2.2.3; Scalla xrootd 2.9.7/3.2.7.slc; EOS 0.3.2-4 / xrootd 3.3.3-1.slc5 with BeStMan 2.2.2.0.10 |  | 
IN2P3 | dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes, Postgres 9.2, xrootd 3.0.4 |  | 
KISTI | xrootd v3.2.6 on SL5 for disk pools, xrootd 20100510-1509_dbg on SL6 for tape pool, DPM 1.8.7-3 |  | 
KIT | dCache: atlassrm-fzk.gridka.de 2.6.17-1, cmssrm-kit.gridka.de 2.6.17-1, lhcbsrm-kit.gridka.de 2.6.17-1. xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.3-1 |  | 
NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes |  | 
NL-T1 | dCache 2.2.17 (Chimera) at SURFsara; DPM 1.8.7-3 at NIKHEF |  | 
PIC | dCache head nodes (Chimera) and doors at 2.2.23-1; xrootd door to VO servers (3.3.4) |  | 
RAL | CASTOR 2.1.13-9, 2.1.13-9 (tape servers), SRM 2.11-1 |  | 
TRIUMF | dCache 2.2.18 |  | 
JINR-T1 | dCache: srm-cms.jinr-t1.ru 2.6.19, srm-cms-mss.jinr-t1.ru 2.2.23 with Enstore; xrootd federation host for CMS: 3.3.3 |  | 

FTS deployment

Site | Version | Recent changes | Planned changes
CERN | 2.2.8 - transfer-fts-3.7.12-1 |  | 
ASGC | 2.2.8 - transfer-fts-3.7.12-1 |  | 
BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None
CNAF | 2.2.8 - transfer-fts-3.7.12-1 |  | 
FNAL | 2.2.8 - transfer-fts-3.7.12-1 |  | 
IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 |  | 
KIT | 2.2.8 - transfer-fts-3.7.12-1 |  | 
NDGF | 2.2.8 - transfer-fts-3.7.12-1 |  | 
NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 |  | 
PIC | 2.2.8 - transfer-fts-3.7.12-1 |  | 
RAL | 2.2.8 - transfer-fts-3.7.12-1 |  | 
TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 |  | 
JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 |  | 

LFC deployment

Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans
BNL | 1.8.3.1-1 (for T1 and US T2s) | SL6, gLite | Oracle 11gR2 | ATLAS | None
CERN | 1.8.7-3 | SLC6, EPEL | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | 

Other site news

Experiments operations review and Plans

ALICE

  • besides a reprocessing campaign on Dec 27-30, there was not much activity during the break
  • central services moved to new HW on Jan 14-15
    • also more efficient power/UPS distribution and cooling
    • Torrent no longer available
  • CERN
    • high error rates after central services restart, to be debugged
    • SLC6 vs. SLC5 job failure rates and CPU/wall-time efficiencies
      • Wigner jobs have been running with Torrent since Dec 20 to test whether SL6 builds behave better than the SL5 builds used via CVMFS: no improvement
      • to be continued...
  • CVMFS
    • almost all sites using it in production
    • a few in various stages of preparation
    • sites: please ensure the WNs have version 2.1.15 (or higher); a simple version check is sketched after this list
  • SAM (reminder)
    • ALICE sites that care about the SAM Availability and Reliability reports should ensure they look OK in the ALICE SAM tests, which will be used instead of the Ops tests as of Jan 2014
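For illustration only (not part of the minutes): a minimal sketch of how a site could check the CVMFS client version against the 2.1.15 baseline requested above. It assumes an RPM-based worker node with the cvmfs package installed; tarball installations would need a different check.

    # Minimal sketch: compare the installed cvmfs RPM version against the
    # 2.1.15 baseline requested by ALICE. Assumes an RPM-based worker node.
    import subprocess

    REQUIRED = (2, 1, 15)

    def installed_cvmfs_version():
        # Query the version of the installed cvmfs package, e.g. "2.1.15".
        out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", "cvmfs"])
        return tuple(int(part) for part in out.decode().strip().split("."))

    if __name__ == "__main__":
        version = installed_cvmfs_version()
        status = "OK" if version >= REQUIRED else "older than 2.1.15, please upgrade"
        print("CVMFS %s: %s" % (".".join(map(str, version)), status))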

ATLAS

  • Restarting the push for FTS3: adding more sites to the RAL FTS3 instance
  • The SAM CVMFS probe is being integrated: issues are now being sorted out in pre-production
  • MultiCore production has started on a limited number of sites, which are now working on their internal scheduling configuration (as part of the MultiCore task force)

CMS

  • Processing / Production over Xmas break
    • Went reasonably fine
    • Best effort from sites and experiment teams

  • Filling Tier-1s with real and backfill work
    • Try to run processing machinery under load
    • Identify bottlenecks
    • Want to extend backfill at Tier-2 level to generate full load
    • Needed to start exercising for next year

  • CMS file catalog to be migrated from DBS2 to DBS3
    • Preliminarily scheduled for the week of Feb 10
    • Will take several days
    • Production needs to be drained during the week before

  • Reminder: CMS intends to make the gLExec SAM test critical ~ end of January 2014

LHCb

  • Low production activity at the moment with simulation productions only
  • pA/Ap reprocessing campaign started before Xmas and finished in the last days of the Xmas break
  • Working on decommissioning the WMS for the last remaining sites (only small sites)

Ongoing Task Forces and Working Groups

FTS3 Deployment TF

  • Resumed performance tests after holiday break.

gLExec deployment TF

  • 69 tickets closed and verified, 26 still open
  • EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tar ball WN and do not have the Perl module Time/HiRes.pm installed (GGUS:98767)
    • installation of that dependency now looks sufficient to cure the problem
    • a proper fix is still to be decided
  • Deployment tracking page

IPv6 validation and deployment TF

  • Job submission to a CE via gLite WMS, with both sides communicating over IPv6, was successfully tested
  • A similar test via Condor-G failed, due to a bug in HTCondor, which uses a rather old version of CREAM (condor-8.1.3/externals/bundles/cream/1.12.1_14). Waiting for feedback from the developers
  • CMS tests
    • The next steps are to get a BeStMan SE and an OSG CE to run data transfer and job submission tests respectively. If successful, will proceed to test glideinWMS in dual stack to submit jobs to IPv6-only CEs and WNs. Most central services do not need to be tested on IPv6 for the time being (typically those services that do not directly interact with site resources).

Machine/Job features TF

  • the mjf.py client has been installed on SLC5 (bare metal) and SLC6 (virtualised) CERN worker nodes, and the machine/job features are being fed to those nodes (a minimal reader is sketched after this list)
  • an RPM for the client is available in the WLCG repository
  • VOs are asked to try these out and report any problems/issues discovered, within a month if possible
    • at a later stage the service and client will be rolled out to more sites
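For illustration only (not part of the minutes): under the HEPiX machine/job features convention, the MACHINEFEATURES and JOBFEATURES environment variables point to directories containing one small file per key. The sketch below shows how a VO payload could read whatever keys a node publishes; no particular key names are assumed.

    # Minimal sketch: read machine/job features on a node where they are
    # provided, assuming MACHINEFEATURES and JOBFEATURES point to directories
    # with one value per file (HEPiX machine/job features convention).
    import os

    def read_features(env_var):
        directory = os.environ.get(env_var)
        if not directory or not os.path.isdir(directory):
            return {}  # features not published on this node
        features = {}
        for key in os.listdir(directory):
            path = os.path.join(directory, key)
            if os.path.isfile(path):
                with open(path) as handle:
                    features[key] = handle.read().strip()
        return features

    if __name__ == "__main__":
        print("machine features:", read_features("MACHINEFEATURES"))
        print("job features:", read_features("JOBFEATURES"))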

Middleware readiness WG

  • the next meeting is planned for Feb 6 at 16:00 CET
    • it will discuss in particular how to involve experiments and sites

Multicore deployment

  • The call for the first meeting has been sent to TF participants. We will start next week; the date is yet to be confirmed (see the doodle)
    • we would like to dedicate the first meeting to a summary of the current situation, which is so far dominated by the ATLAS and CMS experiences
    • we will also agree on the organization of subsequent meetings, in particular their frequency: probably one meeting per week initially, given the fast pace of the ongoing activities, then one every two weeks

perfSONAR deployment TF

  • all remaining sites (indicated by the experiments) that have not deployed perfSONAR have been ticketed; progress is slow and there are many negative answers (hardware not available)
  • a perfSONAR-lite installation is being discussed with Internet2/ESnet
  • work ongoing to decide how to replace the perfSONAR modular dashboard
  • almost all sites opened access to the perfSONAR main page

SHA-2 Migration TF

  • very few services monitored by EGI are not yet ready for SHA-2 (a simple per-service check is sketched after this list)
  • FNAL intends to be ready by mid or late Feb
    • OSG CA switch to SHA-2 postponed to Feb 11 on request of CMS
  • EOS SRM instances
    • OK for ATLAS and CMS
    • not yet OK for LHCb: a patch will need to be ported to let the new SRM support the root protocol expected by LHCb jobs
      • OSG BeStMan support ticket opened
  • the CERN CA has delayed making SHA-2 the default to Feb 11 as well
    • SHA-1 remains available until June
  • the UK CA intends to switch mid March
  • the French CA has switched on Jan 14
  • VOMRS
    • the VOMS-Admin instability has been fixed (Java 7 --> 6)
      • many thanks to Valerio Venturi (RIP) for debugging this!
    • the VOMS-Admin test cluster is being reinstalled and should become available in the coming days
  • timelines
    • by mid Feb the WLCG infrastructure is expected to be essentially ready
    • the migration from VOMRS to VOMS-Admin may take longer
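As an aside (not part of the minutes), the sketch below shows one way to check whether a given service already presents a SHA-2-signed host certificate. It assumes the third-party Python "cryptography" package is available, and the host name is a placeholder.

    # Minimal sketch: report the signature hash algorithm of a service's host
    # certificate, to distinguish SHA-1 from SHA-2 (sha256/sha384/sha512).
    # Requires the third-party "cryptography" package; the host is a placeholder.
    import ssl
    from cryptography import x509

    def signature_hash(host, port=443):
        pem = ssl.get_server_certificate((host, port))
        cert = x509.load_pem_x509_certificate(pem.encode())
        return cert.signature_hash_algorithm.name  # e.g. "sha1" or "sha256"

    if __name__ == "__main__":
        algorithm = signature_hash("se.example.org")  # placeholder host name
        sha2 = algorithm in ("sha224", "sha256", "sha384", "sha512")
        print("signature hash: %s (%s)" % (algorithm, "SHA-2" if sha2 else "not SHA-2"))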

Tracking tools evolution TF

WMS decommissioning TF

  • usage of the CMS WMS at CERN has remained low in recent weeks
  • VOs using the "shared" WMS instances will be contacted for migration plans

xrootd deployment TF

Action list

  1. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated
  2. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
    • in progress
  3. Collect feedback from VOs about need for grid-cert-info and setting EMI-UI 2.0.3 as baseline.
    • closed

It was agreed that the last action item can just be closed.

-- AndreaSciaba - 13 Jan 2014
