WLCG Operations Coordination Minutes - 4th October 2012

Agenda

Attendance

  • Local: Maria Girone (chair), Andrea Sciabà (secretary), Maria Dimou, Eva Dafonte, Alexandre Beche, Maarten Litmaath, Ian Fisk, Simone Campana, Oliver Keeble, Ikuo Ueda, Stephen Gowdy, Maite Barroso, Andrea Valassi, Nicolò Magini, Stefan Roiser, Alexei Klimentov, Joel Closier
  • Remote: Massimo Sgaravatto, Gonzalo Merino, Daniele Bonacorsi, Ron Trompert, Jhen-Wei Huang, Rob Quick, Mine Altunay, Stephen Burke, Jeremy Coles, Gareth Smith, Anthony Tiradani, Christoph Wissing

Task Force reports

CVMFS (Stefan)

The TF is looking for sites to help test new CVMFS versions, in particular two new features: NFS export and shared cache. A mailing list was created for the TF and another for the sites that still have to deploy CVMFS, while the Twiki is in preparation. Since September 5 several sites have installed CVMFS. There is a lot of interest in installing the EMI WN tarball distribution under CVMFS. The next step is to contact sites to learn about their CVMFS installations (or plans to install it) and collect the information in the Twiki.
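As an illustration of what such a client deployment involves, a minimal configuration touching the two new features might look like the sketch below (written to /tmp here for safety; on a real worker node this would be /etc/cvmfs/default.local, and the repository list, proxy URL and quota are placeholders, not recommendations):

```shell
# Illustrative CVMFS 2.1-style client configuration (values are placeholders).
cat > /tmp/default.local <<'EOF'
CVMFS_REPOSITORIES=atlas.cern.ch,grid.cern.ch
CVMFS_HTTP_PROXY="http://squid.example.org:3128"
CVMFS_QUOTA_LIMIT=20000
CVMFS_SHARED_CACHE=yes
CVMFS_NFS_SOURCE=yes
EOF
# CVMFS_SHARED_CACHE=yes: one local cache shared by all repositories.
# CVMFS_NFS_SOURCE=yes: run the client in NFS export mode.
cat /tmp/default.local
```

Parameter names follow the CVMFS documentation of the 2.1 series; sites should check them against the version they actually deploy.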

gLExec

PerfSonar (Simone)

Yesterday we had the first TF meeting. The expertise coverage is good (only ALICE is missing) and we do not plan to use a mailing list. perfSONAR is now widely deployed in USATLAS and USCMS, in the LHCOPN and at several European ATLAS sites. However, the current methodology and configuration will not scale to more sites, as the same links are tested multiple times: a new approach will be needed. The next steps are:
  • the experiments need to define their priorities for sites and links; ATLAS already did it
  • push the deployment of the PS service on the priority sites, initially using a basic configuration (documentation is already available from ATLAS and CMS and will be unified)
  • validate the mesh configuration capability of PS, to simplify configuration, and define the testing patterns
  • verify the compatibility between the two PS flavours (Mario Reale offered to do it)
  • officially add PS as a WLCG site service in OIM and GOCDB (as it has to be done for xrootd, squid, etc.)
  • ensure that monitoring scales as well (ongoing development on the BNL PS dashboard)
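The scaling concern above can be made concrete with a back-of-the-envelope count (illustrative only, not from the meeting): if every site independently schedules tests to every other site, the number of tests grows quadratically and each link is measured twice, whereas a centrally managed mesh configuration can assign each link once.

```python
# Back-of-the-envelope scaling sketch (illustrative, not a perfSONAR tool).
def full_mesh_tests(n_sites: int) -> int:
    """Ordered host pairs: every site tests every other site,
    so each link is measured twice (once per direction)."""
    return n_sites * (n_sites - 1)

def unique_link_tests(n_sites: int) -> int:
    """Unordered pairs: a central mesh config can assign each link once."""
    return n_sites * (n_sites - 1) // 2

for n in (10, 50, 100):
    print(f"{n} sites: {full_mesh_tests(n)} tests vs "
          f"{unique_link_tests(n)} unique links")
```

For 100 sites this is already 9900 scheduled tests versus 4950 distinct links, which is why a mesh configuration capability matters.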

Tracking tools (Maria D.)

The e-group wlcg-ops-coord-tf-tracktools@cern.ch was created to contain the members of this TF. WLCG-Ops members with expertise in tracking tools (GGUS, SNOW, Savannah, Trac, Jira) are encouraged to self-subscribe, pending the owner's approval. In this meeting we shall only discuss the future adoption of SHA-2 for the GGUS service certificate; notes on this decision will be recorded in the dedicated section below and in Savannah:132379, where this issue is being tracked.

Mine says that work is still ongoing to understand the impact of the SHA-2 migration. It was not clear if the change in GGUS was to happen in the test environment or in production; anyway the testing is not finished.

SHA-2 migration (Maarten, Mine)

We are in the process of building the TF and figuring out who should be a member. There is already considerable middleware testing ongoing in OSG, EMI and EGI. So far, problems in dCache and BeStMan were found and fixes are being worked on. In EMI, even the latest versions of some components are not compliant, and providing fixes is a high-priority activity. The experiment software also needs to be validated.

On the EGI side a validation infrastructure is being set up and will use the ops VO to run tests; the kind of tests to run is being discussed (e.g. SAM tests). This is important because the validation done by the developers themselves might not cover all relevant use cases. We should aim at having a single validation infrastructure that may be used also by experiments. The picture will be much clearer in the coming weeks.

The DOE Grid CA has now a full, production-level CA infrastructure that can be used for testing. It might be advantageous to use it in general.
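For sites or experiments wanting to check which signature algorithm a given certificate uses, standard OpenSSL tooling suffices. A hedged example follows; the self-signed certificate is generated purely for illustration, and a real check would instead point at e.g. a service's host certificate under /etc/grid-security:

```shell
# Create a throwaway SHA-256-signed self-signed certificate for illustration.
openssl req -x509 -newkey rsa:2048 -sha256 -nodes -days 1 \
    -subj "/CN=sha2-test" -keyout /tmp/sha2_key.pem -out /tmp/sha2_cert.pem

# A SHA-2 family certificate reports e.g. "sha256WithRSAEncryption";
# a legacy SHA-1 one would show "sha1WithRSAEncryption".
openssl x509 -in /tmp/sha2_cert.pem -noout -text | grep 'Signature Algorithm'
```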

FTS 3 integration and deployment

  • co-locate FTS3 task force meeting and FTS3 demos
  • share the fts3-steering@cern.ch mailing list; interested people are invited to subscribe.
    • This will be the mailing list where all the technical and organizational discussions will take place.
    • Bugs and feature requests will be followed up as usual in trac (and not in the mailing list)

  • More details about meetings:
    • We propose to fix Wednesday at 17:00 CET as the time slot, using the same slot for both activities and giving priority to the one that needs more time.
    • Frequency of the meetings: every three weeks (as for the FTS demo right now), keeping in mind that at particular moments they may need to be more frequent.
  • Next meeting of the task force will be focused on the FTS demo: Wednesday, October 17 from 17:00 https://indico.cern.ch/conferenceDisplay.py?confId=211609

Middleware deployment

  • The EGI security dashboard will flag hosts found running services that are unsupported (gLite 3.1 and the majority of gLite 3.2 services). The COD team will open tickets for the corresponding sites to upgrade or decommission unsupported services by the end of this month, or risk suspension of the entire site.

A discussion follows about a seeming contradiction with the WLCG management position, according to which sites may run unsupported services at their own risk but without the penalty of suspension. As Maarten explains, EGI sites also have an MoU with EGI, with which they must comply. Alexei warns against upgrading to not-yet-validated versions before the end of data taking, and proposes to discuss this again at next week's GDB. On the other hand, Ian points out, only T0 and T1 sites are critical for data taking, and they ought not to run obsolete software. Maarten reminds the audience that the end of data taking is far from being the end of computing: there will be the winter conferences, the parked data processing, the summer conferences, the summer vacation, and in the autumn there may be a big reprocessing campaign, etc. There is never a good time to upgrade, and we would thus end up running obsolete software until some time in 2014. Maarten said that sites need to start doing what they have been neglecting.

Concerning the TF, it is still to be decided whether to hold regular meetings; suggestions are welcome.

XrootD

No report.

Squid monitoring

No report.

WMS future

No report.

News from other WLCG working groups

No report.

Experiment operations review and plans

ALICE (Maarten)

  • central AliEn services partial upgrade to v2-20 done on Mon Oct 1
    • DB performance improvements of up to a factor of 40
    • job resubmission faster by two orders of magnitude
  • conditions data migration to OCDB name space in EOS done
    • thanks to IT-DSS!
  • ALICE_DISK CASTOR to EOS migration ongoing
    • data has been copied by IT-DSS for a while now, thanks!
    • actual deletion of CASTOR replicas requires careful coordination with ALICE!
  • ongoing site migration to Torrent for SW provisioning
    • 40 out of 72 done

ATLAS (Simone)

ATLAS is validating EMI-1/2 with respect to production and analysis, also using direct I/O (for which the SE flavour is relevant), with the help of a few sites. Some combinations still need testing (dCache and EOS), but the validation against DPM, StoRM and CASTOR is positive. There is a well-known bug in lcg_util, but a fix is already available (though not in an official release until next week). The next steps also include installation under CVMFS and SL6+EMI-2 validation; in particular the physics validation of Athena is still to be done and will require some weeks, due to the need to produce several million events. If successful, SL6 will be approved by ATLAS.

Joel adds that LHCb is also validating SL6.

CMS (Ian)

CMS Computing Ops: overview (last 2 weeks)
  • CERN/T0:
    • TS until Sep-24, normal Ops after that until today
    • 2 inaccessible files holding up the T0 Prompt Calib loop
    • scheduled intervention (2 hrs) in CMS T0 for the usual DB/ProdAgent cleanups during TS
    • some load put on Castor by standard ops and T0 tests
  • T1
    • in general smooth running; only minor (and quickly solved) issues at ASGC and CNAF
  • T2
    • A number of glide-in issues reported, i.e. lack of jobs being scheduled; these were not site issues, though
    • MC production is proceeding smoothly in all campaigns
CMS Computing Ops: plans (next 1-2 weeks)
  • continue processing, e.g. all MC campaigns, mostly Summer12 GEN-SIM and DIGI-RECO (plus FASTSIM)
  • an analysis burst may be observed due to HCP’12

LHCb (Stefan)

  • T1 sites can now decommission the read-only LFC, the Oracle conditions database and Oracle streaming. The read-only and read-write LFC's at CERN will still be used, and conditions data will be distributed via CVMFS.
  • A full reprocessing was started on September 17, using T1 sites for full reco workflows and some T2s for data reco; it will probably last until the end of the year.
  • LHCb started using EOS, not only as buffer but also for physics data; thanks to the EOS team for their responsiveness to the issues we had.
  • A major incident at CNAF, caused by a hardware intervention, made the LHCb storage unavailable for 10 days. In order to recover, CNAF granted us a temporary increase in the number of job slots (for which we thank them).
  • Starting this year, we began attaching T2 sites to T1 storage to help with data processing; when this happens, the number of running jobs increases. Only sites with CVMFS are selected.

A discussion follows about the need to have Oracle at T1 sites. The only remaining services needing Oracle are FTS and Frontier, but FTS 3 can also use MySQL (and it is entirely possible that it will be deployed only at CERN). Frontier needs more discussion.
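As a rough illustration of the MySQL option mentioned above, the database section of an FTS3 server configuration might look like the sketch below. The file location (/etc/fts3/fts3config) and key names are assumptions based on the FTS3 development documentation of the time, and the credentials and host are placeholders; this is not a tested setup:

```ini
; Hypothetical FTS3 database section using MySQL instead of Oracle
; (key names and values are assumptions, for illustration only)
DbType=mysql
DbUserName=fts3
DbPassword=CHANGE_ME
DbConnectString=mysql-host.example.org:3306/fts3db
```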

GGUS tickets

No issues were submitted by sites or experiments.

Tier-1 Grid services

Storage deployment

| Site | Status | Recent changes | Planned changes |
| ---- | ------ | -------------- | --------------- |
| CERN | CASTOR 2.1.12-10 and 2.1.13-3; SRM-2.11 for all instances; EOS 0.2 for all instances | EOS upgraded | going to CASTOR 2.1.13-5 |
| ASGC | CASTOR 2.1.11-9; SRM 2.11-0; DPM 1.8.2-5 | Sep 18 unscheduled downtime for power cut | Testing CASTOR 2.1.12 on testbed |
| BNL | dCache 1.9.12.10 (Chimera, Postgres 9 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None |
| CNAF | StoRM 1.8.1 (ATLAS, CMS, LHCb) | | |
| FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd 2.2.3; Scalla xrootd 2.9.7/3.2.2-1.osg; Oracle Lustre 1.8.6; EOS 0.2.17-1/xrootd 3.2.2-1.osg with BeStMan 2.2.2.0.10 | EOS and xrootd upgraded | |
| IN2P3 | dCache 1.9.12-16 (Chimera) on core servers and pool nodes; new hardware (more RAM, SSD disks) for Chimera and SRM servers (with SL6); Postgres 9.1 | | |
| KIT | dCache: atlassrm-fzk.gridka.de 1.9.12-11 (Chimera); cmssrm-fzk.gridka.de 1.9.12-17 (Chimera); gridka-dcache.fzk.de 1.9.12-17 (PNFS); xrootd (version 20100510-1509_dbg) | | upgrade to xrootd 3.1 underway |
| NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes | dCache upgraded | Plan to upgrade to 2.4 when it is released |
| NL-T1 | dCache 2.2.4 (Chimera) (SARA); DPM 1.8.2 (NIKHEF) | dCache upgraded | |
| PIC | dCache 1.9.12-20 (Chimera) | | |
| RAL | CASTOR 2.1.11-8/2.1.12-10; 2.1.11-8/2.1.12-10 (tape servers); SRM 2.11-1 | updated ATLAS to 2.1.12 | update remaining instances to 2.1.12 during Oct |
| TRIUMF | dCache 1.9.12-19 with Chimera namespace | dCache upgraded | |

FTS deployment

| Site | Version | Recent changes | Planned changes |
| ---- | ------- | -------------- | --------------- |
| CERN | 2.2.8 - transfer-fts-3.7.7-2 | | |
| ASGC | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| BNL | 2.2.8 | None | None |
| CNAF | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| FNAL | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| IN2P3 | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| KIT | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| NDGF | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| NL-T1 | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |
| PIC | 2.2.8 - glite-data-fts-service-3.7.10-1 | | |
| RAL | 2.2.8 - glite-data-fts-service-3.7.10-1 | | |
| TRIUMF | 2.2.8 - glite-data-fts-service-3.7.7-2 | | |

LFC deployment

| Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans |
| ---- | ------- | ---------------- | ------- | -------- | ------------- |
| BNL | 1.8.0-1 (T1), 1.8.3.1-1 (US T2s) | SL5, gLite? | Oracle | ATLAS | None |
| CERN | 1.8.2-0 | SLC5, gLite? | Oracle | ATLAS, LHCb | |
| CNAF | 1.8.2-2 | SL5, gLite? | Oracle | LHCb | |
| IN2P3 | 1.8.2-2 | SL5, gLite? | Oracle 11g | LHCb | |
| KIT | 1.8.2-2 | SL5, gLite? | Oracle | LHCb | |
| NDGF | n/a | n/a | n/a | none | decommissioned |
| NL-T1 | 1.8.2-2 | CentOS5, gLite? | Oracle | None | |
| PIC | 1.8.2-2 | SL5, gLite? | Oracle | LHCb | |
| RAL | 1.8.3.1-1 | SL5, EMI | Oracle | LHCb | |
| TRIUMF | 1.7.3-1 | SL5, gLite? | MySQL | None | will be decommissioned soon |

Other site news

Data management provider news

CASTOR news

CERN operations and development

CASTOR 2.1.13-5 released; tentative deployment at CERN on all instances in week of October 8th.

EOS news

xrootd news

dCache news

StoRM news

DPM news

Current release is 1.8.3

  • EPEL packaging, HTTP interface, synchronous GET, bugfixes
  • 32bit libraries are available via the EMI-2 testing repo (and EPEL).

Finalising 1.8.4

  • First "dmlite" release bringing performance improvements to HTTP interface
  • rewritten dpm-xrootd plugin (already production tested)
  • 32bit library support, improved draining, bugfixes.

FTS news

Patch for proxy expiration issue available and deployed at RAL and PIC, details in GGUS:81844

Only two sites run the very latest version of FTS, which fixes a sporadic problem with delegation. Oliver recommends that all sites upgrade to the latest version.

LFC news

gfal/lcg_util news

gfal/lcg_util 1.13.0 was released (to EMI and EPEL), including the first production release of gfal2 and 32-bit support for all libraries.

As part of the validation efforts for the EMI-2 WN, packages fixing several problems seen in 1.13.0 are available in the EMI testing repo.

Middleware news and baseline versions (Nicolò)

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

The baseline versions page was brought up to date, including the recommendations to sites for upgrading to EMI (using the UMD-2 repository, or EMI-2 for products not yet in staged rollout). In particular it is strongly recommended to upgrade LFC and DPM to their EMI versions (direct upgrade from gLite is supported). Clients (UI, WN) should not be upgraded until all the VOs supported at a site have given their approval, which in this discussion means ATLAS and/or CMS, assuming the WN remains on SL5. For SL6 also LHCb and ALICE should be consulted.

According to Maarten, by the end of next week there should be a matrix of supported experiments and SE types that sites can use to decide when they can upgrade. The latest EMI WN release has issues for ATLAS and CMS and the next official release is due in two weeks, but fixes can already be downloaded from the EMI-2 testing repository.

Finally, it is clarified that when a site upgrades it should choose EMI-2 over EMI-1 (except in the case of CREAM), also because EMI-1 will be supported only until the end of April and is not compliant with SHA-2 and/or RFC proxies.
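A site wanting to script an "am I at the baseline?" check could use a small version comparison along these lines. This is an illustrative sketch, not a WLCG-provided tool; the version strings are examples taken from the tables above:

```python
# Compare RPM-style version strings against a baseline (illustrative only).
def parse_version(v: str) -> tuple:
    """Split e.g. '1.8.3-1' into a comparable tuple of integers."""
    return tuple(int(p) for p in v.replace("-", ".").split("."))

def meets_baseline(installed: str, baseline: str) -> bool:
    """True if the installed version is at or above the baseline."""
    return parse_version(installed) >= parse_version(baseline)

print(meets_baseline("1.8.2-2", "1.8.3"))  # → False (below baseline)
print(meets_baseline("1.8.3-1", "1.8.3"))  # → True
```

Real RPM versions can contain non-numeric segments, so a production check would rather use the distribution's own comparison (e.g. rpmdev-vercmp) than this simplified parser.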

Action list

-- AndreaSciaba - 01-Oct-2012
Topic revision: r33 - 2012-10-09 - MaartenLitmaath
 