Summary of GDB meeting, July 10, 2013 (CERN)
Agenda
https://indico.cern.ch/conferenceDisplay.py?confId=197806
Welcome - M. Jouvin
Michel insists on getting more contributors to the summary effort (note takers)
- Every major region/country should contribute...
No GDB in August
No pre-GDB before December: no possible slot
WLCG workshop proposed November 12-13: general agreement from participants
- Clash with CMS offline SW week: to be sorted out quickly
- Proposal made later during GDB to hold it Nov 11-12 (Monday-Tuesday)
- Need to receive asap proposals for topics to discuss: idea is to have a "thematic workshop" rather than a review of everything
Security for Collaborating Infrastructures - D. Kelsey
Trust: reliability, but even more predictability, is important for IT operations.
Long history of building trust through policy across "grid infrastructures" (EGEE, EGI, WLCG, ...)
- JSPG initiative and work
- Common Grid Acceptable Use Policy defined: signed by users when joining a VO
Many new distributed computing initiatives nowadays (clouds, HPC, ...)
Security of Collaborating Infrastructures: an initiative built out of EGEE involving security officers from many initiatives/infrastructures
- EGI, OSG, EUDAT, WLCG, CHAIN, XSEDE...
- Developing a trust framework to manage cross-infrastructure security risks
- Implementations and probably policy wording don't have to be the same
- WLCG is an obvious case
SCI document v1 submitted to ISGC proceedings
- New meeting since then: new version underway
- Defines a number of requirements in 6 areas
- Operational Security
- Incident response
- Traceability: a major requirement...
- Participant responsibilities: users, communities/VOs, resource providers. Different structure of responsibilities across various DCIs
- Legal issues
- Protection and processing of personal data
- 4 levels defined to describe how these requirements are met
- Based on current best-practices: not a major change for WLCG
- Documents are/shall be technology agnostic
Future plans
- Contacts driven by authors with their infrastructure for comments and feedback
Dave will advertise to the list when the new version is available for further discussions
Discussion
- Lothar: traceability could be implemented not only by the infrastructure but also by the overlay system. Preparing a document on this.
- Is formal feedback expected from WLCG? Not presently: must wait for the final document. At that point official MB feedback will be requested.
DPM Collaboration and Plans
DPM Collaboration created as part of the sustainability plan for DPM
- Formally created with members from CERN, France, Italy, Taiwan, Japan, UK, Czech Republic
- Manpower pledges: total ~4 FTE
- Manpower contribution defines ability to influence decisions
- Specific, defined, responsibilities for each partner (see slides)
- Covering DPM itself and related products (e.g. LFC)
MW is maturing: exploit the collaborative open-source model
- Some components are now "owned" by partners
DPM release strategy
- DPM will release almost everything to EPEL: exceptions are the metapackage and Oracle-related components that cannot be released to EPEL (they will go to the EMI repo)
- First released to EPEL-testing: picked up there by collaboration partners for early testing
- Experience needed with the short time a package stays in EPEL-testing: is it compatible with our testing practices?
- Discussion with EGI so that they base the staged rollout on EPEL-testing rather than the UMD repo
- Discussing per-component releases to allow more frequent updates, but what about the versioning?
Last version about to be released: 1.8.7
- Big improvements to the Puppet config
- DB connection pooling: 3x-5x performance improvement, more clients supported for the same configuration
- StAR producer
- Improved replica handling method
- Brand new Xrootd (3.3) interface based on a DMlite plugin
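As a rough illustration of the connection pooling mentioned above (not DPM's actual implementation, which is internal to the daemon), the performance gain comes from opening a fixed set of database connections once and letting requests borrow and return them, instead of paying the connection setup cost per request:

```python
# Illustrative sketch of DB connection pooling: connections are created once
# and reused by many requests. The class and its API are hypothetical.
import queue

class ConnectionPool:
    def __init__(self, connect, size):
        # Open all connections up front; this is the only time connect() runs.
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self):
        # Borrow a connection; blocks if all of them are currently in use.
        return self._pool.get()

    def release(self, conn):
        # Return the connection for reuse instead of closing it.
        self._pool.put(conn)
```

With such a pool, serving N requests costs `size` connection setups rather than N, which is consistent with supporting more clients for the same configuration.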
12 month roadmap: influenced by users/partners, may change...
- HTTP ecosystem: ensure that HTTP/DAV is a high-performance, fully featured interface that HEP can use
- Impact on many other components, some not related to DPM itself (GFAL, FTS...)
- Management of legacy components: review the need for RFIO and legacy daemons
- New interfaces: GridFTP redirection, NFS 4.1
- Management/performance tools: dpm-drain, rebalancing, I/O limit
- New backends: VFS, S3, HDFS
- S3 of particular interest with DAV access
- Configuration: complete move to Puppet, phase out YAIM
- Will not require a Puppet master infrastructure
- SRM alternative: support all SRM-like actions through alternative interfaces
Discussion
- Andrew: move to HTTP/DAV may be a requirement for sites moving to virtualized infrastructure where you cannot rely on local file systems
Cloud Pre-GDB Summary - M. Jouvin
6 months ago, a "working group" was formed to explore using private/community clouds as an alternative to a grid CE
- No attempt to discuss all possible issues with clouds
- No plan to make the move from a grid CE to a cloud mandatory for a site
- Focus on shared clouds, used by several VOs
Work presented in this area by ATLAS, CMS, LHCb
- ATLAS & CMS deployed private clouds on their HLT farms and plan to continue using them after data taking restarts
- Allowed testing at scale of cloud MW (OpenStack)
- CernVM is the base virtual image for all experiments.
- CVMFS is key component to keep the images small (instantiation time) and reduce the need to update images
- Contextualization (user and site) used/tested by every experiment: amiconfig today because this is what CernVM currently supports, moving to cloud-init.
Graceful VM termination
- Based on the machinefeatures ideas: advertise the VM termination time to the running jobs
- Also provides a shutdown command that can be used by the "VM users"; otherwise the VM is killed
- Wanted by every experiment: avoids wasting resources, allows using resources efficiently until the VM is shut down
- Benefit for sites: they can gracefully shut down resources without impacting job success, leading to better availability/reliability
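The machinefeatures idea can be sketched as follows: the site publishes the planned termination time in a small file, and the payload framework checks it before pulling more work. This is an illustrative sketch, assuming the HEPiX machine/job features layout (a directory pointed to by `$MACHINEFEATURES` containing a `shutdowntime` file with a Unix timestamp); the function names are hypothetical.

```python
# Hypothetical sketch of a pilot honouring the advertised VM shutdown time.
# Assumes the machinefeatures convention: $MACHINEFEATURES/shutdowntime
# holds the planned termination time as a Unix epoch timestamp.
import os
import time

def seconds_until_shutdown(mf_dir=None, now=None):
    """Return seconds left before the advertised shutdown, or None if unset."""
    mf_dir = mf_dir or os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")
    path = os.path.join(mf_dir, "shutdowntime")
    if not os.path.exists(path):
        return None  # site does not advertise a termination time
    with open(path) as f:
        shutdown_epoch = int(f.read().strip())
    return shutdown_epoch - (now if now is not None else time.time())

def should_start_new_payload(estimated_runtime, mf_dir=None, now=None):
    """Only pull more work if it can finish before the VM is killed."""
    left = seconds_until_shutdown(mf_dir, now)
    return left is None or left > estimated_runtime
```

This is what makes the graceful termination mutually beneficial: the VO stops wasting cycles on jobs that cannot finish, and the site can reclaim the slot without killing running work.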
Fair use of shared clouds
- Resources are finite: cannot achieve real elasticity
- Sites are not aware of queuing requests: makes it difficult to rebalance usage between VOs
- New approach based on "credits" has been proposed
- Pretty complex to implement
- Based on some advertisement (cost) by sites: risk of inaccurate values or inconsistencies between sites
- Involves some sort of brokering mechanism: risk of reproducing the issues seen with the WMS
- Keep a process based on some sort of history/accounting
- Implement new request arbitration: (temporarily) reject new requests from VOs above quota to give others a chance
- Build a service allowing a site to submit VMs on behalf of VOs under quota: may be complex to implement (credential delegation). See Vac ideas below.
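The request arbitration idea above can be sketched in a few lines: a VO over its quota gets a temporary rejection and must retry later, leaving room for under-quota VOs. This is purely illustrative; the class, VO names, and quota numbers are all hypothetical.

```python
# Hypothetical sketch of quota-based VM request arbitration on a shared
# cloud: requests from over-quota VOs are (temporarily) rejected.
class CloudArbiter:
    def __init__(self, quotas):
        self.quotas = quotas                      # VO -> max running VMs
        self.running = {vo: 0 for vo in quotas}   # VO -> currently running VMs

    def request_vm(self, vo):
        """Return True if the VO may start a VM now, False to retry later."""
        if self.running[vo] >= self.quotas[vo]:
            return False  # over quota: temporarily rejected, others get a chance
        self.running[vo] += 1
        return True

    def vm_finished(self, vo):
        self.running[vo] -= 1  # frees a slot, possibly for another VO
```

A real implementation would also have to handle the credential delegation problem mentioned above if the site itself instantiates VMs on behalf of under-quota VOs.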
Vac(uum): a middleware developed at Manchester University, implementing the Infrastructure as a Client (IaaC) approach
- Virtual machines are created spontaneously by sites (from the vacuum!), sharing/balancing resources between VOs.
- Based on desktop grid/volunteer computing ideas, like BOINC
- Not as general-purpose as IaaS clouds, but tailored to infrastructures with a central task queue, like pilot frameworks
- Runs the same VMs as an IaaS cloud
- Deployed at 4 UK sites and used successfully by LHCb.
- A demonstrator for graceful termination and fair usage ideas
Discussion
- Luca: I think that many sites would like to keep a queuing infrastructure (batch system)
- MJ: outside a cloud one can put a batch system, but for pure cloud infrastructures this should not be necessary. The WG goal is to explore the possibility of a cloud without a batch system in front of it.
- Jeff Templon (JT): in this case, the mandate of the group, which is to replace grid by cloud, needs to change.
- MJ: the mandate is to assess whether a site may provide computing resources to WLCG experiments using a cloud rather than a grid CE (+ batch system). Does not mean that all sites will have to follow this path. No plan for such a decision in the foreseeable future.
- Helge: was there a discussion about the possibility to run the T0 in the CERN cloud (AI)?
- MJ: no, it was not discussed because nobody asked the question and the meeting rather focused on reviewing what was already working
- Philippe Charpentier (PC): T0 is just another grid site, jobs would run on the CERN cloud without problems.
Actions in Progress
Ops Coordination Report
EMI-3 WN and UI
- Being deployed at a few sites
- One issue found with VOMS proxy: not recommended yet for production
- Validation in progress
SL6 migration
- T1s: 6/15 done, 3 well advanced
- T2: 30/124
- CVMFS issue requiring R/W access: fixed by the latest kernels
glexec: deadline set to Oct. 1st
- ~100 tickets opened
- Many sites coupling glexec support with SL6 migration
SHA-2
- CERN CA available
- DIRAC validated
- Several ATLAS services validated: AGIS, PanDA, DDM
- EGI just started submission of probes to test SHA-2 compliance
- VOMS admin migration required but happening slowly
CVMFS
- Deployment started for ALICE: tickets opened to sites
- Deadline is April 2014, like CMS
- Issue found in 2.1.11: fix will be available soon, wait for it before upgrading
FTS3
- RAL production server ready, CERN one soon
Xrootd
- ~40 sites in AAA and FAX
- Almost all plugins in the WLCG repository
- Sites must register end points in GOC DB
- DPM known issue with only local traffic being reported to monitoring
perfSONAR: sites recommended to upgrade to 3.3
Tracking tools
- Savannah to JIRA migration progressing
- New support unit "Grid Monitoring" to replace SAM and dashboards
ALICE plans
- Increasingly committed to CVMFS
- Working on rationalising the SAM tests: import results from MonALISA
ATLAS
- Simulation validated for multi-core: would like to have access to more multi-core resources
- All sites must deploy http/dav access by September
- All sites shall deploy perfSonar
- Migrating ATLAS central services to OpenStack
- Stress-testing of Rucio to begin this summer
CMS
- Pilot now able to run several single-threaded CMSSW instances
- Commissioning of multithreaded CMSSW expected by the end of the year
- CRAB-PanDA integration open to beta testers
- Xrootd federation: target of 90% of data by September
- Disk/tape separation to be tested in autumn
LHCb
- Introducing T2Ds with disk storage for analysis
- Planning to use perfSONAR data as quality metric for deciding which T2 to use
- Will look at implementing data popularity
- Planning to use new C++11 features requiring the latest gcc, only available on SL6, by the end of the year
IPv6: just started collaborating with HEPiX IPv6 WG
WLCG monitoring: starting to contribute a site perspective
- Call for input by site monitoring experts: urgent feedback needed
- A mailing list (egroup) has been created: site experts are encouraged to join it
- A Twiki page summarizes what is already available
Machine features: new TF created
Conclusion: steady progress of each task force
- End of 2013 is a target date for many of them
MW Lifecycle and Provisioning
Potential Future Needs for WLCG - M. Schulz
Landscape after EMI
- EMI produces UMD releases
- UMD ~= EMIrepo + staged rollout
- INFN (Cristina) populates the EMI repositories with what is put in EPEL by PTs and announces changes
- But no formal EMI QA process applied
How long is this process sustainable and can it continue?
- At least until end of March 2014 (end of Cristina commitment)
ETICS: PTs are no longer using it
- ETICS service will be shut down at the end of August
WLCG repository created
- Managed by Maarten
- HEP_OSLIBS, xrootd plugins...
- Packages that cannot fit in EPEL
- UMD doesn't include them: they are community specific
EPEL: the main package repository, but there are many packages that cannot fit into it
- Packages not compliant with EPEL rules (in particular Java packages)
- Packages with not enough manpower to maintain them in EPEL
Impact of new situation on sites
- Sites running pure UMD: no change
- Site running UMD + WLCGrepo: no change
- Sites installing from EMIrepo: will have to change their installation source after end of March
Specific SW
- dCache: specific release process for T1s, followed by a few T2s
- ARC: releases both to EPEL and EMI, no foreseen impact after the end of EMIrepo maintenance
Open questions
- Long term management of EMI repositories: PT could release directly to UMD but what about QA verification
- Need to ensure that a new version doesn't break other parts of the MW or experiment production; see the effort done for EMI1/2 WN validation
- Need a WLCG-testing as there is an EPEL-testing
- Need some production resources using EPEL-testing to run real use case on new stuff
- UI/WN for WLCG: the current EMI version contains UNICORE, which we don't need, and doesn't contain WLCG-specific packages. Do we want a WLCG WN/UI?
- ARC and OSG: release directly in EPEL
- Planning and PT synchronisation: probably easily done by pre-GDB/GDB, changes driven by the community
- Infrastructure transition: a combination of EGI and WLCG Ops Coord efforts
- WLCG baseline versions
- Release latency with the normal process PT -> EMI -> UMD -> Sites
- Some sites use snapshots of "WLCG versions" (advanced releases)
- EPEL is a moving repo not under our control: our dependencies can move without notice/control
- Could have some control, but this requires investing time/manpower
Still too early to tell whether the current process works
Without UMD, WLCG will need to invest
- UMD covers most of our needs
- Would be good to mirror the EPEL stable/testing structure for UMD and WLCG repositories
Discussion
Tiziana: QA is now handled by PTs by design
- Markus: but there is no requirement to pass QA before pushing something to the repos
Tiziana: would be great to have more WLCG sites participating in the staged rollout activities
- Markus: staged rollout is not doing a full coverage of WLCG use cases
- Peter: EMI would be interested to see WLCG testing fed as input to the staged rollout (2 week deadline)
- Markus: could be feasible if we manage to configure some resources to automatically test what is in EPEL-testing. Still some work needed from volunteer sites and experiments. Need to decide as a community whether we go for it: there will be a clear benefit if we succeed.
- May take advantage of experiment testing infrastructures like HammerCloud
- Need a clear statement from MB before starting the real work: will be discussed at next MB
OSG Update - B. Bockelman
A few organizational changes and new faces in the last 18 months
- Executive Director: Lothar Bauerdick
- Technology Area: Brian Bockelman
- Networking Area: Shawn McKee
- Production: Rob Quick (interim)
- Release Team: Tim Theisen
- Campus Grids: Rob Gardner
Latest release (3.1.21) released yesterday
- Available mainly as RPM, tarball only for clients
- OSG 3 is a successor for OSG 1.x and VDT 2.x
- 1.2 support dropped end of May
OSG moving to a predictable release schedule: second Tuesday of every month
- Second possible slot for urgent updates if necessary: 4th Tuesday
SHA-2 readiness: the whole OSG stack tests fine with SHA-2 except the dCache srmclient
- OSG is running several versions behind the current srmclient: hope that an upgrade will fix the problem (missed the last release)
Java oddities: OSG has been working to rebase the Java stack on OpenJDK 1.7
- But RH packages bring some unwanted dependencies: tomcat6 requires Java 6 despite running fine with 7; log4j brings in GCJ 1.5...
- Java components make service releases very large: the OSG CE goes from 10 MB without Java components to 400 MB with Java dependencies
- Hope that RHEL7 will improve the situation
Other major projects in progress
- Upgrade WLCG sites to HDFS2
- WN into CVMFS
- Collecting goals/requirements for next major release (3.2)
- Major release planned once a year
HTCondor-CE: continuing to work with Condor team to release a HTCondor-based CE
- Release delayed to allow HTCondor 8.0.0 to get into
- Long rollout expected: no urgent drivers, but will allow consolidating and reducing the components to maintain
OSG security: concerns about the new automated banning requirement, studying possible ways to implement it
- No solution ready out-of-the-box
- One possible solution is FNAL SAZ
- Would like to see deployment timelines clarified: what are they for EGI sites?
CVMFS service started for OSG: OASIS
- Includes a Stratum-0 (login host) and a Stratum-1 server
- Open for OSG VOs who don't have a CVMFS server: not intended for WLCG
- Monitoring and Stratum-1 shared with WLCG
- Studying deploying the WN through OASIS
OSG Networking
- Encouraging deployment of perfSONAR-PS across OSG sites
- Working to aggregate measurements into a centralized dashboard: dashboard intended to be reusable outside OSG, looking forward to discussing this with WLCG
Other non WLCG activities
- Job submission over SSH
- HTCondor backends for commonly used science apps like R
- Meeting traceability requirements without use of grid certificates
- Providing a submission env for NSF XD (XSEDE) allocations
- Improving the glideinWMS submit interface
OSG usage: significant non-WLCG and non-HEP workload
- Peak is ~40% of non WLCG workload
Started an OSGVO as an umbrella for users without a traditional grid VO
- In particular for short-lived projects where creating a VO doesn't really make sense
- Fulfilling traceability/security requirements for these users/projects
Discussion
- Tiziana about automatic banning
- No firm timeline yet: end of year is probably a reasonable one...
- Will not impose a deadline that cannot be met because technology to do it is not available
- Central ARGUS server just set up; will produce a file format in addition to direct publication through ARGUS
- On central banning, implementation will be discussed next week: ARGUS technology, translation of the banned list for services/NGIs that don't use ARGUS.
- Timescale of couple of weeks for decision.
- What about joint XACML/SAML work (interoperability profile)?
- ARGUS never implemented interoperability profile, also some interpretation disagreements
- OSG CREAM CE evaluation status?
- Was done in June-July 2012, but in the end focus shifted to the Condor-based CE.