Summary of GDB meeting, July 10, 2013 (CERN)
Agenda
https://indico.cern.ch/conferenceDisplay.py?confId=197806
Welcome - M. Jouvin
Michel insists on getting more contributors to the summary effort (note takers)
- Every major region/country should contribute...
No GDB in August
No pre-GDB before December: no possible slot
WLCG workshop proposed November 12-13: general agreement from participants
- Clash with CMS offline SW week: to be sorted out quickly
- Proposal made later during GDB to hold it Nov 11-12 (Monday-Tuesday)
- Need to receive asap proposals for topics to discuss: idea is to have a "thematic workshop" rather than a review of everything
Security for Collaborating Infrastructures - D. Kelsey
Trust: reliability, but even more predictability, is important for IT operations.
Long history of building trust through policy across "grid infrastructures" (EGEE, EGI, WLCG, ...)
- JSPG initiative and work
- Common Grid Acceptable Use Policy defined: signed by users when joining a VO
Many new distributed computing initiatives nowadays (clouds, HPC, ...)
Security of Collaborating Infrastructures: an initiative built out of EGEE involving security officers from many initiatives/infrastructures
- EGI, OSG, EUDAT, WLCG, CHAIN, XSEDE...
- Developing a trust framework to manage cross-infrastructure security risks
- Implementations and probably policy wording don't have to be the same
- WLCG is an obvious case
SCI document v1 submitted to ISGC proceedings
- New meeting since then: new version underway
- Defines a number of requirements in 6 areas
- Operational Security
- Incident response
- Traceability: a major requirement...
- Participant responsibilities: users, communities/VOs, resource providers. Different structure of responsibilities across various DCIs
- Legal issues
- Protection and processing of personal data
- 4 levels defined to describe how these requirements are met
- Based on current best-practices: not a major change for WLCG
- Documents are/shall be technology agnostic
Future plans
- Contacts driven by authors with their infrastructure for comments and feedback
Dave will advertise to the list when the new version is available for further discussions
Discussion
- Lothar: traceability could be implemented not only by the infrastructure but also by the overlay system. Preparing a document on this.
- Is formal feedback expected from WLCG? Not presently: must wait for the final document. At that point official MB feedback will be requested.
DPM Collaboration and Plans
DPM Collaboration created as part of the sustainability plan for DPM
- Formally created with members from CERN, France, Italy, Taiwan, Japan, UK, Czech Republic
- Manpower pledges: total ~4 FTE
- Manpower contribution defines ability to influence decisions
- Specific, defined, responsibilities for each partner (see slides)
- Covering DPM itself and related products (e.g. LFC)
MW is maturing: exploit the collaborative open-source model
- Some components are now "owned" by partners
DPM release strategy
- DPM will release almost everything to EPEL: exceptions are the metapackage and Oracle-related components that cannot be released to EPEL (they will go to the EMI repo)
- First released to EPEL-testing: picked up there by collaboration partners for early testing
- Experience needed with the short time a package stays in EPEL-testing: is it compatible with our testing practices?
- Discussion with EGI so that they base the staged rollout on EPEL-testing rather than the UMD repo
- Discussing per-component releases to allow more frequent updates, but what about the versioning?
Last version about to be released: 1.8.7
- Big improvements to the Puppet config
- DB connection pooling: 3x-5x performance improvement, more clients supported for the same configuration
- StAR producer
- Improved replica handling method
- Brand new Xrootd (3.3) interface based on a DMlite plugin
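As a rough illustration of the connection pooling mentioned above (not DPM's actual implementation, which is internal to the daemon), the performance gain comes from opening a fixed set of database connections once and letting requests borrow and return them, instead of paying the connection setup cost per request:

```python
# Illustrative sketch of DB connection pooling: connections are created once
# and reused by many requests. The class and its API are hypothetical.
import queue

class ConnectionPool:
    def __init__(self, connect, size):
        # Open all connections up front; this is the only time connect() runs.
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self):
        # Borrow a connection; blocks if all of them are currently in use.
        return self._pool.get()

    def release(self, conn):
        # Return the connection for reuse instead of closing it.
        self._pool.put(conn)
```

With such a pool, serving N requests costs `size` connection setups rather than N, which is consistent with supporting more clients for the same configuration.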
12 month roadmap: influenced by users/partners, may change...
- HTTP ecosystem: ensure that HTTP/DAV is a high-performance, fully featured interface that HEP can use
- Impact on many other components, some not related to DPM itself (GFAL, FTS...)
- Management of legacy components: review the need for RFIO and legacy daemons
- New interfaces: GridFTP redirection, NFS 4.1
- Management/performance tools: dpm-drain, rebalancing, I/O limit
- New backends: VFS, S3, HDFS
- S3 of particular interest with DAV access
- Configuration: complete move to Puppet, phase out YAIM
- Will not require a Puppet master infrastructure
- SRM alternative: support all SRM-like actions through alternative interfaces
Discussion
- Andrew: move to HTTP/DAV may be a requirement for sites moving to virtualized infrastructure where you cannot rely on local file systems
Cloud Pre-GDB Summary - M. Jouvin
6 months ago, a "working group" was formed to explore using private/community clouds as an alternative to a grid CE
- No attempt to discuss all possible issues with clouds
- No plan to make the move from a grid CE to a cloud mandatory for a site
- Focus on shared clouds, used by several VOs
Work presented in this area by ATLAS, CMS, LHCb
- ATLAS & CMS deployed private clouds on their HLT farms and plan to continue using them after data taking restarts
- Allowed testing at scale of cloud MW (OpenStack)
- CernVM is the base virtual image for all experiments.
- CVMFS is key component to keep the images small (instantiation time) and reduce the need to update images
- Contextualization (user and site) used/tested by every experiment: amiconfig today because this is what CernVM currently supports, moving to cloud-init.
Graceful VM termination
- Based on the machinefeatures ideas: advertise the VM termination time to the running jobs
- Also provides a shutdown command that can be used by the "VM users"; otherwise the VM is killed
- Wanted by every experiment: avoids wasting resources, allows using resources efficiently until the VM is shut down
- Benefit for sites: they can gracefully shut down resources without impacting job success, leading to better availability/reliability
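The machinefeatures idea can be sketched as follows: the site publishes the planned termination time in a small file, and the payload framework checks it before pulling more work. This is an illustrative sketch, assuming the HEPiX machine/job features layout (a directory pointed to by `$MACHINEFEATURES` containing a `shutdowntime` file with a Unix timestamp); the function names are hypothetical.

```python
# Hypothetical sketch of a pilot honouring the advertised VM shutdown time.
# Assumes the machinefeatures convention: $MACHINEFEATURES/shutdowntime
# holds the planned termination time as a Unix epoch timestamp.
import os
import time

def seconds_until_shutdown(mf_dir=None, now=None):
    """Return seconds left before the advertised shutdown, or None if unset."""
    mf_dir = mf_dir or os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")
    path = os.path.join(mf_dir, "shutdowntime")
    if not os.path.exists(path):
        return None  # site does not advertise a termination time
    with open(path) as f:
        shutdown_epoch = int(f.read().strip())
    return shutdown_epoch - (now if now is not None else time.time())

def should_start_new_payload(estimated_runtime, mf_dir=None, now=None):
    """Only pull more work if it can finish before the VM is killed."""
    left = seconds_until_shutdown(mf_dir, now)
    return left is None or left > estimated_runtime
```

This is what makes the graceful termination mutually beneficial: the VO stops wasting cycles on jobs that cannot finish, and the site can reclaim the slot without killing running work.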
Fair use of shared clouds
- Resources are finite: cannot achieve real elasticity
- Sites are not aware of queuing requests: makes it difficult to rebalance usage between VOs
- New approach based on "credits" has been proposed
- Pretty complex to implement
- Based on some advertisement (cost) by sites: risk of inaccurate values or inconsistencies between sites
- Involves some sort of brokering mechanism: risk of reproducing the issues seen with the WMS
- Keep a process based on some sort of history/accounting
- Implement new request arbitration: (temporarily) reject new requests from VOs above quota to give others a chance
- Build a service allowing a site to submit VMs on behalf of VOs under quota: may be complex to implement (credential delegation). See Vac ideas below.
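The request arbitration idea above can be sketched in a few lines: a VO over its quota gets a temporary rejection and must retry later, leaving room for under-quota VOs. This is purely illustrative; the class, VO names, and quota numbers are all hypothetical.

```python
# Hypothetical sketch of quota-based VM request arbitration on a shared
# cloud: requests from over-quota VOs are (temporarily) rejected.
class CloudArbiter:
    def __init__(self, quotas):
        self.quotas = quotas                      # VO -> max running VMs
        self.running = {vo: 0 for vo in quotas}   # VO -> currently running VMs

    def request_vm(self, vo):
        """Return True if the VO may start a VM now, False to retry later."""
        if self.running[vo] >= self.quotas[vo]:
            return False  # over quota: temporarily rejected, others get a chance
        self.running[vo] += 1
        return True

    def vm_finished(self, vo):
        self.running[vo] -= 1  # frees a slot, possibly for another VO
```

A real implementation would also have to handle the credential delegation problem mentioned above if the site itself instantiates VMs on behalf of under-quota VOs.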
Vac(uum): a middleware developed at Manchester University, implementing the Infrastructure as a Client (IaaC) approach
- Virtual machines are created spontaneously by sites (from the vacuum!), sharing/balancing resources between VOs.
- Based on desktop grid/volunteer computing ideas, like BOINC
- Not as general-purpose as IaaS clouds, but tailored to infrastructures with a central task queue, like pilot frameworks
- Runs the same VMs as an IaaS cloud
- Deployed at 4 UK sites and used successfully by LHCb.
- A demonstrator for graceful termination and fair usage ideas
Discussion
- Luca: I think that many sites would like to keep a queuing infrastructure (batch system)
- MJ: outside a cloud one can put a batch system, but for pure cloud infrastructures this should not be necessary. The WG goal is to explore the possibility of a cloud without a batch system in front of it.
- Jeff Templon (JT): in this case, the mandate of the group, which is to replace grid by cloud, needs to change.
- MJ: the mandate is to assess whether a site may provide computing resources to WLCG experiments using a cloud rather than a grid CE (+ batch system). Does not mean that all sites will have to follow this path. No plan for such a decision in the foreseeable future.
- Helge: was there a discussion about the possibility to run the T0 in the CERN cloud (AI)?
- MJ: no, it was not discussed because nobody asked the question and the meeting rather focused on reviewing what was already working
- Philippe Charpentier (PC): T0 is just another grid site, jobs would run on the CERN cloud without problems.
Actions in Progress
Ops Coordination Report
EMI-3 WN and UI
- Being deployed at a few sites
- One issue found with VOMS proxy: not recommended yet for production
- Validation in progress
SL6 migration
- T1s: 6/15 done, 3 well advanced
- T2: 30/124
- CVMFS issue requiring R/W access: fixed by the latest kernels
glexec: deadline set to Oct. 1st
- ~100 tickets opened
- Many sites coupling glexec support with SL6 migration
SHA-2
- CERN CA available
- DIRAC validated
- Several ATLAS services validated: AGIS, PanDA, DDM
- EGI just started submission of probes to test SHA-2 compliance
- VOMS admin migration required but happening slowly
CVMFS
- Deployment started for ALICE: tickets opened to sites
- Deadline is April 2014, like CMS
- Issue found in 2.1.11: fix will be available soon, wait for it before upgrading
FTS3
- RAL production server ready, CERN one soon
Xrootd
- ~40 sites in AAA and FAX
- Almost all plugins in the WLCG repository
- Sites must register end points in GOC DB
- DPM known issue with only local traffic being reported to monitoring
perfSONAR: sites recommended to upgrade to 3.3
Tracking tools
- Savannah to JIRA migration progressing
- New support unit "Grid Monitoring" to replace SAM and dashboards
ALICE plans
- Increasingly committed to CVMFS
- Working on rationalising the SAM tests: import results from MonALISA
ATLAS
- Simulation validated for multi-core: would like to have access to more multi-core resources
- All sites must deploy http/dav access by September
- All sites shall deploy perfSonar
- Migrating ATLAS central services to OpenStack
- Stress-testing of Rucio to begin this summer
CMS
- Pilot now able to run several single-threaded CMSSW instances
- Commissioning of multithreaded CMSSW expected by the end of the year
- CRAB-PanDA integration open to beta testers
- Xrootd federation: target of 90% of data by September
- Disk/tape separation to be tested in autumn
LHCb
- Introducing T2Ds with disk storage for analysis
- Planning to use perfSONAR data as quality metric for deciding which T2 to use
- Will look at implementing data popularity
- Planning to use new C++11 features requiring the latest gcc, only available on SL6, by the end of the year
IPv6: just started collaborating with HEPiX IPv6 WG
WLCG monitoring: starting to contribute a site perspective
- Call for input by site monitoring experts: urgent feedback needed
- A mailing list (egroup) has been created: site experts are encouraged to join it
- A Twiki page summarizes what is already available
Machine features: new TF created
Conclusion: steady progress of each task force
- End of 2013 is a target date for many of them
MW Lifecycle and Provisioning
Potential Future Needs for WLCG - M. Schulz
Landscape after EMI
- EMI produces UMD releases
- UMD ~= EMIrepo + staged rollout
- INFN (Cristina) populates the EMI repositories with what is put in EPEL by PTs and announces changes
- But no formal EMI QA process applied
How long is this process sustainable and can it continue?
- At least until end of March 2014 (end of Cristina commitment)
ETICS: PTs are no longer using it
- ETICS service will be shut down at the end of August
WLCG repository created
- Managed by Maarten
- HEP_OSLIBS, xrootd plugins...
- Packages that cannot fit in EPEL
- UMD doesn't include them: they are community specific
EPEL: the main package repository, but there are many packages that cannot fit into it
- Packages not compliant with EPEL rules (in particular Java packages)
- Packages with not enough manpower to maintain them in EPEL
Impact of new situation on sites
- Sites running pure UMD: no change
- Site running UMD + WLCGrepo: no change
- Sites installing from EMIrepo: will have to change their installation source after end of March
Specific SW
- dCache: specific release process for T1s, followed by a few T2s
- ARC: releases both to EPEL and EMI, no foreseen impact after the end of EMIrepo maintenance
Open questions
- Long term management of EMI repositories: PT could release directly to UMD but what about QA verification
- Need to ensure that a new version doesn't break other parts of the MW or experiment production; see the effort done for EMI1/2 WN validation
- Need a WLCG-testing as there is an EPEL-testing
- Need some production resources using EPEL-testing to run real use case on new stuff
- UI/WN for WLCG: the current EMI version contains UNICORE, which we don't need, and doesn't contain WLCG-specific packages. Do we want a WLCG WN/UI?
- ARC and OSG: release directly in EPEL
- Planning and PT synchronisation: probably easily done by pre-GDB/GDB, changes driven by the community
- Infrastructure transition: a combination of EGI and WLCG Ops Coord efforts
- WLCG baseline versions
- Release latency with the normal process PT -> EMI -> UMD -> Sites
- Some sites use snapshots of "WLCG versions" (advanced releases)
- EPEL is a moving repo not under our control: our dependencies can move without notice/control
- Could have some control, but this requires investing time/manpower
Still too early to tell whether the current process works
Without UMD, WLCG will need to invest
- UMD covers most of our needs
- Would be good to mirror the EPEL stable/testing structure for UMD and WLCG repositories
Discussion
Tiziana: QA is now handled by PTs by design
- Markus: but there is no requirement to pass QA before pushing something to the repos
Tiziana: would be great to have more WLCG sites participating in the staged rollout activities
- Markus: staged rollout is not doing a full coverage of WLCG use cases
- Peter: EMI would be interested to see WLCG testing fed as input to the staged rollout (2 week deadline)
- Markus: could be feasible if we manage to configure some resources to automatically test what is in EPEL-testing. Still some work needed from volunteer sites and experiments. Need to decide as a community whether we go for it: there will be a clear benefit if we succeed.
- May take advantage of experiment testing infrastructures like HammerCloud
- Need a clear statement from MB before starting the real work: will be discussed at next MB
OSG Update - B. Bockelman
A few organizational changes and new faces in the last 18 months
- Executive Director: Lothar Bauerdick
- Technology Area: Brian Bockelman
- Networking Area: Shawn McKee
- Production: Rob Quick (interim)
- Release Team: Tim Theisen
- Campus Grids: Rob Gardner
Latest release (3.1.21) released yesterday
- Available mainly as RPM, tarball only for clients
- OSG 3 is a successor for OSG 1.x and VDT 2.x
- 1.2 support dropped end of May
OSG moving to a predictable release schedule: second Tuesday of every month
- Second possible slot for urgent updates if necessary: 4th Tuesday
SHA-2 readiness: the whole OSG stack tests fine with SHA-2 except the dCache srmclient
- OSG is running several versions behind the current srmclient: hope that an upgrade will fix the problem (missed the last release)
Java oddities: OSG has been working to rebase the Java stack on OpenJDK 1.7
- But RH packages bring some unwanted dependencies: tomcat6 requires Java 6 despite running fine with 7; log4j brings in GCJ 1.5...
- Java components make service releases very large: the OSG CE goes from 10 MB without Java components to 400 MB with Java dependencies
- Hope that RHEL7 will improve the situation
Other major projects in progress
- Upgrade WLCG sites to HDFS2
- WN into CVMFS
- Collecting goals/requirements for next major release (3.2)
- Major release planned once a year
HTCondor-CE: continuing to work with Condor team to release a HTCondor-based CE
- Release delayed to allow HTCondor 8.0.0 to get into
- Long rollout expected: no urgent drivers, but will allow consolidating and reducing the components to maintain
OSG security: concerns about the new automated banning requirement, studying possible ways to implement it
- No solution ready out-of-the-box
- One possible solution is FNAL SAZ
- Would like to see deployment timelines clarified: what are they for EGI sites?
CVMFS service started for OSG: OASIS
- Includes a Stratum-0 (login host) and a Stratum-1 server
- Open for OSG VOs who don't have a CVMFS server: not intended for WLCG
- Monitoring and Stratum-1 shared with WLCG
- Studying deploying the WN through OASIS
OSG Networking
- Encouraging deployment of perfSONAR-PS across OSG sites
- Working to aggregate measurements into a centralized dashboard: dashboard intended to be reusable outside OSG, looking forward to discussing this with WLCG
Other non WLCG activities
- Job submission over SSH
- HTCondor backends for commonly used science apps like R
- Meeting traceability requirements without use of grid certificates
- Providing a submission env for NSF XD (XSEDE) allocations
- Improving the glideinWMS submit interface
OSG usage: significant non-WLCG and non-HEP workload
- Peak is ~40% of non WLCG workload
Started an OSGVO as an umbrella for users without a traditional grid VO
- In particular for short-lived projects where creating a VO doesn't really make sense
- Fulfilling traceability/security requirements for these users/projects
Discussion
- Tiziana about automatic banning
- No firm timeline yet: end of year is probably a reasonable one...
- Will not impose a deadline that cannot be met because technology to do it is not available
- Central ARGUS server just set up; will produce a file format in addition to direct publication through ARGUS
- On central banning, implementation will be discussed next week: ARGUS technology, translation of the banned list for services/NGIs that don't use ARGUS.
- Timescale of couple of weeks for decision.
- What about joint XACML/SAML work (interoperability profile)?
- ARGUS never implemented interoperability profile, also some interpretation disagreements
- OSG CREAM CE evaluation status?
- Was done in June-July 2012, but in the end focus shifted to the Condor-based CE.