Summary of GDB meeting, July 10, 2013 (CERN)

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=197806

Welcome - M. Jouvin

Michel insists on getting more contributors to the summary effort (note takers)

  • Every major region/country should contribute...

No GDB in August

No pre-GDB before December: no possible slot

WLCG workshop proposed for November 12-13: general agreement from participants

  • Clash with CMS offline SW week: to be sorted out quickly
    • Proposal made later during GDB to hold it Nov 11-12 (Monday-Tuesday)
  • Need to receive proposals for topics to discuss as soon as possible: the idea is to have a "thematic workshop" rather than a review of everything

Security for Collaborating Infrastructures - D. Kelsey

Trust: reliability, but even more predictability, is important for IT operations.

Long history of building trust through policy across "grid infrastructures" (EGEE, EGI, WLCG, ...)

  • JSPG initiative and work
  • Common Grid Acceptable Use Policy defined: signed by users when joining a VO

Many new distributed computing initiatives nowadays (clouds, HPC, ...)

Security of Collaborating Infrastructures: an initiative built out of EGEE involving security officers from many initiatives/infrastructures

  • EGI, OSG, EUDAT, WLCG, CHAIN, XSEDE...
  • Developing a trust framework to manage cross-infrastructure security risks
    • Implementations and probably policy wording don't have to be the same
  • WLCG is an obvious case

SCI document v1 submitted to ISGC proceedings

  • New meeting since then: new version underway
  • Defines a number of requirements in 6 areas
    • Operational Security
    • Incident response
    • Traceability: a major requirement...
    • Participant responsibilities: users, communities/VOs, resource providers. Different structure of responsibilities across various DCIs
    • Legal issues
    • Protection and processing of personal data
  • 4 levels defined to describe how these requirements are met
  • Based on current best-practices: not a major change for WLCG
  • Documents are/shall be technology agnostic

Future plans

  • Contacts with each infrastructure driven by the authors, to gather comments and feedback

Dave will advertise to the list when the new version is available for further discussions

Discussion

  • Lothar: not only the infrastructure but also the overlay system could implement traceability. Preparing a document on this.
  • Is formal feedback expected from WLCG? Not presently; the final document must be awaited, at which point official MB feedback will be requested.

DPM Collaboration and Plans

DPM Collaboration created as part of the sustainability plan for DPM

  • Formally created with members from CERN, France, Italy, Taiwan, Japan, UK, and the Czech Republic
  • Manpower pledges: total ~4 FTE
    • Manpower contribution defines ability to influence decisions
  • Specific, defined, responsibilities for each partner (see slides)
  • Covering DPM itself and related products (e.g. LFC)

MW is growing up: exploit the collaborative open-source model

  • Some components are now "owned" by partners

DPM release strategy

  • DPM will release almost everything to EPEL: exceptions are the metapackage and the Oracle components, which cannot be released to EPEL (they will be released to the EMI repo)
  • First released to EPEL-testing: picked up there by collaboration partners for early testing
    • Experience needed with the short time a package stays in EPEL-testing: is it compatible with our testing practices?
  • Discussion with EGI so that they base the staged rollout on EPEL-testing rather than the UMD repo
  • Discussing per-component releases to allow more frequent updates, but the question of versioning remains open

Last version about to be released: 1.8.7

  • Big improvements to the Puppet config
  • DB connection pooling: 3x-5x performance improvement, more clients supported for the same configuration
  • StAR producer
  • Improved replica handling method
  • Brand new Xrootd (3.3) interface based on a DMlite plugin

12 month roadmap: influenced by users/partners, may change...

  • HTTP ecosystem: ensure that HTTP/DAV is a high-performance, fully featured solution that HEP can use
    • Impact on many other components, some not related to DPM itself (GFAL, FTS...)
  • Management of legacy components: review the need for RFIO and legacy daemons
  • New interfaces: GridFTP redirection, NFS 4.1
  • Management/performance tools: dpm-drain, rebalancing, I/O limit
  • New backends: VFS, S3, HDFS
    • S3 of particular interest with DAV access
  • Configuration: complete move to Puppet, phase out YAIM
    • Will not require a Puppet master infrastructure
  • SRM alternative: support all SRM-like actions through alternative interfaces

Discussion

  • Andrew: the move to HTTP/DAV may be a requirement for sites moving to virtualized infrastructures where you cannot rely on local file systems (see the sketch below)
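
To illustrate what HTTP/DAV access to a DPM endpoint could look like from a client, here is a minimal Python sketch using the requests library with a grid proxy. The endpoint URL, file path and proxy location are hypothetical, and in practice tools such as davix or GFAL handle proxy certificate chains more robustly than plain Python SSL.

    # Minimal sketch: read a file from a DPM HTTP/DAV endpoint with a grid proxy.
    # The endpoint URL, file path and proxy path are examples, not real services.
    import requests

    PROXY = "/tmp/x509up_u1000"   # example location of the user's proxy
    URL = "https://dpm.example.org/dpm/example.org/home/myvo/data/file.root"

    # Present the proxy as both client certificate and key; verify the server
    # against the local CA directory (adjust to the site's IGTF CA location).
    resp = requests.get(URL, cert=(PROXY, PROXY),
                        verify="/etc/grid-security/certificates", stream=True)
    resp.raise_for_status()

    with open("file.root", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):
            out.write(chunk)
    print("downloaded file.root,", resp.headers.get("Content-Length"), "bytes")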

Cloud Pre-GDB Summary - M. Jouvin

6 months ago, a "working group" was formed to explore using private/community clouds as an alternative to a grid CE

  • No attempt to discuss all possible issues with clouds
  • No plan to make the move from a grid CE to a cloud mandatory for a site
  • Focus on shared clouds, used by several VOs

Work presented in this area by ATLAS, CMS, LHCb

  • ATLAS & CMS deployed private clouds on their HLT farms and plan to continue using them after data taking restarts
    • Allowed testing at scale of cloud MW (OpenStack)
  • CernVM is the base virtual image for all experiments.
  • CVMFS is a key component to keep the images small (faster instantiation) and to reduce the need to update images
  • Contextualisation (user and site) used/tested by every experiment: amiconfig today because this is what is currently supported by CernVM; moving to cloud-init.

Graceful VM termination

  • Based on the machinefeatures ideas to advertise the VM termination time to the running jobs (see the sketch after this list)
    • Also provides a shutdown command that can be used by the "VM users"; otherwise the VM is killed
  • Wanted by every experiment: avoids wasting resources and allows resources to be used efficiently until the VM is shut down
    • LHCb "Job masonry"
  • Benefit for sites: they can gracefully shutdown resources without impacting job success, leading to better availability/reliability
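
As an illustration of the graceful-termination idea, a payload or pilot inside the VM could check the advertised termination time before starting more work. The Python sketch below assumes the machine features convention of a $MACHINEFEATURES directory containing a shutdowntime file with a Unix timestamp; the path, the fallback and the payload length are examples only.

    # Sketch: decide whether to start another payload before the advertised shutdown.
    # Assumes $MACHINEFEATURES/shutdowntime holds the termination time as a Unix timestamp.
    import os
    import time

    def seconds_until_shutdown():
        """Return seconds left before the advertised shutdown, or None if not advertised."""
        mf_dir = os.environ.get("MACHINEFEATURES", "/etc/machinefeatures")
        try:
            with open(os.path.join(mf_dir, "shutdowntime")) as f:
                shutdown_epoch = int(f.read().strip())
        except (IOError, ValueError):
            return None                      # no termination time advertised
        return shutdown_epoch - int(time.time())

    ESTIMATED_PAYLOAD_SECONDS = 3 * 3600     # example: a three-hour payload

    left = seconds_until_shutdown()
    if left is None or left > ESTIMATED_PAYLOAD_SECONDS:
        print("enough time left: start another payload")
    else:
        print("VM will be terminated soon: drain and shut down cleanly")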

Fair use of shared clouds

  • Resources are finite: cannot achieve real elasticity
  • Sites are not aware of queued requests: this makes it difficult to rebalance usage between VOs
  • New approach based on "credits" has been proposed
    • Pretty complex to implement
    • Based on some advertisement (cost) by sites: risk of inaccurate values or inconsistencies between sites
    • Involves some sort of brokering mechanism: risk of reproducing the issues seen with the WMS
  • Keep a process based on some sort of history/accounting
    • Implement new request arbitration: (temporarily) reject new requests by VOs above quota to give others a chance (see the sketch after this list)
    • Build a service allowing a site to submit VMs on behalf of VOs under quota: may be complex to implement (credential delegation). See Vac ideas below.
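
A toy sketch of the request-arbitration idea mentioned above (not an existing tool): the site temporarily rejects new VM requests from VOs that are already above their quota, so that VOs below quota get a chance. Quotas and running counts are invented numbers.

    # Toy illustration of quota-based VM request arbitration (invented numbers).
    quotas = {"atlas": 100, "cms": 100, "lhcb": 50}    # per-VO VM quotas at the site
    running = {"atlas": 120, "cms": 60, "lhcb": 10}    # VMs currently running per VO

    def accept_request(vo):
        """Accept a new VM request only while the VO is below its quota."""
        return running.get(vo, 0) < quotas.get(vo, 0)

    for vo in sorted(quotas):
        decision = "accept" if accept_request(vo) else "reject for now (over quota)"
        print(vo, "->", decision)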

Vac(uum): a middleware developed at Manchester University, implementing the Infrastructure as a Client (IaaC) approach

  • Virtual machines are created spontaneously by sites (from the vacuum!), sharing/balancing resources between VOs.
    • Based on desktop grid/volunteer computing ideas, like BOINC
    • Not as pervasive as IaaS clouds but customized for infrastructures with a central queue like pilot frameworks
  • Runs the same VMs as an IaaS cloud
  • Deployed at 4 UK sites and used successfully by LHCb.
  • A demonstrator for graceful termination and fair usage ideas

Discussion

  • Luca: I think that many sites would like to keep a queuing infrastructure (batch system)
    • MJ: outside a cloud one can put a batch system, but for pure cloud infrastructures this should not be necessary. The WG goal is to explore the possibility of a cloud without a batch system in front of it.
  • Jeff Templon (JT): in this case, need to change mandate of group, which is to replace grid by cloud.
    • MJ: Mandate is to assess whether a site may provide computing resources to WLCG experiments using a cloud rather than a grid CE (+ batch system). Does not mean that all sites will have to follow this path. No plan for such a decision in the foreseeable future.
  • Helge: was there a discussion about the possibility of running the T0 in the CERN cloud (AI)?
    • MJ: no, it was not discussed because nobody asked the question and the meeting was rather focused on reviewing what was already working
    • Philippe Charpentier (PC): T0 is just another grid site, jobs would run on the CERN cloud without problems.

Actions in Progress

Ops Coordination Report

EMI-3 WN and UI

  • Being deployed at a few sites
  • One issue found with VOMS proxy: not recommended yet for production
  • Validation in progress

SL6 migration

  • T1s: 6/15 done, 3 well advanced
  • T2: 30/124
  • CVMFS issue requiring R/W access fixed by the latest kernels

glexec: deadline set to Oct. 1st

  • ~100 tickets opened
  • Many sites coupling glexec support with SL6 migration

SHA-2

  • CERN CA available
  • DIRAC validated
  • Several ATLAS services validated: AGIS, PanDA, DDM
  • EGI just started submission of probes to test SHA-2 compliance (a minimal local check is sketched after this list)
  • VOMS admin migration required but happening slowly
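
For a quick local check of whether a given certificate is SHA-2 signed, one can simply inspect its signature algorithm with openssl, as in the Python sketch below. The certificate path is an example, and the EGI probes mentioned above do a more complete end-to-end test.

    # Sketch: report whether a certificate uses a SHA-2 family signature algorithm.
    import subprocess

    CERT = "/etc/grid-security/hostcert.pem"   # example certificate path

    text = subprocess.check_output(
        ["openssl", "x509", "-in", CERT, "-noout", "-text"]).decode()

    sig_lines = [l.strip() for l in text.splitlines() if "Signature Algorithm" in l]
    algorithm = sig_lines[0].split(":", 1)[1].strip() if sig_lines else "unknown"

    if any(d in algorithm.lower() for d in ("sha256", "sha384", "sha512")):
        print(CERT, "uses a SHA-2 signature:", algorithm)
    else:
        print(CERT, "is not SHA-2 signed:", algorithm)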

CVMFS

  • Deployment started for ALICE: tickets open to sites
    • Deadline is April 2014, like CMS
  • Issue found in 2.1.11: a fix will be available soon; wait for it before upgrading

FTS3

  • RAL production server ready, CERN one soon

Xrootd

  • ~40 sites in AAA and FAX
  • Almost all plugins in the WLCG repository
  • Sites must register end points in GOC DB
  • Known DPM issue with only local traffic being reported to monitoring

perfSONAR: sites recommended to upgrade to 3.3

Tracking tools

  • Savannah to JIRA migration progressing
  • New support unit "Grid Monitoring" to replace SAM and dashboards

ALICE plans

  • Increasingly committed to CVMFS
  • Working on rationalising the SAM tests: import results from MonALISA

ATLAS

  • Simulation validated for multi-core: would like to have access to more multi-core resources
  • All sites must deploy http/dav access by September
  • All sites shall deploy perfSonar
  • Migrating ATLAS central services to OpenStack
  • Stress-testing of Rucio to begin this summer

CMS

  • Pilot now able to run several single-threaded CMSSW jobs
  • Commissioning of multi-threaded CMSSW expected by the end of the year
  • CRAB-PanDA integration open to beta testers
  • Xrootd federation: target of 90% of data by September
  • Disk/tape separation to be tested in autumn

LHCb

  • Introducing T2Ds with disk storage for analysis
  • Planning to use perfSONAR data as quality metric for deciding which T2 to use
  • Will look at implementing data popularity
  • Planning to use new C++11 features requiring the latest gcc, only available on SL6, by the end of the year

IPv6: just started collaborating with HEPiX IPv6 WG

WLCG monitoring: starting to contribute a site perspective

  • Call for input by site monitoring experts: urgent feedback needed
  • A mailing list (e-group) has been created: site experts are encouraged to join it
  • A Twiki page summarizes what is already available

Machine features: new TF created

Conclusion: steady progress of each task force

  • End of 2013 is a target date for many of them

MW Lifecycle and Provisioning

Potential Future Needs for WLCG - M. Schulz

Landscape after EMI

  • EMI produces UMD releases
    • UMD ~= EMIrepo + staged rollout
  • INFN (Cristina) populates the EMI repositories with what is put in EPEL by PTs and announces changes
  • But no formal EMI QA process applied

How long is this process sustainable, and can it continue?

  • At least until the end of March 2014 (end of Cristina's commitment)

ETICS: PTs are no longer using it

  • ETICS service will be shut down at the end of August

WLCG repository created

  • Managed by Maarten
  • HEP_OSLIBS, xrootd plugins...
  • Packages that cannot fit in EPEL
  • UMD doesn't include them: they are community specific

EPEL: the main package repository, but there are many packages that cannot fit into it

  • Packages not compliant with EPEL rules (in particular Java packages)
  • Packages with not enough manpower to maintain them in EPEL

Impact of new situation on sites

  • Sites running pure UMD: no change
  • Sites running UMD + the WLCG repo: no change
  • Sites installing from EMIrepo: will have to change their installation source after end of March

Specific SW

  • dCache: specific release process for T1s, followed by a few T2s
  • ARC: releases both to EPEL and EMI, no foreseen impact after the end of EMIrepo maintenance

Open questions

  • Long term management of EMI repositories: PTs could release directly to UMD, but what about QA verification?
    • Need to ensure that a new version doesn't break other parts of the MW or experiment production; see the effort done for the EMI-1/2 WN validation
    • Need a WLCG-testing as there is an EPEL-testing
    • Need some production resources using EPEL-testing to run real use cases on new stuff
  • UI/WN for WLCG: current EMI version contains UNICORE that we don't need and doesn't contain WLCG specific packages. Do we want to have a WLCG WN/UI?
  • ARC and OSG: release directly in EPEL
  • Planning and PT synchronisation: probably easily done by pre-GDB/GDB, changes driven by the community
  • Infrastructure transition: a combination of EGI and WLCG Ops Coord efforts
  • WLCG baseline versions
  • Release latency with the normal process PT -> EMI -> UMD -> Sites
    • Some sites use snapshots of "WLCG versions" (advanced releases)
  • EPEL is a moving repo not under our control: our dependencies can move without notice/control
    • Could have some control, but it requires investing time/manpower

Still too early to tell whether the current process works

Without UMD, WLCG will need to invest

  • UMD covers most of our needs
  • Would be good to mirror the EPEL stable/testing structure for UMD and WLCG repositories

Discussion

Tiziana: QA is now handled by PTs by design

  • Markus: but there is no requirement to pass QA before pushing something to the repos

Tiziana: would be great to have more WLCG sites participating in the staged rollout activities

  • Markus: staged rollout is not doing a full coverage of WLCG use cases
  • Peter: EMI would be interested to see WLCG testing fed as input to the staged rollout (2 week deadline)
  • Markus: could be feasible if we manage to configure some resources to automatically test what is in EPEL-testing (see the sketch after this discussion). Still some work needed from volunteer sites and experiments. Need to decide as a community whether we go for it: there would be a clear benefit if we succeed.
    • May take advantage of experiment testing infrastructures like HammerCloud
    • Need a clear statement from MB before starting the real work: will be discussed at next MB
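
A hypothetical sketch of the "automatically test what is in EPEL-testing" idea discussed above: a nightly job on a volunteer node pulls the candidate package with the epel-testing repository enabled and then runs whatever smoke test the site or experiment provides. The package name and test command are placeholders, not an agreed WLCG procedure.

    # Nightly validation sketch: install the candidate from epel-testing, then run a smoke test.
    import subprocess
    import sys

    PACKAGE = "dpm"                                        # example package to validate
    SMOKE_TEST = ["/usr/local/bin/run-wlcg-smoke-test"]    # placeholder test script

    def run(cmd):
        print("+", " ".join(cmd))
        return subprocess.call(cmd)

    # Pull the candidate version, enabling the epel-testing repo for this transaction only.
    if run(["yum", "-y", "--enablerepo=epel-testing", "update", PACKAGE]) != 0:
        sys.exit("installation from epel-testing failed")

    # Run the site/experiment validation (could be fed by HammerCloud-style jobs).
    sys.exit(run(SMOKE_TEST))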

OSG Update - B. Bockelmann

A few organizational changes and new faces in the last 18 months

  • Executive Director: Lothar Bauerdick
  • Technology Area: Brian Bockelmann
    • SW team: Tim Cartwright
  • Networking Area: Shawn McKee
  • Production: Rob Quick (interim)
    • Release Team: Tim Theisen
    • Campus Grids: Rob Gardner

Latest release (3.1.21) released yesterday

  • Available mainly as RPM, tarball only for clients
  • OSG 3 is a successor for OSG 1.x and VDT 2.x
  • OSG 1.2 support dropped at the end of May

OSG moving to a predictable release schedule: second Tuesday of every month

  • Second possible slot for urgent updates if necessary: 4th Tuesday

SHA-2 readiness: the whole OSG stack tests fine with SHA-2 except for the dCache srmclient

  • OSG is running several versions behind the current srmclient: hope that an upgrade will fix the problem (it missed the last release)

Java oddities: OSG has been working to rebase Java stack on OpenJDK 1.7

  • But RH packages bring some unwanted dependencies: tomcat6 requires Java 6 despite running fine with 7; log4j brings in GCJ 1.5...
  • Java components make service releases very large: the OSG CE goes from 10 MB without Java components to 400 MB with Java dependencies
    • Hope that RHEL7 will improve the situation

Other major projects in progress

  • Upgrade WLCG sites to HDFS2
  • WN into CVMFS
  • Collecting goals/requirements for next major release (3.2)
    • Major release planned once a year

HTCondor-CE: continuing to work with Condor team to release a HTCondor-based CE

  • Release delayed to allow HTCondor 8.0.0 to get in
  • Long rollout expected: no urgent drivers, but it will allow consolidating and reducing the components to maintain

OSG security: concerns about the new automated banning requirement; studying possible ways to implement it

  • No solution ready out-of-the-box
  • One possible solution is FNAL SAZ
  • Would like to see deployment timelines clarified: what are they for EGI sites?

CVMFS service started for OSG: OASIS

  • Includes a Stratum-0 (login host) and a Stratum-1 server
  • Open for OSG VOs who don't have a CVMFS server: not intended for WLCG
  • Monitoring and Stratum-1 shared with WLCG
  • Studying deploying the WN through OASIS

OSG Networking

  • Encouraging deployment of perfSONAR-PS across OSG sites
  • Working to aggregate measurements into a centralized dashboard: the dashboard is intended to be reusable outside OSG, looking forward to discussing this with WLCG

Other non WLCG activities

  • Job submission over SSH
  • HTCondor backends for commonly used science apps like R
  • Meeting traceability requirements without use of grid certificates
  • Providing a submission env for NSF XD (XSEDE) allocations
  • Improving the glideinWMS submit interface

OSG usage: significant non-WLCG and non-HEP workload

  • At peak, ~40% of the workload is non-WLCG

Started an OSGVO as an umbrella for users without a traditional grid VO

  • In particular for short-lived projects where creating a VO doesn't really make sense
  • Fulfilling traceability/security requirements for these users/projects

Discussion

  • Tiziana about automatic banning
    • No firm timeline yet: end of year is probably a reasonable one...
    • Will not impose a deadline that cannot be met because technology to do it is not available
    • Central ARGUS server just set up; it will produce a file format in addition to direct publication through ARGUS
  • On central banning, implementation will be discussed next week: ARGUS technology, translation of the banned list for services/NGIs that don't use ARGUS.
    • Timescale of a couple of weeks for the decision.
  • What about joint XACML/SAML work (interoperability profile)?
    • ARGUS never implemented interoperability profile, also some interpretation disagreements
  • OSG CREAM CE evaluation status?
    • Was done in June-July 2012 but in the end focussed on Condor based CE.
