Summary of GDB meeting, October 10, 2012 (LAPP, Annecy)
Agenda
http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155073
Introduction
- CERN room for upcoming meetings uncertain: be sure to check before the meeting
- Clash in April: GDB / EGI Community Forum. Move the GDB to the week before?
- Details of forthcoming GDB topics: see slides
EMI-2 Transition
Moving to Supported MW - M. Schulz
EGI started to open tickets against sites running gLite 3.1 services and unsupported gLite 3.2 services (BDII, CREAM)
- 75 sites still running gLite 3.1 services
- 163 sites run gLite 3.2 CREAM or BDII
- No expertise left on how to build a patch for these services
Move by end of November requested by WLCG for all unsupported services (3.1 and 3.2)
- No real difficulty, in particular for the CE: a new one can be started in parallel
- Still a problem for the CREAM CE with Torque: the WNs need to be upgraded at the same time, so sites must wait for the end of the validation
- All other 3.2 services (eg. DPM, LFC) to be upgraded by end of 2012
- Their end of support is end of November, so at risk after this date
Retirement of gLite 3.1 and unsupported 3.2 - T. Ferrari
Basically all products, except WN/UI, LFC, DPM, glexec
Many broadcasts sent to sites.
Security dashboard setup as part of the Operations Portal: daily progress reports by NGI
- Sites still running obsolete services are in violation of ops policy
Starting beginning of November, open tickets will be escalated to EGI CSIRT
- Sites will be eligible for suspension if they don't respond or if there is no documented plan and timeline for the upgrade
- An answer explaining the technical reason for the delayed upgrade may be considered valid
- In case of a high security risk, sites will be asked to decommission the affected service within 24 hours
Don't forget that EMI-1 end of life is 30/4/2013: recommended migration is to EMI-2
- Anyway, SHA-2 issue requires complete upgrade of the infrastructure to EMI-2/EMI-3 by Aug 1, 2013
EMI is discussing a retirement policy under which sites will be required to decommission a service no later than 1 month after its end of life.
- Will probably come into force after end of life of EMI-1
EMI-2 WN Validation - M. Litmaath
3 wiki pages: WLCG, Atlas, CMS
- 6 sites covering different SEs, mainly SL5, 2 with SL6 WNs
- 2 sites (CNAF and BRUNEL) are on EMI-1, which is valid for the short term
- CNAF is interested in participating in the SL6 validation
ATLAS Status
- Main issue is gsidcap access failing with limited proxies: ticket opened for dCache
- CMS has not seen/reported the same problem, but they seem to use only dcap, not gsidcap
- Not even all ATLAS dCache sites use gsidcap
CMS Status
- CASTOR RFIO and dCache dcap are ok with both CMSSW_4_4_5 and 5_3_1
- 4_4_5 requires a hack for DPM RFIO
- dCache gsidcap requires a hack for all versions
- Hacks considered as tolerable
- GPFS validated too
WLCG page will be updated with the matrix WN version vs. SE implementation
- Sites should find themselves in the matrix to understand if they can upgrade
A WN update is in progress to include outstanding fixes.
Suggestion: keep a set of test queues exercising new WN versions, driven by HammerCloud with representative jobs
Conclusion: nothing prevents upgrading WNs after the next EMI release, expected on Oct. 22
- Sites can already plan their upgrades for this
Discussion
Make clear to sites what the decision is
- They are required to have plans by '''beginning of November''' and communicate them to EGI on request (tickets)
- WLCG expects plans to be carried out by '''end of November'''
Discussion: the WLCG message (upgrade by end of November) and the EGI message (give us your plan or risk suspension from Nov. 1) are consistent
Note the SHA-2 issue, which requires complete upgrade of infrastructure by 01/08/2013
Borut: ATLAS agrees with plan as long as a significant T1 downtime is not involved
- Need coordination to ensure that not all T1s are updating at the same time
Software Lifecycle Process after EMI - M. Schulz
A draft is available, attached to the agenda, summarizing the discussions in the last 3 months.
- Starting point for further discussions
Proposal summary (see slide 3)
- Roadmaps and priorities via pre-GDB
- Independent product teams
- Integration with EPEL
- Repositories: UMD/WLCG, EPEL-Stable, EPEL-Test, Legacy
- Move to a collaboration-based contribution for configuration, packaging, batch support
- Foster collaboration with EGI to avoid divergence
- In particular for rollout
Building tools: no requirement for ETICS
Ian: WLCG statement required endorsing this
- To be discussed at a future MB
ALICE SW Distribution Infrastructure - C. Grigoras
Move to Torrent to overcome problems (stability and security) with previous shared SW area
- Pre-config of sites preventing opportunistic uses
- SW version management complexity
- NFS scalability
- VOBOX security risk
The build system populates the AliEn file catalogue and repository, which is then seeded.
Torrent ALICE configuration
- Tracker and seeder run centrally at CERN
- Fed directly by build servers
- Connection to the world limited to 50 MB/s
- No inter-site Torrent communication: packages are exchanged only between WNs at the same site and the CERN seeder
- SW repository ~150 GB for all versions
- ~300 MB download by each job
Job workflow
- Download of a bootstrap script (pilot job)
- Get a job from the task queue
- Download required packages for the job
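The three steps above can be sketched as follows; the function names and return values are placeholders for illustration, not the actual AliEn API:

```python
# Sketch of the ALICE pilot-job workflow described above.
# All bodies are placeholders: the real pilot talks to the AliEn
# task queue and fetches packages over BitTorrent (~300 MB per job).

def download_bootstrap():
    """Step 1: the pilot fetches a bootstrap script."""
    return "bootstrap.sh"

def get_job_from_task_queue():
    """Step 2: ask the central task queue for a job (contents illustrative)."""
    return {"id": 42, "packages": ["AliRoot", "GEANT3"]}

def download_packages(packages):
    """Step 3: fetch the packages the job needs, via torrent."""
    return {pkg: f"/scratch/packages/{pkg}" for pkg in packages}

def run_pilot():
    bootstrap = download_bootstrap()
    job = get_job_from_task_queue()
    paths = download_packages(job["packages"])
    return bootstrap, job, paths
```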
How peers are discovered
- Clients publish their private IP in the tracker: allows direct exchange between peers with private IPs on the same network
- Multicast to discover peers on the same network
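A minimal sketch of the first mechanism (publishing the private IP to the tracker); the announce structure is illustrative, not the actual ALICE/BitTorrent wire protocol:

```python
# Sketch of the peer-discovery idea described above: each client tells the
# tracker its private IP, so peers behind the same NAT can talk directly.
import socket

def local_private_ip():
    """Best-effort guess of the host's private address (no packets are sent;
    connecting a UDP socket only triggers a routing lookup)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("10.255.255.255", 1))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()

def tracker_announce(info_hash, port, ip=None):
    """Build announce parameters, publishing the private IP explicitly
    so that same-network peers can exchange pieces without NAT traversal."""
    return {
        "info_hash": info_hash,
        "port": port,
        "ip": ip or local_private_ip(),
    }
```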
Firewall config: documented, see slides in particular
- Outgoing to alitorrent.cern.ch on bittorrent ports
- WN to WN
- See presentation for details
Status
- Already in its 2nd year at larger sites, e.g. CERN
- One month ago, each site was asked for permission to use torrent: very few refused
- Working with the remaining sites to solve the (mostly) non-technical issues
- T1: 5/6 migrated, T2s: 36/78 migrated (mostly last month)
- Aiming for full migration before next AliEn version is deployed: would like to drop Packman service on VOBOX
- This will remove the request for space on VOBOX
CNAF reports an issue with short-lived jobs which flooded their site, causing network saturation and impacting CVMFS (high disk load)
- Was due to a problem, now fixed by ALICE
Federated Storage WG - F. Furano
Main experiment expectations
- Easy/reliable direct access to data
- Coherent file naming with access to everything from everywhere
Good agreement on the vision; the differences are more about how this is provided to applications
Agreement to separate federation implementation from what applications want to achieve
Possible applications
- Fail-over: client is redirected to another location
- Self-healing: the storage corrects the problem (missing file), transparently to apps
- Site analysis cache: T3s don't want to preplace files but want them to remain in place as long as they are used
- Often repeated access to the same files
- Could participate in the federation for the files they have in cache (until they expire)
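The fail-over case above can be sketched as a client-side loop over replicas; `open_replica` is a stand-in for a real xrootd/http open call:

```python
# Sketch of the "fail-over" federation use case: the client walks an
# ordered list of replica URLs and falls back to the next one on error.

def read_with_failover(replicas, open_replica):
    """Try each replica in turn; raise only if all of them fail."""
    last_error = None
    for url in replicas:
        try:
            return open_replica(url)
        except IOError as err:
            last_error = err  # remember why, then try the next location
    raise IOError(f"all replicas failed: {last_error}")
```

Self-healing and caching would sit on the server side instead, but the client-visible contract is the same: the read eventually succeeds from some location.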
Tightly coupled to experiment namespaces and catalogues: specific to each experiment
- ATLAS-specific issue with LFN->SFN translation, because they decorated/mangled SFNs with site-specific tokens
- Currently (FAX prototype), they wrote an Xrootd plugin that queries the LFC on the fly to get the SFN
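The on-the-fly translation can be sketched as follows; the catalogue is represented here as a plain dict and the site token is purely illustrative:

```python
# Sketch of the on-the-fly LFN->SFN translation done by the FAX Xrootd
# plugin: look the logical name up in the catalogue (LFC) at access time.

def lfn_to_sfn(lfn, catalogue, site_token=""):
    """Resolve a logical file name to the site-specific physical name."""
    try:
        sfn = catalogue[lfn]
    except KeyError:
        raise FileNotFoundError(f"no replica registered for {lfn}")
    # ATLAS SFNs were "decorated" with site-specific tokens, which is
    # exactly what makes a static, site-independent mapping impossible.
    return site_token + sfn
```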
Security: make the data of a storage federation readable to anybody in the collaboration
- Authentication is still required...
Growing need for standard tools to access the federation, to avoid duplication by each experiment with a risk of incompatibility
Xroot and http/DAV are the main candidates: experiments should coordinate to maximise the chance of http being adopted
Prototypes available both for ATLAS (FAX) and CMS (AAA)
- Most of the data about to be available through these federations
Plan to make a tech review/monography of potential technologies: Xroot, http
Discussion
- Philippe argues LFC constitutes a federation: nobody disagrees but WG intends to do more...
Actions in Progress
Operations Coordination - M. Girone
Kicked off on Sep. 24
- Internal task forces are the driving force
Fortnightly meetings on the 1st and 3rd Thursday of each month
Good representation of experiments and sites
- T1 representatives: some missing, would be great to have all of them
- EGI, EMI and OSG represented
Mailing list
wlcg-operations@cern.ch will be filled with all contact names registered in GOCDB
Quarterly Operations planning: a meeting to prepare plans and propose them to MB
Task forces currently created
- CVMFS deployment
- glexec deployment
- perfSONAR deployment
- Tool evolutions
- Squid monitoring
- FTS3 validation and deployment
- SHA-2/RFC proxies
- ... see presentation
WN Environment Information - U. Schwickerath
Proposal on Twiki: almost final, feedback possible/welcome
2 environment variables set for each job
- $MACHINEFEATURES: points to a directory describing machine environment like hs06, shutdowntime, jobslots, number of cores
- $JOBFEATURES: points to a directory giving information specific to the current job (cpu/wallclock limit, disk limit, mem_limit, allocated_CPU...)
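A possible consumer of these variables, assuming the proposal's one-file-per-key directory layout (key names taken from the examples above):

```python
# Sketch of a consumer of the proposed $MACHINEFEATURES / $JOBFEATURES
# directories, assuming each key is a file in the directory (e.g. a file
# named "hs06" whose content is the value). Key names follow the
# examples in the proposal; the exact layout is an assumption here.
import os

def read_feature(base_dir, key, default=None):
    """Return the value of one feature file, or default if absent."""
    try:
        with open(os.path.join(base_dir, key)) as f:
            return f.read().strip()
    except OSError:
        return default

def machine_power():
    """Example: HS06 power of the machine, if the site publishes it."""
    mf = os.environ.get("MACHINEFEATURES")
    if not mf:
        return None
    return read_feature(mf, "hs06")
```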
Implementation availability
- Torque 2.5/PBS 2.3: maintained by NIKHEF, probably not implementing the latest specification
- PBS Professional and Grid Engine: KIT, up-to-date
- LSF: Ulrich, up-to-date, in production since last Monday
- Several hacks to work around the fact that LSF cannot be queried directly: may require some tweaking at other sites
Discussion on validation strategy
- ATLAS (Simone): there are things ATLAS could use straight away
HUC Support after EGI-Inspire SA3 - J. Shiers
HUC in EGI-Inspire (SA3) = HEP + Life Sciences (5 VOs) + Astro-physics + Earth Science
- HEP = the only world-wide community, 93% of consumed CPU time
- SA3 was a 3 year effort in a 4 year project
- End affects more than just HEP
Main SA3 goal: transition to sustainable support
- Identify tools that benefit several communities
- Original plan was to move these tools as part of the core infrastructure, but it was understood it is better to have them maintained by those who use them
Achievements:
- Non-disruptive transition from EGEE to EGI during LHC data taking
- Continuous improvement
- Rapid fixing of problems
- Identification and delivery of common solutions
- All a direct benefit of a multi-year and multi-community effort: not to be forgotten when looking at the future
Today: fewer incidents than in the past, and a lower impact of those incidents on experiment work
EGI review stated SA3 had exceeded its objectives.
IT-ES (Experiment Support) is a unique resource in WLCG, with addition of a few people from other groups (in IT and PH)
- Largely funded by EGI-Inspire SA3
- 10 people with advanced skills leaving end of April...
Sustainability plans for SA3: see slide 30 (taken from sustainability section of last deliverable)
- Persistency frameworks: POOL by the experiments who use it, COOL/CORAL by CERN IT
- Data analysis tools: long-term effort, outside SA3 scope
- Data management tools: long-term support taken over by PH department
- Ganga: CERN will stop its involvement a few months after SA3 end, to be picked up by other institutes that are members of the project
- Experiment dashboard: all requested features delivered, long term support covered by CERN-Russia and CERN-India agreements
FP8: first call expected in 2013 with first funds in early 2014
- Calls related to data management and data preservation are very likely
- We need to decide if we want to be part of these projects: hope yes
- Will not solve the problems directly related to experiment support nor address the gap between SA3 and these projects
Discussion with OSG/US about an IETF-like task force around data management
A few meetings on the road to build consortiums to apply to future calls
- Barcelona (October): open access and data preservation
- Data preservation is a hot topic and probably a requirement for future projects
- March: SA3 workshop?
Ian Bird's remarks
- If we want to attract new money, we need to do multi-community work (attract other sciences)
- Requires adopting standard protocols
- Horizon is 2020: need a grand challenge at this level
- We cannot compensate for the people we will lose: need to prioritize what we want to continue; if we continue integration, we need to stop something else
Ian Fisk's remark
- Still a medium term continuity problem between 2020 and now
- Best success of EGI was this multi-community support with a team that took years to build: will take time to recover from the loss if it happens...
Ian Bird: need prioritisation, to lose some tasks, or find other effort within the collaboration and outside CERN
- Forthcoming meeting with exp computing coordinators and IT and PH depts
Follow-up discussion with experiments at a future GDB: December?
OSG Update - B. Bockelman
Recent news
- Alan Roy left OSG
- Tim Cartwright took over the interim lead
- Hiring a new person should happen in a few months after refining the position
Release cycle: 2 releases a month, one "major/significant", the other one for fixes
- Current release : 3.1.10
- 51% of CEs running OSG 3 after 1 year, but 49% still on OSG 1.2, which receives "critical security updates only"
Worked to improve the coordination with Globus team
- Currently applying 30-40 patches on top of Globus code: would like to lower this number
- 4 joint projects related to GRAM
- Now have commit rights to Globus repository
- opportunity for European patches to Globus (contact Brian)
SHA-2 updates: OSG trying to keep the internal deadline of Jan 1, 2013
- Has been a good exercise: beginning to see the light at the end of the tunnel
- Known minor issue with Gratia server
- Xrootd not yet checked and probably broken
- Central service survey not yet complete but no major problem expected
- Having access to jglobus2 repository: lot of patches merged
- Server performance seems acceptable, though noticeably slower; clients are not heavily used (apart from dCache)
- jglobus2 can work with SHA-1 certificates and supports legacy/pre-RFC/RFC proxies: may allow decoupling RFC and SHA-2 proxy support
- Tested client/server backward compatibility with older jglobus: non trivial fixes required
- Basically provides BeStMan2 and dCache with a simple path forward
- Maarten (after the meeting): the dCache developers warn against being too optimistic here! They will investigate if the improved JGlobus-2 is sufficiently OK for the dCache deployments and use cases that we need to care about. To be continued...
Info service
- So far neither USCMS nor USATLAS have identified any BDII usage they expect to continue in the long term
- Recently, after discussion with O. Keeble, decided that FTS3 would be better off using MyOSG info services instead of BDII
- Used in fact only for monitoring
- Hence, no plan to add GLUE2 to the information published into the BDII: controversial but matches well local requirements and effort level
- Impact is not clear on European side usage of BDII
- Tiziana: GOCDB planning to rely on GLUE2 interface with XML rendering. Is it something considered by OSG?
Multicore support: working on implementing GRAM multicore interface
- Aware of similar work for Condor
- Not as used as hoped: 1/2 T2s have support enabled
WN environment variables: "watch and wait" for what happens with the proposal
CE strategy
- Worried about being tied to one particular technology: want to pursue several directions
- Continue to shine and polish GRAM
- One BNL-based team looking at CREAM: EMI-2 version in OSG testing repo
- Condor-CE also being looked at: still a few months needed before having something releasable
pre-GDB summary - W. Bhimji
Benchmarking tools: several now available, either on the application side (ROOT) or on the storage side (DPM, EOS)
- Still have to converge on the top-10 metrics
Storage interfaces: progress on the functionality map and possible alternative to SRM
- Very few areas where there is no alternative to SRM
- GFAL2 is promising for solving some of the issues
- SURL-TURL translation: redirection is required to replace SRM in a sustainable way (with hacks)
Network WG - M. Ernst
The April C-RRB report encourages continued implementation of intelligent data management, allowing efficient and cost-effective access to data.
- In particular impact on network bandwidth should be considered
WG created to be a forum between application developers and network experts/providers
- Did not exist so far: LHCOPN is only about infrastructure; LHCONE and similar projects multiplied the places where experiments and big sites talk (not always in a coordinated way)
- Within LHCONE, network providers often make assumptions about what experiments are trying to do
- Their main reference remains the "Bos/Fisk note"...
- WLCG is primarily concerned with CPU and storage
- Propose WG conduct a study on network requirements
- Experiments must realize that network is not a passive pipe
- Network providers are looking to engage us in traffic engineering
- Dynamic circuits and other intelligent services require active participation of applications to provide additional bandwidth to high-priority data
WG focusing on network infrastructure and network services: aims at finding common solutions across experiments and if possible agree on a strategic direction
- Address functional and perf issues vertically in the network stack
- Examples shown using ATLAS, where the network is used in brokering decisions or for better placement decisions
- Interface between LHC community and network providers
- Voice LHC common position in a coherent way
Network controlled by applications is the next step: Software Defined Networks
- OpenFlow is the current solution adopted by several major players (Google, network equipment manufacturers)
- Is LHC community just waiting or trying to participate in this development?
- Researchers eager for dialogue now
Virtualized Network Infrastructure for Domain Specific Applications
Network abstractions are a key research topic.
Network community is eager to see LHC community involved
WG members
- Experiment representatives nominated
- Sites: in progress through CB first, suggest rotation to foster diversity
- Network providers: hard to identify who is the body to contact...
- Here also rotation is probably a good thing
Meeting on Intelligent Network Services at CERN on Dec. 13
Snowmass 2013 has computing and networking as an explicit topic/subtopic
- Snowmass is tasked to charter HEP for the next 10 years in US...
- Michael co-chair of the network subgroup
--
MichelJouvin - 15-Oct-2012