Summary of GDB meeting, October 10, 2012 (LAPP, Annecy)
Agenda
http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155073
Introduction
- CERN room for upcoming meetings uncertain: be sure to check before the meeting
- Clash in April: GDB / EGI Community Forum. Move the GDB to the week before?
- Details of forthcoming GDB topics: see slides
EMI-2 Transition
Moving to Supported MW - M. Schulz
EGI started to open tickets against sites running gLite 3.1 services and unsupported gLite 3.2 services (BDII, CREAM)
- 75 sites still running gLite 3.1 services
- 163 sites run gLite 3.2 CREAM or BDII
- No expertise left on how to build a patch for these services
Move by end of November requested by WLCG for all unsupported services (3.1 and 3.2)
- No real difficulty, in particular for the CE: a new one can be started in parallel
- Still a problem for the CREAM CE with Torque: the WNs need to be upgraded at the same time, so sites must wait for the end of the validation
- All other 3.2 services (eg. DPM, LFC) to be upgraded by end of 2012
- Their end of support is end of November, so at risk after this date
Retirement of gLite 3.1 and unsupported 3.2 - T. Ferrari
Basically all products, except WN/UI, LFC, DPM, glexec
Many broadcasts sent to sites.
Security dashboard setup as part of the Operations Portal: daily progress reports by NGI
- Sites still running obsolete services are in violation of ops policy
Starting beginning of November, open tickets will be escalated to EGI CSIRT
- Sites will be eligible for suspension if they don't respond or if there is no documented plan and timeline for the upgrade
- An answer explaining the technical reason for the delayed upgrade may be considered valid
- In case of a high security risk, sites will be asked to decommission the affected service within 24 hours
Don't forget that EMI-1 end of life is 30/4/2013: recommended migration is to EMI-2
- Anyway, SHA-2 issue requires complete upgrade of the infrastructure to EMI-2/EMI-3 by Aug 1, 2013
EMI is discussing a retirement policy under which sites will be required to decommission a service no later than 1 month after its end of life.
- Will probably come into force after end of life of EMI-1
EMI-2 WN Validation - M. Litmaath
3 wiki pages: WLCG, Atlas, CMS
- 6 sites covering different SEs, mainly SL5, 2 with SL6 WNs
- 2 sites (CNAF and BRUNEL) are on EMI-1, which is valid for the short term
- CNAF is interested in participating in the SL6 validation
ATLAS Status
- Main issue is gsidcap access failing with limited proxies: ticket opened for dCache
- CMS has not seen/reported the same problem, but they seem to use only dcap, not gsidcap
- Not even all ATLAS dCache sites use gsidcap
CMS Status
- CASTOR RFIO and dCache dcap are ok with both CMSSW_4_4_5 and 5_3_1
- 4_4_5 requires a hack for DPM RFIO
- dCache gsidcap requires a hack for all versions
- Hacks considered as tolerable
- GPFS validated too
WLCG page will be updated with the matrix WN version vs. SE implementation
- Sites should find themselves in the matrix to understand if they can upgrade
A WN update is in progress to include outstanding fixes.
Suggestion: keep a set of test queues exercising new WN versions, driven by HammerCloud with representative jobs
Conclusion: nothing prevents upgrading WNs after the next EMI release, expected on Oct. 22
- Sites can already plan their upgrades for this
Discussion
Make clear to sites what the decision is
- They are required to have plans by '''beginning of November''' and communicate them to EGI on request (tickets)
- WLCG expects plans to be carried out by '''end of November'''
Discussion: the WLCG message (upgrade by end of November) and the EGI message (give us your plan or risk suspension from Nov. 1) are consistent
Note the SHA-2 issue, which requires complete upgrade of infrastructure by 01/08/2013
Borut: ATLAS agrees with plan as long as a significant T1 downtime is not involved
- Need coordination to ensure that not all T1s are updating at the same time
Software Lifecycle Process after EMI - M. Schulz
A draft is available, attached to the agenda, summarizing the discussions in the last 3 months.
- Starting point for further discussions
Proposal summary (see slide 3)
- Roadmaps and priorities via pre-GDB
- Independent product teams
- Integration with EPEL
- Repositories: UMD/WLCG, EPEL-Stable, EPEL-Test, Legacy
- Move to a collaboration-based contribution for configuration, packaging, batch support
- Foster collaboration with EGI to avoid divergence
- In particular for rollout
Building tools: no requirement for ETICS
Ian: WLCG statement required endorsing this
- To be discussed at a future MB
ALICE SW Distribution Infrastructure - C. Grigoras
Move to Torrent to overcome problems (stability and security) with previous shared SW area
- Pre-config of sites preventing opportunistic uses
- SW version management complexity
- NFS scalability
- VOBOX security risk
The build system populates the AliEn file catalogue and repository, which is then seeded.
Torrent ALICE configuration
- Tracker and seeder run centrally at CERN
- Fed directly by build servers
- Connection to the world limited to 50 MB/s
- No inter-site Torrent communication: packages are exchanged only between WNs at the same site and the CERN seeder
- SW repository ~150 GB for all versions
- ~300 MB download by each job
Job workflow
- Download of a bootstrap script (pilot job)
- Get a job from the task queue
- Download required packages for the job
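The three steps above can be sketched as follows; the function names and return values are placeholders for illustration, not the actual AliEn API:

```python
# Sketch of the ALICE pilot-job workflow described above.
# All bodies are placeholders: the real pilot talks to the AliEn
# task queue and fetches packages over BitTorrent (~300 MB per job).

def download_bootstrap():
    """Step 1: the pilot fetches a bootstrap script."""
    return "bootstrap.sh"

def get_job_from_task_queue():
    """Step 2: ask the central task queue for a job (contents illustrative)."""
    return {"id": 42, "packages": ["AliRoot", "GEANT3"]}

def download_packages(packages):
    """Step 3: fetch the packages the job needs, via torrent."""
    return {pkg: f"/scratch/packages/{pkg}" for pkg in packages}

def run_pilot():
    bootstrap = download_bootstrap()
    job = get_job_from_task_queue()
    paths = download_packages(job["packages"])
    return bootstrap, job, paths
```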
How peers are discovered
- Clients publish their private IP in the tracker: allows direct exchange between peers with private IPs on the same network
- Multicast to discover peers on the same network
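A minimal sketch of the first mechanism (publishing the private IP to the tracker); the announce structure is illustrative, not the actual ALICE/BitTorrent wire protocol:

```python
# Sketch of the peer-discovery idea described above: each client tells the
# tracker its private IP, so peers behind the same NAT can talk directly.
import socket

def local_private_ip():
    """Best-effort guess of the host's private address (no packets are sent;
    connecting a UDP socket only triggers a routing lookup)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("10.255.255.255", 1))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()

def tracker_announce(info_hash, port, ip=None):
    """Build announce parameters, publishing the private IP explicitly
    so that same-network peers can exchange pieces without NAT traversal."""
    return {
        "info_hash": info_hash,
        "port": port,
        "ip": ip or local_private_ip(),
    }
```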
Firewall config: documented, see slides in particular
- Outgoing to alitorrent.cern.ch on bittorrent ports
- WN to WN
- See presentation for details
Status
- Already in its 2nd year at larger sites, e.g. CERN
- One month ago, each site was asked for permission to use torrent: very few refused
- Working with the remaining sites to solve the (mostly) non-technical issues
- T1: 5/6 migrated, T2s: 36/78 migrated (mostly last month)
- Aiming for full migration before next AliEn version is deployed: would like to drop Packman service on VOBOX
- This will remove the request for space on VOBOX
CNAF reports an issue with short-lived jobs which flooded their site, causing network saturation and impacting CVMFS (high disk load)
- Was due to a problem, now fixed by ALICE
Federated Storage WG - F. Furano
Main experiment expectations
- Easy/reliable direct access to data
- Coherent file naming with access to everything from everywhere
Good agreement on the vision; the differences are more about how this is provided to applications
Agreement to separate federation implementation from what applications want to achieve
Possible applications
- Fail-over: client is redirected to another location
- Self-healing: the storage corrects the problem (missing file), transparently to apps
- Site analysis cache: T3s don't want to preplace files but want them to remain in place as long as they are used
- Often repeated access to the same files
- Could participate in the federation for the files they have in cache (until they expire)
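The fail-over case above can be sketched as a client-side loop over replicas; `open_replica` is a stand-in for a real xrootd/http open call:

```python
# Sketch of the "fail-over" federation use case: the client walks an
# ordered list of replica URLs and falls back to the next one on error.

def read_with_failover(replicas, open_replica):
    """Try each replica in turn; raise only if all of them fail."""
    last_error = None
    for url in replicas:
        try:
            return open_replica(url)
        except IOError as err:
            last_error = err  # remember why, then try the next location
    raise IOError(f"all replicas failed: {last_error}")
```

Self-healing and caching would sit on the server side instead, but the client-visible contract is the same: the read eventually succeeds from some location.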
Tightly coupled to experiment namespaces and catalogues: specific to each experiment
- ATLAS-specific issue with LFN->SFN translation, because they decorated/mangled SFNs with site-specific tokens
- Currently (FAX prototype), they wrote an Xrootd plugin that queries the LFC on the fly to get the SFN
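The on-the-fly translation can be sketched as follows; the catalogue is represented here as a plain dict and the site token is purely illustrative:

```python
# Sketch of the on-the-fly LFN->SFN translation done by the FAX Xrootd
# plugin: look the logical name up in the catalogue (LFC) at access time.

def lfn_to_sfn(lfn, catalogue, site_token=""):
    """Resolve a logical file name to the site-specific physical name."""
    try:
        sfn = catalogue[lfn]
    except KeyError:
        raise FileNotFoundError(f"no replica registered for {lfn}")
    # ATLAS SFNs were "decorated" with site-specific tokens, which is
    # exactly what makes a static, site-independent mapping impossible.
    return site_token + sfn
```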
Security: make the data of a storage federation readable to anybody in the collaboration
- Authentication is still required...
Growing need for standard tools to access the federation, to avoid duplication by each experiment with a risk of incompatibility
Xroot and http/DAV are the main candidates: experiments should coordinate to maximise the chance of http being adopted
Prototypes available both for ATLAS (FAX) and CMS (AAA)
- Most of the data about to be available through these federations
Plan to make a tech review/monography of potential technologies: Xroot, http
Discussion
- Philippe argues LFC constitutes a federation: nobody disagrees but WG intends to do more...
Actions in Progress
Operations Coordination - M. Girone
Kicked off on Sep. 24
- Internal task forces are the driving force
Fortnightly meetings on the 1st and 3rd Thursday of each month
Good representation of experiments and sites
- T1 representatives: some missing, would be great to have all of them
- EGI, EMI and OSG represented
Mailing list
wlcg-operations@cern.ch will be filled with all contact names registered in GOCDB
Quarterly Operations planning: a meeting to prepare plans and propose them to MB
Task forces currently created
- CVMFS deployment
- glexec deployment
- perfSONAR deployment
- Tool evolutions
- Squid monitoring
- FTS3 validation and deployment
- SHA-2/RFC proxies
- ... see presentation
WN Environment Information - U. Schwickerath
Proposal on Twiki: almost final, feedback possible/welcome
2 environment variables set for each job
- $MACHINEFEATURES: points to a directory describing machine environment like hs06, shutdowntime, jobslots, number of cores
- $JOBFEATURES: points to a directory giving information specific to the current job (cpu/wallclock limit, disk limit, mem_limit, allocated_CPU...)
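A possible consumer of these variables, assuming the proposal's one-file-per-key directory layout (key names taken from the examples above):

```python
# Sketch of a consumer of the proposed $MACHINEFEATURES / $JOBFEATURES
# directories, assuming each key is a file in the directory (e.g. a file
# named "hs06" whose content is the value). Key names follow the
# examples in the proposal; the exact layout is an assumption here.
import os

def read_feature(base_dir, key, default=None):
    """Return the value of one feature file, or default if absent."""
    try:
        with open(os.path.join(base_dir, key)) as f:
            return f.read().strip()
    except OSError:
        return default

def machine_power():
    """Example: HS06 power of the machine, if the site publishes it."""
    mf = os.environ.get("MACHINEFEATURES")
    if not mf:
        return None
    return read_feature(mf, "hs06")
```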
Implementation availability
- Torque 2.5/PBS 2.3: maintained by NIKHEF, probably not implementing the latest specification
- PBS Professional and Grid Engine: KIT, up-to-date
- LSF: Ulrich, up-to-date, in production since last Monday
- Several hacks to work around the fact that LSF cannot be queried directly: may require some tweaking at other sites
Discussion on validation strategy
- ATLAS (Simone): there are things ATLAS could use straight away
HUC Support after EGI-Inspire SA3 - J. Shiers
HUC in EGI-Inspire (SA3) = HEP + Life Sciences (5 VOs) + Astro-physics + Earth Science
- HEP = the only world-wide community, 93% of consumed CPU time
- SA3 was a 3 year effort in a 4 year project
- End affects more than just HEP
Main SA3 goal: transition to sustainable support
- Identify tools that benefit several communities
- Original plan was to move these tools as part of the core infrastructure, but it was understood it is better to have them maintained by those who use them
Achievements:
- Non-disruptive transition from EGEE to EGI during LHC data taking
- Continuous improvement
- Rapid fixing of problems
- Identification and delivery of common solutions
- All a direct benefit of a multi-year and multi-community effort: not to be forgotten when looking at the future
Today: fewer incidents than in the past, and a lower impact of those incidents on experiment work
EGI review stated SA3 had exceeded its objectives.
IT-ES (Experiment Support) is a unique resource in WLCG, with addition of a few people from other groups (in IT and PH)
- Largely funded by EGI-Inspire SA3
- 10 people with advanced skills leaving end of April...
Sustainability plans for SA3: see slide 30 (taken from sustainability section of last deliverable)
- Persistency frameworks: POOL by the experiments who use it, COOL/CORAL by CERN IT
- Data analysis tools: long-term effort, outside SA3 scope
- Data management tools: long-term support taken over by PH department
- Ganga: CERN will stop its involvement a few months after SA3 end, to be picked up by other institutes that are members of the project
- Experiment dashboard: all requested features delivered, long term support covered by CERN-Russia and CERN-India agreements
FP8: first call expected in 2013 with first funds in early 2014
- Calls related to data management and data preservation are very likely
- We need to decide if we want to be part of these projects: hope yes
- Will not solve the problems directly related to experiment support nor address the gap between SA3 and these projects
Discussion with OSG/US about an IETF-like task force around data management
A few meetings on the road to build consortiums to apply to future calls
- Barcelona (October): open access and data preservation
- Data preservation is a hot topic and probably a requirement for future projects
- March: SA3 workshop?
Ian Bird's remarks
- If we want to attract new money, we need to do multi-community work (attract other sciences)
- Requires adopting standard protocols
- Horizon is 2020: need a grand challenge at this level
- We cannot compensate for the people we will lose: need to prioritize what we want to continue; if we continue integration, we need to stop something else
Ian Fisk's remark
- Still a medium term continuity problem between 2020 and now
- Best success of EGI was this multi-community support with a team that took years to build: will take time to recover from the loss if it happens...
Ian Bird: need prioritisation, to lose some tasks, or find other effort within the collaboration and outside CERN
- Forthcoming meeting with exp computing coordinators and IT and PH depts
Follow-up discussion with experiments at a future GDB: December?
OSG Update - B. Bockelman
Recent news
- Alan Roy left OSG
- Tim Cartwright took over the interim lead
- Hiring a new person should happen in a few months after refining the position
Release cycle: 2 releases a month, one "major/significant", the other one for fixes
- Current release : 3.1.10
- 51% of CEs running OSG 3 after 1 year, but 49% still on OSG 1.2, which receives "critical security updates only"
Worked to improve the coordination with Globus team
- Currently applying 30-40 patches on top of Globus code: would like to lower this number
- 4 joint projects related to GRAM
- Now have commit rights to Globus repository
- opportunity for European patches to Globus (contact Brian)
SHA-2 updates: OSG trying to keep the internal deadline of Jan 1, 2013
- Has been a good exercise: beginning to see the light at the end of the tunnel
- Known minor issue with Gratia server
- Xrootd not yet checked and probably broken
- Central service survey not yet complete but no major problem expected
- Having access to jglobus2 repository: lot of patches merged
- Server performance seems acceptable, though noticeably slower; clients are not heavily used (apart from dCache)
- jglobus2 can work with SHA-1 certificates and supports legacy/pre-RFC/RFC proxies: may allow decoupling RFC and SHA-2 proxy support
- Tested client/server backward compatibility with older jglobus: non trivial fixes required
- Basically provides BeStMan2 and dCache with a simple path forward
- Maarten (after the meeting): the dCache developers warn against being too optimistic here! They will investigate if the improved JGlobus-2 is sufficiently OK for the dCache deployments and use cases that we need to care about. To be continued...
Info service
- So far neither USCMS nor USATLAS have identified any BDII usage they expect to continue in the long term
- Recently, after discussion with O. Keeble, decided that FTS3 would be better off using MyOSG info services instead of BDII
- Used in fact only for monitoring
- Hence, no plan to add GLUE2 to the information published into the BDII: controversial but matches well local requirements and effort level
- Impact is not clear on European side usage of BDII
- Tiziana: GOCDB planning to rely on GLUE2 interface with XML rendering. Is it something considered by OSG?
Multicore support: working on implementing GRAM multicore interface
- Aware of similar work for Condor
- Not as used as hoped: 1/2 T2s have support enabled
WN environment variables: "watch and wait" for what happens with the proposal
CE strategy
- Worried about being tied to one particular technology: want to pursue several directions
- Continue to shine and polish GRAM
- One BNL-based team looking at CREAM: EMI-2 version in OSG testing repo
- Condor-CE also being looked at: still a few months needed before having something releasable
pre-GDB summary - W. Bhimji
Benchmarking tools: several now available, either on the application side (ROOT) or on the storage side (DPM, EOS)
- Still have to converge on the top-10 metrics
Storage interfaces: progress on the functionality map and possible alternative to SRM
- Very few areas where there is no alternative to SRM
- GFAL2 is promising for solving some of the issues
- SURL-TURL translation: redirection is required to replace SRM in a sustainable way (with hacks)
Network WG - M. Ernst
The April C-RRB report encourages continued implementation of intelligent data management, allowing efficient and cost-effective access to data.
- In particular impact on network bandwidth should be considered
WG created to be a forum between application developers and network experts/providers
- Did not exist so far: LHCOPN is only about infrastructure; LHCONE and similar projects multiplied the places where experiments and big sites talk (not always in a coordinated way)
- Within LHCONE, network providers often make assumptions about what experiments are trying to do
- Their main reference remains the "Bos/Fisk note"...
- WLCG is primarily concerned with CPU and storage
- Propose WG conduct a study on network requirements
- Experiments must realize that network is not a passive pipe
- Network providers are looking to engage us in traffic engineering
- Dynamic circuits and other intelligent services require active participation of applications to provide additional bandwidth to high-priority data
WG focusing on network infrastructure and network services: aims at finding common solutions across experiments and if possible agree on a strategic direction
- Address functional and perf issues vertically in the network stack
- Examples shown using ATLAS, where the network is used in brokering decisions or for better placement decisions
- Interface between LHC community and network providers
- Voice LHC common position in a coherent way
Network controlled by applications is the next step: Software Defined Networks
- OpenFlow is the current solution adopted by several major players (Google, network equipment manufacturers)
- Is LHC community just waiting or trying to participate in this development?
- Researchers eager for dialogue now
Virtualized Network Infrastructure for Domain Specific Applications
Network abstractions are a key research topic.
Network community is eager to see LHC community involved
WG members
- Experiment representatives nominated
- Sites: in progress through CB first, suggest rotation to foster diversity
- Network providers: hard to identify who is the body to contact...
- Here also rotation is probably a good thing
Meeting on Intelligent Network Services at CERN on Dec. 13
Snowmass 2013 has computing and networking as an explicit topic/subtopic
- Snowmass is tasked to charter HEP for the next 10 years in US...
- Michael co-chair of the network subgroup
--
MichelJouvin - 15-Oct-2012