Summary of GDB meeting, October 10, 2012 (LAPP, Annecy)

Agenda

http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=155073

Introduction

  • CERN room for upcoming meetings uncertain: be sure to check before the meeting
  • Clash in April between the GDB and the EGI Community Forum: move the GDB to the week before?
  • Details of forthcoming GDB topics: see the slides

EMI-2 Transition

Moving to Supported MW - M. Schulz

EGI has started to open tickets against sites running gLite 3.1 services and unsupported gLite 3.2 services (BDII, CREAM)

  • 75 sites still running gLite 3.1 services
  • 163 sites run gLite 3.2 CREAM or BDII
  • No expertise left on how to build a patch for these services

Move by end of November requested by WLCG for all unsupported services (3.1 and 3.2)

  • No real difficulty, in particular for the CE: a new one can be started in parallel
    • Still a problem for the CREAM CE with Torque: the WNs need to be upgraded at the same time, so sites must wait for the end of the validation
  • All other 3.2 services (eg. DPM, LFC) to be upgraded by end of 2012
    • Their end of support is end of November, so at risk after this date

Retirement of gLite 3.1 and unsupported 3.2 - T. Ferrari

Basically all products, except WN/UI, LFC, DPM, glexec

Many broadcasts have been sent to sites.

A security dashboard has been set up as part of the Operations Portal: daily progress reports by NGI

  • Sites still running obsolete services are in violation of ops policy

Starting beginning of November, open tickets will be escalated to EGI CSIRT

  • Sites will be eligible for suspension if they don't respond or if there are no documented plans and timelines for the upgrade
    • An answer explaining the technical reason for the delayed upgrade may be considered valid
  • In case of high security risk, sites will be asked to decommission the affected service in the next 24h

Don't forget that EMI-1 end of life is 30/4/2013: recommended migration is to EMI-2

  • In any case, the SHA-2 issue requires a complete upgrade of the infrastructure to EMI-2/EMI-3 by Aug 1, 2013

EMI is discussing a retirement policy where sites will be required to decommission a service no later than 1 month after its end of life.

  • Will probably come into force after end of life of EMI-1

EMI-2 WN Validation - M. Litmaath

3 wiki pages: WLCG, ATLAS, CMS

  • 6 sites covering different SEs, mainly SL5, 2 with SL6 WNs
    • 2 sites (CNAF and BRUNEL) are on EMI-1, which is valid for the short term
  • CNAF is interested in participating in the SL6 validation

ATLAS Status

  • Main issue is gsidcap access failing with limited proxies: ticket opened for dCache
    • CMS has not seen/reported the same problem, but they appear to use only dcap, not gsidcap
    • Not even all ATLAS dCache sites use gsidcap

CMS Status

  • CASTOR RFIO and dCache dcap are ok with both CMSSW_4_4_5 and 5_3_1
  • 4_4_5 requires a hack for DPM RFIO
  • dCache gsidcap requires a hack for all versions
  • The hacks are considered tolerable
  • GPFS validated too

WLCG page will be updated with the matrix WN version vs. SE implementation

  • Sites should find themselves in the matrix to understand if they can upgrade

A WN update is in progress to include outstanding fixes.

Suggestion: keep a set of test queues for new WN versions, exercised by HammerCloud with representative jobs

Conclusion: nothing prevents upgrading the WNs after the next EMI release, expected on Oct. 22

  • Sites can already plan their upgrades for this

Discussion

The decision must be made clear to sites:

  • They are required to have plans by '''beginning of November''' and communicate them to EGI on request (tickets)
  • WLCG expects plans to be enforced by '''end of November'''

Discussion: the WLCG message (upgrade by end of November) and the EGI message (provide your plan or risk suspension from Nov. 1) are consistent

Note the SHA-2 issue, which requires complete upgrade of infrastructure by 01/08/2013

Borut: ATLAS agrees with plan as long as a significant T1 downtime is not involved

  • Need coordination to ensure that not all T1s are updating at the same time

Software Lifecycle Process after EMI - M. Schulz

A draft is available, attached to the agenda, summarizing the discussions in the last 3 months.
  • Starting point for further discussions

Proposal summary (see slide 3)

  • Roadmaps and priorities via pre-GDB
    • Requirements via GGUS
  • Independent product teams
  • Integration with EPEL
  • Repositories: UMD/WLCG, EPEL-Stable, EPEL-Test, Legacy
  • Move to a collaboration-based contribution for configuration, packaging, batch support
  • Foster collaboration with EGI to avoid divergence
    • In particular for rollout

Build tools: no requirement for ETICS

Ian: a WLCG statement endorsing this is required

  • To be discussed at a future MB

ALICE SW Distribution Infrastructure - C. Grigoras

Move to Torrent to overcome problems (stability and security) with previous shared SW area
  • Pre-configuration of sites preventing opportunistic use
  • SW version management complexity
  • NFS scalability
  • VOBOX security risk

The build system populates the AliEn file catalogue and the repository, which is then seeded.

Torrent ALICE configuration

  • Tracker and seeder run centrally at CERN
    • Fed directly by build servers
    • Connection to the world limited to 50 MB/s
  • No intersite Torrent communication: packages are exchanged only between WNs at the same site and with the CERN seeder
  • SW repository ~150 GB for all versions
    • ~300 MB download by each job

Job workflow (see the sketch after this list)

  • Download of a bootstrap script (pilot job)
  • Get a job from the task queue
  • Download required packages for the job
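
A minimal, purely illustrative sketch of this workflow in Python; the bootstrap path, task-queue endpoint, JSON field names and the use of a generic torrent client are assumptions, not the actual AliEn interfaces:

<verbatim>
# Hypothetical pilot workflow sketch; endpoints and field names are placeholders.
import json
import subprocess
import urllib.request

BOOTSTRAP_URL = "http://alitorrent.cern.ch/bootstrap.sh"    # hypothetical path
TASK_QUEUE_URL = "https://alien-taskqueue.example/getjob"   # hypothetical endpoint

def run_pilot():
    # 1. The pilot job downloads a bootstrap script from the central service.
    with open("bootstrap.sh", "wb") as f:
        f.write(urllib.request.urlopen(BOOTSTRAP_URL).read())

    # 2. It asks the central task queue for a job to run.
    job = json.load(urllib.request.urlopen(TASK_QUEUE_URL))

    # 3. It downloads the packages the job needs (~300 MB) via torrent,
    #    then starts the payload.
    for torrent in job["package_torrents"]:
        subprocess.run(["aria2c", torrent], check=True)     # any torrent client
    subprocess.run(["bash", "bootstrap.sh", str(job["id"])], check=True)

if __name__ == "__main__":
    run_pilot()
</verbatim>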

How peers are discovered (illustrated by the sketch after this list)

  • Clients publish their private IP in the tracker: allows exchange between peers with private IPs on the same network
  • Multicast to discover peers on the same network
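
As a generic illustration of the local-discovery idea (the multicast group, port and message format below are arbitrary placeholders, not ALICE's actual choices):

<verbatim>
# Generic local peer discovery sketch: a WN announces its private IP on a
# multicast group so that other WNs on the same network can contact it directly.
# Group, port and message format are placeholders.
import json
import socket

MCAST_GROUP = "239.255.0.1"   # placeholder multicast group
MCAST_PORT = 50000            # placeholder port

def announce_peer(private_ip, listen_port):
    """Tell nearby worker nodes where they can fetch pieces from us."""
    message = json.dumps({"ip": private_ip, "port": listen_port}).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)  # stay on the local network
    sock.sendto(message, (MCAST_GROUP, MCAST_PORT))
    sock.close()

def wait_for_peer():
    """Receive one announcement from another worker node on the same network."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    membership = socket.inet_aton(MCAST_GROUP) + socket.inet_aton("0.0.0.0")
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    data, addr = sock.recvfrom(4096)
    sock.close()
    return json.loads(data.decode())
</verbatim>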

Firewall configuration is documented (see the presentation for details):

  • Outgoing to alitorrent.cern.ch on the BitTorrent ports
  • WN to WN

Status

  • Already in use for a 2nd year at larger sites, e.g. CERN
  • One month ago, each site was asked for permission to use Torrent: very few refused
  • Working with the remaining sites to solve the (mostly) non-technical issues
  • T1: 5/6 migrated, T2s: 36/78 migrated (mostly last month)
  • Aiming for full migration before next AliEn version is deployed: would like to drop Packman service on VOBOX
    • This will remove the need for space on the VOBOX

CNAF reports an issue with short-lived jobs which flooded their site causing network saturation and impact on CVMFS (high disk load)

  • Was due to a problem, now fixed by ALICE

Federated Storage WG - F. Furano

Main experiment expectations
  • Easy/reliable direct access to data
  • Coherent file naming with access to everything from everywhere
    • Including WAN access

Good agreement on the vision; the differences are more about the way this is provided to applications

Agreement to separate federation implementation from what applications want to achieve

Possible applications

  • Fail-over: client is redirected to another location
  • Self-healing: the storage corrects the problem (missing file), transparently to apps
  • Site analysis cache: T3s don't want to preplace files but want them to remain in place as long as they are used
    • Often repeated access to the same files
    • Could participate in the federation for the files they have in cache (until they expire)

Tightly coupled to experiment namespaces and catalogues: specific to each experiment

  • ATLAS-specific issue with LFN->SFN translation, because they decorated/mangled SFNs with site-specific tokens
    • Currently (FAX prototype), an Xrootd plugin queries the LFC on the fly to get the SFN (see the sketch below)
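
A toy illustration of the translation problem (the names, token formats and the in-memory catalogue below are invented; the real FAX plugin obtains the SFN by querying the LFC through its API):

<verbatim>
# Toy LFN -> SFN translation: because each site decorates physical names with
# its own site-specific tokens, the SFN cannot be derived from the global LFN
# by a simple rule, so a catalogue lookup is needed. All names are invented.
SITE_CATALOGUE = {
    # (site, LFN) -> SFN, standing in for the on-the-fly LFC query
    ("SITE_A", "/atlas/data/file1.root"):
        "root://se.site-a.example//atlasdatadisk/data/ab/cd/file1.root",
    ("SITE_B", "/atlas/data/file1.root"):
        "root://se.site-b.example//pnfs/site-b.example/atlas/data/file1.root",
}

def lfn_to_sfn(site, lfn):
    """Resolve the site-specific SFN for a global LFN, as the Xrootd plugin
    does by asking the LFC on the fly."""
    try:
        return SITE_CATALOGUE[(site, lfn)]
    except KeyError:
        raise LookupError("no replica of %s registered at %s" % (lfn, site))

print(lfn_to_sfn("SITE_A", "/atlas/data/file1.root"))
</verbatim>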

Security: make the data of a storage federation readable to anybody in the collaboration

  • Authentication is still required...

This raises the need for standard tools to access the federation, to avoid duplication by each experiment with the risk of incompatibility

Xroot and HTTP/DAV are the main candidates: experiments should coordinate to maximise the chance of HTTP being adopted

Prototypes available both for ATLAS (FAX) and CMS (AAA)

  • Most of the data is about to be available through these federations

Plan to produce a technical review/monograph of the potential technologies: Xroot, HTTP

Discussion

  • Philippe argues that the LFC already constitutes a federation: nobody disagrees, but the WG intends to do more...

Actions in Progress

Operations Coordination - M. Girone

Kicked off on Sep. 24
  • Internal task forces are the driving force

Fortnightly meetings on the 1st and 3rd Thursday of each month

  • Daily meetings remain

Good representation of experiments and sites

  • T1 representatives: some missing, would be great to have all of them
  • EGI, EMI and OSG represented

Mailing list wlcg-operations@cern.ch will be filled with all contact names registered in GOCDB

Quarterly Operations planning: a meeting to prepare plans and propose them to MB

Task forces currently created

  • CVMFS deployment
  • glexec deployment
  • perfSONAR deployment
  • Tool evolutions
  • Squid monitoring
  • FTS3 validation and deployment
  • SHA-2/RFC proxies
  • ... see presentation

WN Environment Information - U. Schwickerath

Proposal on Twiki: almost final, feedback possible/welcome

Two environment variables are set for each job (see the sketch after this list):

  • $MACHINEFEATURES: points to a directory describing the machine environment, e.g. hs06, shutdowntime, jobslots, number of cores
    • One file per piece of information
  • $JOBFEATURES: points to a directory giving information specific to the current job (cpu/wallclock limit, disk limit, mem_limit, allocated_CPU, ...)
    • One file per piece of information
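
A minimal sketch (not part of the presentation) of how a job could consume these directories; the attribute file names follow the examples quoted above and may differ in the final specification:

<verbatim>
# Read machine/job features: each attribute is a small file inside the
# directory pointed to by $MACHINEFEATURES or $JOBFEATURES.
import os

def read_feature(base_var, name):
    """Return the value of one feature file, or None if it is not provided."""
    base = os.environ.get(base_var)
    if base is None:
        return None
    try:
        with open(os.path.join(base, name)) as f:
            return f.read().strip()
    except IOError:
        return None

hs06 = read_feature("MACHINEFEATURES", "hs06")          # HS06 rating of the machine
jobslots = read_feature("MACHINEFEATURES", "jobslots")  # number of job slots
mem_limit = read_feature("JOBFEATURES", "mem_limit")    # memory limit for this job

print("hs06:", hs06, "jobslots:", jobslots, "mem_limit:", mem_limit)
</verbatim>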

Implementation availability

  • Torque 2.5/PBS 2.3: maintained by NIKHEF, probably not implementing the latest specification
    • To be checked by Jeff
  • PBS Professional and Grid Engine: KIT, up-to-date
  • LSF: Ulrich, up-to-date, in production since last Monday
    • Several hacks to work around the fact that LSF cannot be queried directly: may require some tweaking at other sites

Discussion on validation strategy

  • ATLAS (Simone): there are things ATLAS could use straight away

HUC Support after EGI-Inspire SA3 - J. Shiers

HUC in EGI-Inspire (SA3) = HEP + Life Sciences (5 VOs) + Astrophysics + Earth Science

  • HEP = the only world-wide community, 93% of consumed CPU time
  • SA3 was a 3 year effort in a 4 year project
  • End affects more than just HEP

Main SA3 goal: transition to sustainable support

  • Identify tools that benefit several communities
    • The original plan was to make these tools part of the core infrastructure, but it was understood that it is better to have them maintained by those who use them

Achievements:

  • Non-disruptive transition from EGEE to EGI during LHC data taking
  • Continuous improvement
  • Rapid fixing of problems
  • Identification and delivery of common solutions
  • All a direct benefit of a multi-year and multi-community effort: not to be forgotten when looking at the future

Today: fewer incidents than in the past, and a lower impact of those incidents on experiment work

EGI review stated SA3 had exceeded its objectives.

IT-ES (Experiment Support) is a unique resource in WLCG, with addition of a few people from other groups (in IT and PH)

  • Largely funded by EGI-Inspire SA3
  • 10 people with advanced skills leaving end of April...

Sustainability plans for SA3: see slide 30 (taken from sustainability section of last deliverable)

  • Persistency frameworks: POOL by the experiments who use it, COOL/CORAL by CERN IT
  • Data analysis tools: long-term effort, outside SA3 scope
  • Data management tools: long-term support taken over by PH department
  • Ganga: CERN will stop its involvement a few months after SA3 end, to be picked up by other institutes that are members of the project
  • Experiment dashboard: all requested features delivered, long term support covered by CERN-Russia and CERN-India agreements

FP8: first call expected in 2013 with first funds in early 2014

  • Calls related to data management and data preservation are very likely
    • We need to decide if we want to be part of these projects: hope yes
  • Will not solve the problems directly related to experiment support nor address the gap between SA3 and these projects

Discussion with OSG/US about an IETF-like task force around data management

A few meetings on the road to building consortia to apply to future calls

  • Barcelona (October): open access and data preservation
    • Data preservation is a hot topic and probably a requirement for future projects
  • March: SA3 workshop?

Ian Bird's remarks

  • If we want to attract new money, we need to do multi-community work (attract other sciences)
    • Requires adopting standard protocols
    • Horizon is 2020: need a grand challenge at this level
  • We cannot compensate for the people we'll lose: need to prioritize what we want to continue; if we continue integration, we need to stop something else

Ian Fisk's remarks

  • Still a medium-term continuity problem between now and 2020
  • Best success of EGI was this multi-community support with a team that took years to build: will take time to recover from the loss if it happens...

Ian Bird: need prioritisation, to lose some tasks, or find other effort within the collaboration and outside CERN

  • Forthcoming meeting with the experiment computing coordinators and the IT and PH departments

Follow-up discussion with experiments at a future GDB: December?

OSG Update - B. Bockelman

Recent news
  • Alan Roy left OSG
    • Tim Cartwright took over as interim lead
    • Hiring a new person should happen in a few months after refining the position

Release cycle: 2 releases a month, one "major/significant", the other one for fixes

  • Current release: 3.1.10
  • 51% of CEs are running OSG 3 after 1 year, but 49% are still on OSG 1.2, which is on "critical security updates only"

Worked to improve the coordination with the Globus team

  • Currently applying 30-40 patches on top of Globus code: would like to lower this number
  • 4 joint projects related to GRAM
  • Now have commit rights to Globus repository
    • opportunity for European patches to Globus (contact Brian)

SHA-2 updates: OSG trying to keep the internal deadline of Jan 1, 2013

  • Has been a good exercise: beginning to see the light at the end of the tunnel
  • Known minor issue with Gratia server
  • Xrootd not yet checked and probably broken
  • Central service survey not yet complete but no major problem expected
  • OSG has access to the jglobus2 repository: lots of patches merged
    • Performance seems acceptable for the server but quite noticeably slower; clients are not heavily used (apart from dCache)
    • jglobus2 can work with SHA-1 certificates and supports legacy/pre-RFC/RFC proxies: may allow decoupling RFC and SHA-2 proxy support
    • Tested client/server backward compatibility with older jglobus: non trivial fixes required
    • Basically provides BeStMan2 and dCache with a simple path forward
      • Maarten (after the meeting): the dCache developers warn against being too optimistic here! They will investigate if the improved JGlobus-2 is sufficiently OK for the dCache deployments and use cases that we need to care about. To be continued...

Info service

  • So far neither USCMS nor USATLAS have identified any BDII usage they expect to continue on the long term
  • Recently, after discussion with O. Keeble, it was decided that FTS3 would be better off using MyOSG info services instead of the BDII
    • Used in fact only for monitoring
  • Hence, no plan to add GLUE2 to the information published into the BDII: controversial, but it matches the local requirements and effort level well
    • Impact is not clear on European side usage of BDII
  • Tiziana: GOCDB planning to rely on GLUE2 interface with XML rendering. Is it something considered by OSG?
    • To be rediscussed

Multicore support: working on implementing GRAM multicore interface

  • Aware of similar work for Condor
  • Not as used as hoped: 1/2 T2s have support enabled

WN environment variables: "watch and wait" for what happens with the proposal

CE strategy

  • Worried about being tied to one particular technology: want to pursue several directions
  • Continue to shine and polish GRAM
  • One BNL-based team looking at CREAM: EMI-2 version in OSG testing repo
  • Condor-CE also being looked at: still a few months needed before having something releasable

pre-GDB summary - W. Bhimji

Benchmarking tools: several now available, either on the application side (ROOT) or on the storage side (DPM, EOS)
  • Still have to converge on the top-10 metrics

Storage interfaces: progress on the functionality map and possible alternative to SRM

  • Very few areas where there is no alternative to SRM
  • GFAL2 is promising for solving some of the issues
  • SURL-TURL translation: redirection is required to replace SRM in a sustainable way (with hacks)

Network WG - M. Ernst

The April C-RRB report encourages continuing the implementation of intelligent data management allowing efficient and cost-efficient access to data.
  • In particular, the impact on network bandwidth should be considered

WG created to be a forum between application developers and network experts/providers

  • Such a forum did not exist so far: LHCOPN is only about infrastructure, and LHCONE and similar projects multiplied the places where a (not always coordinated) dialogue between experiments and big sites took place
  • Within LHCONE, network providers often make assumptions about what experiments are trying to do
    • Their main reference remains the "Bos/Fisk note"...
  • WLCG is primarily concerned with CPU and storage
    • Propose WG conduct a study on network requirements
  • Experiments must realize that network is not a passive pipe
    • Network providers are looking to engage us in traffic engineering
    • Dynamic circuits and other intelligent services require active participation of applications to provide additional B/W to high-priority data

WG focusing on network infrastructure and network services: aims at finding common solutions across experiments and if possible agree on a strategic direction

  • Address functional and performance issues vertically in the network stack
    • Examples were shown using ATLAS, where the network is used in brokering decisions or for better placement decisions
  • Interface between LHC community and network providers
  • Voice LHC common position in a coherent way

Network controlled by applications is the next step: Software Defined Networks

  • OpenFlow is the current solution adopted by several major players (Google, network equipment manufacturers)
  • Is LHC community just waiting or trying to participate in this development?
    • Researchers eager for dialogue now

Virtualized Network Infrastructure for Domain Specific Applications

Network abstraction is a key research topic.

Network community is eager to see LHC community involved

  • How to profit from LS1?

WG members

  • Experiment representatives nominated
  • Sites: in progress through CB first, suggest rotation to foster diversity
  • Network providers: hard to identify which body to contact...
    • Here also rotation is probably a good thing

Meeting on Intelligent Network Services at CERN on Dec. 13

Snowmass 2013 has computing and networking as an explicit topic/subtopic

  • Snowmass is tasked with charting the course of HEP in the US for the next 10 years...
  • Michael co-chair of the network subgroup

-- MichelJouvin - 15-Oct-2012
