Summary of GDB meeting, May 13, 2015 (CERN)

Agenda

https://indico.cern.ch/event/319747/

Introduction - M. Jouvin

WLCG workshop

  • Volunteers for hosting the next WLCG workshop should get in touch with Michel or the WLCG Operations Team
  • Dates not yet decided but probably around the end of the year

BELLE2: computing infrastructure very similar to WLCG's; requested during the last WLCG workshop to become an observer within WLCG. The first answer was that they are welcome to participate in the GDB.

    • BELLE2 computing coordinator will be at CERN in June

Next GDBs and pre-GDBs

  • October GDB: HEPiX finally agreed on a co-organized grid/cloud day during HEPiX at BNL.
    • An occasion to foster links between GDB and HEPiX
    • Sites encouraged to participate, in particular from OSG: a good occasion to increase North American participation in both GDB and HEPiX
  • June pre-GDB: no plan has materialized yet; still the possibility of a TF meeting (Traceability WG, ARGUS...)
  • Send Michel suggestions for topics if you have any

Actions in progress:

  • Machine/Job Features: Lack of early adopters. Jeremy Coles reports that there is a lack of documentation for implementing this. Michel suggests getting in touch with Stefan to understand what needs to be done (see the sketch after this list for what an implementation exposes).
  • Multicore accounting: John Gordon will send an update to the mailing list about the remaining sites not yet publishing the core information.
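
Regarding the documentation gap mentioned above: in the Machine/Job Features proposal, $MACHINEFEATURES and $JOBFEATURES point to directories where each key is a file holding a single value. A minimal consumer sketch; the key names hs06 and wall_limit_secs follow the draft specification and may differ between implementations:

    # Sketch of a Machine/Job Features consumer on a worker node.
    # Assumes the draft MJF layout: $MACHINEFEATURES and $JOBFEATURES are
    # directories containing one file per key, each with a single value.
    import os

    def read_feature(dirvar, key, default=None):
        """Return the value of one MJF key, or default if unavailable."""
        base = os.environ.get(dirvar)
        if not base:
            return default
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except IOError:
            return default

    hs06 = read_feature('MACHINEFEATURES', 'hs06')         # machine power
    wall = read_feature('JOBFEATURES', 'wall_limit_secs')  # job wall limit
    print('HS06: %s, wall limit (s): %s' % (hs06, wall))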

Middleware: OS version support - P. Solagna

SL5: 100 entries out of 600 (15%) in the BDII...

  • A good fraction is disk servers... but not only
  • EGI monitoring services just migrating to EL6

Support for EL7 in UMD more or less requires decommissioning EL5 support

  • Mainly a manpower issue

13 products ready for EPEL7 (see slides for list)

  • But some PTs are waiting for a clear requirement from users and stakeholders: in particular CREAM and ARGUS

NGI feedback: a few sites want to keep some SL5 services and some others would like to deploy EL7 WNs...

  • Specific problem: disk server reinstallation. Not necessarily straightforward: a HW upgrade may be required by newer OS versions

Cloud site requirements also need to be taken into account: EL OSes are only 60%, the others Debian-based

  • Ubuntu is the number 1 non-EL OS in EGI

EGI planning a new major UMD release: UMD-4

  • EPEL7/EL7 products
  • EL6 products only when there is a backward compatibility issue with UMD-3
  • UMD-3: only product fixes for EL6, security updates for EL5
  • Proposed timeline
    • 1st release of UMD-4: September 2015
    • SL5 decommissioning: March 31, 2016

UMD/EMI repositories

  • EMI repos decommissioning planned for Sept. 2015
  • UMD-preview: monthly updates from the product teams, no quality control in addition to validation done by PT
  • EMI third-party repo will be provided by an AppDB (YUM-based) repo
  • Sites encouraged to use the new official UMD repos

Discussion

  • WLCG consensus: we need all PTs to deliver an EL7 version ASAP, with the expectation of having the full stack available around the summer
    • PTs should not wait for a request from experiments saying they want it tomorrow...
  • Peter: EGI also wants to see EL7 ready soon to allow sites still running EL5 services to migrate directly to EL7 and avoid an intermediate upgrade
  • Michel: need to begin discussing with experiments a possible timeline for EL7 WN deployment in WLCG
    • Stress the importance of having an EL7 WN available soon: experiments need it to be available in order to test it, identify and fix problems

WLCG Workshop Conclusions - A. Forti

Run2 and beyond: not too early to begin looking at HL-LHC challenges

Blurring site boundaries: increases flexibility in experiment frameworks but also brings some difficulties

  • T1s have different SLAs: not just about services
  • FAs may not buy (fund at the same level) a more chaotic model

Need to exploit whatever resources are available: active work in all LHC experiments

  • Even if grid/pledged resources will probably remain the core of the infrastructure

Reconstruction challenge with very high pile-up: common work started between experiments

  • ATLAS and CMS began to work on common algorithms and code
  • Duplication is not affordable: FAs will not fund it

Federated storage in all experiments

  • ATLAS + CMS + ALICE : Xrootd
  • LHCb : Gaudi for input files, Http/WebDav Dynamic Federation for consistency checks
  • Concentrating on failover and diskless sites
    • Need better understanding of possible issues before going further

Storage protocols: agreement that the zoo should be reduced

  • All experiments planning to abandon SRM for disk access
  • Sites prefer using industry standards rather than HEP-specific solutions, e.g. http

The simple T2 concept is attractive, but currently there is little that can be removed from a site...

  • First step: concentrate on simpler services like the ARC CE
  • Small sites with little dedicated manpower may be better off looking at "cloudification", including the Vac approach

Monitoring: still fighting with discrepancies between the site and experiment views

  • Access to experiment monitoring pages by site people who do not belong to the VO: idea of using the DNs registered in GOCDB as the source of people entitled to access them

Conclusions

  • Budget reduction affecting not only computing resources but also manpower
  • Medium term: facilitate communication and information exchange, reduce and simplify services
  • HL-LHC scale: more common solutions at all levels to match application/processing workflows to available resources

NDGF Update - E. Wadenstein

External review in progress on the efficiency of the current NDGF distributed model

  • About to hire a person to carry it out

Storage protocols: ATLAS experimented with SRM+https

  • Side effect: went over IPv6, as most NDGF sites are dual-stacked
  • SRM still considered the protocol of choice for negotiation/redirection
  • http a bit more efficient than gridftp for small files: smaller protocol overhead
  • But http uploads are more "dangerous" for dCache nodes in case of errors: with no control channel to report errors, an upload can succeed from the http point of view yet not be properly registered in the dCache namespace (a client-side cross-check is sketched below)
    • Some improvements done recently in this area
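
One possible client-side mitigation is to cross-check the stored file after the PUT. A minimal sketch, assuming a WebDAV door that honours the RFC 3230 Want-Digest header with ADLER32; the endpoint is a placeholder, and authentication (normally an X.509 proxy) is omitted:

    # Sketch: verify an HTTP upload by comparing a local ADLER32 checksum
    # with the one the server reports (RFC 3230 Digest header).
    import zlib
    import requests

    url = 'https://dcache.example.org:2880/data/myfile'  # placeholder endpoint

    with open('myfile', 'rb') as f:
        data = f.read()
    local_sum = '%08x' % (zlib.adler32(data) & 0xffffffff)

    requests.put(url, data=data).raise_for_status()

    # A 2xx status alone does not prove proper registration; ask the server
    # for its checksum of the stored file and compare.
    digest = requests.head(url, headers={'Want-Digest': 'adler32'}) \
                     .headers.get('Digest', '')  # e.g. 'adler32=03da0195'
    if digest.split('=', 1)[-1].lower() != local_sum:
        raise RuntimeError('checksum mismatch or missing: clean up and retry')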

New interface to TSM (ENDIT) using the native dCache HSM provider: production version expected in a few weeks

  • Presentation planned at the dCache workshop next week

IPv6: dCache pools at the last remaining site are being dual-stacked (a quick dual-stack check is sketched after this list)

  • https will use it but not gridftp
  • Enabling IPv6 for storage is doable: it can easily be backed out. But probably not before LS2.
  • Probably not a problem for CEs
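
As a quick illustration, whether a service host is dual-stacked can be checked by looking for both A and AAAA records; the hostname below is a placeholder, not a real NDGF host:

    # Sketch: check whether a host resolves over both IPv4 and IPv6.
    import socket

    host = 'dcache-door.example.org'  # placeholder hostname
    families = {info[0] for info in socket.getaddrinfo(host, None)}
    print('IPv4:', socket.AF_INET in families)
    print('IPv6:', socket.AF_INET6 in families)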

Discussion

  • It would be important to have NDGF participating actively in the HTTP Deployment TF of the OpsCoord, as the problems mentioned seem important input: need to understand whether this is an implementation problem or something more fundamental, and what the possible solutions are.

OpsCoord Report - M. Dimou

MW

  • StoRM 1.11.8 installed everywhere
  • dCache: several new versions made available in the last 2 months; also end of support for 2.6.x
  • ARGUS PAP 1.6.4 fixing a blocking issue with Java upgrade
  • Torque: avoid upgrading to the latest EPEL version (v4) as it requires configuration changes
  • New FTS 3.2.33 with better REST API performance and memory usage

Tier 0

  • HTCondor pilot open for grid submission: currently ATLAS and CMS
    • Other experiments: contact T0 managers if interested
  • Investigating possible causes for the wrongly low WLCG availability figures in March

Experiments

  • ALICE: High activity, VOBOXes moved to RFC proxies, using opportunistic resources, CASTOR instabilities
  • ATLAS: Data taking ongoing, increased job length to improve CPU/WC ratio (avoid setup time), sites requested to provide multicore resources
  • CMS: work with cosmics and MC, experienced network saturation during busy weeks
  • LHCb: stopped writing to LFC on May 11th

TF/WG

  • gLexec: testing in PanDA ongoing (61 out of 94 sites)
  • SHA2 migration: old VOMS server aliases were removed
  • RFC proxies: only ATLAS and LHCb not fully ready (assessed) yet
    • SAM proxy renewal may require a fix but may also be fixed by the UMD3 upgrade of the infrastructure
  • Machine/Job Features: Need more sites for validating the existing implementations
  • Multicore: testing the passing of memory parameters to the batch system
  • IPv6: a lot of progress in the last months
    • FTS3 testbed, several new sites with dual-stacked services, DIRAC readiness assessment (one issue identified)
  • HTTP deployment: kickoff meeting
  • MW Readiness: ATLAS and CMS active (some updates may be needed on their twikis when referring to baseline versions), LHCb and ALICE still invited to join...
    • Pakiti client for MW Package reporter available to volunteer sites
    • Working on automating the baseline table update
  • Network and Transfers: performance follow-up procedure in place, with instructions on how to properly run the perfSONAR toolkit.
    • FTS performance study now integrated into the WG plan
    • new GGUS support unit for network performance issues

OpsCoord working on a workplan for addressing the conclusions of the WLCG site survey

  • Possible areas for improvement: reduction and simplification of services, documentation quality and accessibility, information exchange with the community, adoption of common solutions and industry standards

Summary of pre-GDB on batch systems - M. Jouvin

Good participation from sites

HTCondor: significant take-up in Europe but also some sites choosing/having chosen SLURM or UGE

  • Still early adopters: not a completely off-the-shelf solution yet, but Puppet and Quattor configurations are available
  • Python APIs: the preferred way to interact with HTCondor from configuration/monitoring tools (see the sketch after this list)
  • BDII issues
    • CREAM: no GIP plugin available, only an obsolete one from gLite 3.2 plus patches from RAL. Exact status to be checked with EGI: some work was supposed to be done.
    • ARC: working support for publication into the BDII, except that VOViews are not published. Need to understand whether this is acceptable for VOs.
  • multicore easy to configure
  • memory limitation: agreement from all batch systems that cgroups is the way to go
    • Handles multicore jobs well
    • Well supported by HTCondor
    • Some known issues, linked to kernel bugs: they tend to be fixed quickly in EL7 and generally backported by Red Hat to EL6 kernels
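
To illustrate the Python-bindings route mentioned above, a minimal monitoring sketch using the official htcondor module; BASE_CGROUP and CGROUP_MEMORY_LIMIT_POLICY are HTCondor configuration parameters, and the query assumes a reachable local schedd:

    # Sketch: inspect cgroup-related configuration and list running jobs
    # through the HTCondor Python bindings. Requires access to a schedd.
    import htcondor

    # These two knobs control cgroup-based resource enforcement.
    for knob in ('BASE_CGROUP', 'CGROUP_MEMORY_LIMIT_POLICY'):
        try:
            print('%s = %s' % (knob, htcondor.param[knob]))
        except KeyError:
            print('%s is not set' % knob)

    # Running jobs (JobStatus == 2) with their resident memory, e.g. as
    # input for a simple site monitoring probe.
    schedd = htcondor.Schedd()
    for ad in schedd.query('JobStatus == 2',
                           ['ClusterId', 'ProcId', 'ResidentSetSize']):
        print(ad.get('ClusterId'), ad.get('ProcId'), ad.get('ResidentSetSize'))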

CEs: possible alternatives to CREAM

  • ARC CE: used by most sites running HTCondor, all happy with it
    • Puppet and quattor configuration available
    • Main issues related to accounting: problems in the JURA component (direct publication); publication through APEL being worked on by GRIF/Irfu.
  • HTCondorCE: new-generation CE at OSG, no support for APEL and unclear support for the BDII. Too early to be a real candidate but needs to be monitored: attractive as it could allow reducing the number of products operated at a site.

Accounting:

  • Accounting validation of new CEs and batch systems very important
  • HTCondor does not do CPU time normalisation by itself: 2 different approaches proposed by CERN and RAL, usable by other sites (the arithmetic involved is sketched after this list)
    • Agreement to share: review the role of each one later
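
For illustration, whatever the approach, the normalisation reduces to scaling the raw CPU and wall-clock times by the benchmarked per-core power of the execution host. A minimal sketch with made-up numbers; the HS06 figure and variable names are illustrative, not taken from either the CERN or the RAL tool:

    # Sketch of the arithmetic behind accounting normalisation for HTCondor.
    # A site would take hs06_per_core from its own benchmark of the worker
    # node and the times from the batch job history; values here are made up.
    hs06_per_core = 10.5      # assumed benchmark of the execution host
    raw_cpu_secs = 72000.0    # total CPU time summed over all cores
    raw_wall_secs = 10000.0   # wall-clock duration of the job
    cores = 8                 # slots allocated to the (multicore) job

    normalised_cpu = raw_cpu_secs * hs06_per_core            # HS06.s
    normalised_wall = raw_wall_secs * cores * hs06_per_core  # HS06.s
    print('CPU: %.0f HS06.s, wall: %.0f HS06.s'
          % (normalised_cpu, normalised_wall))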

Discussion

  • Maria mentions that HTCondor CEs are indeed published in the BDII.
    • There is an info provider released since February and this is needed by ATLAS to populate AGIS.
    • Michel mentions that it is not clear that this really covers the HTCondor CE: it may be just HTCondor as a batch system, despite what the information in the BDII suggests.
    • All this needs to be carefully reviewed and presented at a future meeting, including the real need for VOViews

HNSciCloud - B. Jones

PCP project: very different from a usual European project

  • Project submitted mid-April: feedback from EC expected next Sept.
  • PCP details: see slides

HNSciCloud coming from several previous experiences and projects

  • HelixNebula: a PPP project aiming at linking commercial resources with public resources through GEANT
    • Procurement model needs to evolve to include clouds and not only HW
    • Procurement of commercial cloud resources by CERN for a small fraction of ATLAS simulation
    • Several valid proposals received, ATOS awarded (Tenerife datacenter, connected through GEANT)
  • PICSE: started last October, a coordination action to survey the commercial cloud market

Towards the European Open Science Cloud paper: http://go.egi.eu/osc

  • Hybrid: linking public organizations, e-Infras and commercial resources
  • Trust: research organisations want to keep control of the cloud and their data
    • Data preservation
    • Insulate from technology/supplier changes
  • Economy: must be cheaper than building dedicated infrastructures

Proposed joint PCP by several partners

  • 1.6 M€ put in by the 7 partners + manpower
    • Project total budget with matching funding from the EU: 5 M€
  • Procure innovative cloud services: need to integrate with network, federated ID Mgmt, ...
  • 3 phases
    • Preparation (6 months): analyse requirements, prepare the call for tender, partners reconfirm their commitments
    • Implementation (18 months): receive bidders' answers, select >= 3 bidders and ask these contractors to build a prototype (functional/security tests), then select >= 2 contractors to build a pilot (scalability tests). 60% of the implementation budget goes to the pilot phase. Pilot phase open to users.
    • Sharing (6 months)
    • CERN is the lead procurer, every partner has a different management responsibility
  • Users eligible to use these resources are the WLCG, ELIXIR and EGI FedCloud communities, plus local users at each buyer's site
    • Exact share between the different communities decided by the buyers, but resources cannot be allocated only to local users

Next steps

  • CERN will perform another limited scale procurement in preparation for the PCP: to be handled inside HelixNebula
  • If the project is funded, during the fall: prepare the EC contract, adapt CERN procurement rules, refine plans
  • Project expected to start in January 2016

An important challenge: support of data-intensive workloads

  • Will require investigation: one of the innovative topics in the project

Discussion

  • Maarten: will CERN have to support random users?
    • Bob explains each partner will be using the resources proportionally to the money they have invested and will decide which communities have access to their share of the procured resources.
    • "Local users" applies only to the partners other than CERN, ELIXIR and EGI. These 3 partners are supporting communities that are not "local" to them and they will have the right to open their share only to these communities.
    • Giving access to a community doesn't specify how this access is provided.
  • Oliver: will tenders explicitly include storage and, if yes, where will it be discussed how storage should be delivered?
    • Bob: sure, this will be the case, we have no other option. The specification of the requirements is ongoing work, and not all partners have exactly the same requirements.
    • Also, in terms of networking, only bids that can connect to the GEANT network will be accepted: more requirements on provider connectivity may have to be added...
  • Michel: what about Europe/North America cooperation on this, given similar projects on the other side of the Atlantic (presented at CHEP), big cloud players in the US...?
    • Bob: Amazon or Microsoft could participate in this project as they have data centers in Europe. To be seen... but several European cloud players, not as big but more dedicated to the enterprise market, are interested in our market, as proven by the HelixNebula experience (ATOS, T-Systems...).
    • This project is seen as very interesting because it is open and everyone is invited to participate: it contributes to creating a marketplace.