Summary of GDB meeting, November 4, 2015 (CERN)

Agenda

https://indico.cern.ch/event/319753/

Introduction - M. Jouvin

Next GDBs: see Indico

  • Subject to modifications by the new GDB chair in 2016, depending on GDB evolution
  • The same applies to pre-GDB
  • December pre-GDB = DPM workshop

WLCG workshop in Lisbon (February 1-3): registration open, draft program available

RDA (Research Data Alliance) and DPHEP: (information received from Jamie)

  • See slides for details, in particular the SWOT analysis of RDA: next plenary (conference) at the beginning of March in Tokyo, Fall plenary in Barcelona
  • DPHEP: new Blueprint being finalized to be presented at the February workshop
    • Also discussing the possibility of a DPHEP workshop co-located with CHEP in San Francisco

Actions: see action list

  • Volunteer sites urgently needed for Machine/Job Features

Forthcoming meetings

  • EGI Community Forum, Bari, November 10-13
  • SuperComputing, Austin, November 15-20
  • DPM workshop, CERN, December 7-8
  • WLCG Workshop, Lisbon, Feb. 1-3
    • Followed by DPHEP workshop Feb. 3-4
  • HTCondor European Workshop, Barcelona, Feb. 29-March 4
    • Including ARC CE workshop
    • Indico event should be ready soon
  • ISGC 2016, Taipei, March 13-18: http://event.twgrid.org/isgc2016
    • Call for Papers extended to November 18
  • CHEP, San Francisco, October 10-14
    • WLCG workshop the weekend before (Oct. 8-9)

GDB Chair Election

2 candidates: Ian Collier and Mattias Wadenstein

Low voter participation: a problem to be understood; is it a message?

  • May need to rediscuss who votes in the GDB chair election: a vote by country may no longer be relevant...

Ian Collier elected

  • I. Bird: let I. Collier decide whether he wants Mattias as a deputy; it may make sense

WLCG Network and Transfer Metrics WG - M. Babik

Mandate

  • Ensure all relevant network and transfer metrics are identified, collected, published
  • Ensure sites and experiments better understand and fix network issues

Monthly meetings, good progress

  • Baselining: understanding slow transfers
  • Integrating network metrics with the experiments
  • Good and successful collaboration with OSG and ESnet

perfSONAR deployment has been a core activity

  • 245 active/up-to-date WLCG instances out of 278 registered
  • The WG monitors problems with the instances

Metrics made available

  • The OSG datastore gathers the metrics from all instances and makes them available through esmond (queried as sketched below)
  • CERN ActiveMQ publishes all the metrics too
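
As an illustration, the esmond datastore could be queried along these lines (a minimal sketch assuming the standard esmond perfSONAR archive REST interface; the host name, endpoints and parameters are placeholders, not the actual OSG datastore configuration):

```python
# Minimal sketch: fetch recent throughput measurements between two perfSONAR
# hosts from an esmond datastore. Host names and parameters are placeholders.
import requests

DATASTORE = "https://psds.example.org"  # placeholder for the datastore host
ARCHIVE = DATASTORE + "/esmond/perfsonar/archive/"

query = {
    "source": "ps-src.example.org",        # illustrative measurement source
    "destination": "ps-dst.example.org",   # illustrative measurement destination
    "event-type": "throughput",
    "time-range": 86400,                   # last 24 hours, in seconds
}

for metadata in requests.get(ARCHIVE, params=query, timeout=30).json():
    for event in metadata.get("event-types", []):
        if event.get("event-type") != "throughput":
            continue
        points = requests.get(DATASTORE + event["base-uri"],
                              params={"time-range": 86400}, timeout=30).json()
        for p in points:
            print(metadata["source"], "->", metadata["destination"],
                  p["ts"], p["val"], "bit/s")
```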

A prototype proximity service (based on GeoIP) handles topology-related requests about the metrics; see the sketch below
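
The proximity idea could look roughly as follows (a minimal sketch using the geoip2 library and a haversine distance; the database path, the IP addresses and the selection logic are illustrative, not the actual prototype):

```python
# Minimal sketch of a GeoIP-based proximity lookup: given a client IP and a
# list of candidate perfSONAR endpoints, return the geographically closest one.
# Database path, IP addresses and the haversine shortcut are illustrative.
import math
import geoip2.database

READER = geoip2.database.Reader("GeoLite2-City.mmdb")  # local GeoIP database

def coords(ip):
    loc = READER.city(ip).location
    return loc.latitude, loc.longitude

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def closest_endpoint(client_ip, endpoint_ips):
    client = coords(client_ip)
    return min(endpoint_ips, key=lambda ip: distance_km(client, coords(ip)))

# Example (documentation-range IPs): closest_endpoint("192.0.2.10",
#                                                     ["198.51.100.5", "203.0.113.7"])
```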

Application integration

  • FTS: performance studies
    • Augments FTS metrics with perfSONAR traceroute metrics
    • Identifies links with low transfer rate
    • Demonstrated that because of SRM overhead it was better to multiplex more transfers on a reduced number of streams rather than the opposite
  • Integration with the pilot infrastructure in ATLAS and LHCb
    • DIRAC: in addition to performance, the intent is to use perfSONAR metrics to commission links
    • ATLAS: help make better use of the network to reduce pressure on storage, in particular for dynamic/remote access. The idea is to build a source/destination cost matrix (stored in ElasticSearch) that PanDA could use as an input; see the sketch below
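
A minimal sketch of the cost-matrix idea (the sample measurements, field names and cost formula are illustrative only, not the actual ATLAS implementation, which stores the data in ElasticSearch):

```python
# Minimal sketch of a source/destination "network cost" matrix.
# Measurements, field names and the cost formula are illustrative only.
from collections import defaultdict

# (source site, destination site) -> list of recent perfSONAR samples
samples = {
    ("SITE-A", "SITE-B"): [{"throughput_mbps": 800, "packet_loss": 0.001}],
    ("SITE-A", "SITE-C"): [{"throughput_mbps": 400, "packet_loss": 0.02}],
}

def link_cost(points):
    """Lower cost = better link: penalise low throughput and packet loss."""
    avg_tp = sum(p["throughput_mbps"] for p in points) / len(points)
    avg_loss = sum(p["packet_loss"] for p in points) / len(points)
    return (1.0 / max(avg_tp, 1.0)) * (1.0 + 100.0 * avg_loss)

cost_matrix = defaultdict(dict)
for (src, dst), points in samples.items():
    cost_matrix[src][dst] = link_cost(points)

# A broker (PanDA in the ATLAS case) could then prefer the cheapest link
# when deciding where to read or transfer data from.
for src in cost_matrix:
    best = min(cost_matrix[src], key=cost_matrix[src].get)
    print(src, "-> cheapest destination:", best)
```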

A GGUS support unit was created by the WG to troubleshoot network problems

Meshes

  • 2 main meshes: LHCOPN and LHCONE
  • IPv6 mesh: for dual-stack instances and comparing performance with IPv4
  • Belle2: for Belle2 sites

Next steps

  • Progress with pilot projects in collaboration with experiments
  • Integration with high-level services like MadAlert and PuNDIT
  • Easy integration with analytics platforms

Discussion

Marian:

Simone:

  • to reduce authN/authZ overheads, could we look into using GridFTP directly where today the SRM is still used?
  • it would also allow authenticated sessions to be reused
  • some SE types (e.g. DPM) should support that through GridFTP v2
Michel:
  • dCache got there first
  • DPM 1.8.9 had a problem in that area that was fixed in 1.8.10, recently released
Mattias:
  • sessions can also be reused for the SRM, when https is used instead of httpg
    • the latter protocol includes an automatic delegation that is only relevant for srmCopy (not important to WLCG)
  • not clear which implementations beside dCache support https
Michel:
  • the FTS can choose the best protocol per use case
  • the protocol zoo should be further reduced
Mattias:
  • currently SRM is at least needed for staging from tape
Michel:
  • that need may go away when tape endpoints are separated from disk endpoints everywhere

Clouds

Cloud Resource Reporting - L. Field

TF mandate: ensure that the cloud resources can be reported and thus counted against the site pledges

Proposal made in the document published last spring

  • Build upon APEL framework
    • Publishers for main cloud MW
    • Include an HS06 normalization factor (see the sketch after this list)
    • Tools to verify that information is published and correct
  • EGI accounting portal producing aggregates
  • REBUS extension to include information from the portal
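
The HS06 normalization step could look roughly like this (a minimal sketch; the record fields and the benchmark value are illustrative, not the exact APEL cloud record schema):

```python
# Minimal sketch: normalise raw VM CPU time into HS06-hours before publishing
# to the accounting system. Field names and benchmark value are illustrative.

HS06_PER_CORE = 10.0  # site-measured HEP-SPEC06 score per (virtual) core

def normalised_record(vm_record):
    """Turn a raw cloud usage record into an HS06-weighted one."""
    cpu_hours = vm_record["cpu_seconds"] / 3600.0
    wall_hours = vm_record["wall_seconds"] / 3600.0 * vm_record["cores"]
    return {
        "VMUUID": vm_record["vm_uuid"],
        "SiteName": vm_record["site"],
        "CpuDurationHS06h": cpu_hours * HS06_PER_CORE,
        "WallDurationHS06h": wall_hours * HS06_PER_CORE,
    }

print(normalised_record({
    "vm_uuid": "0000-demo", "site": "EXAMPLE-SITE",
    "cpu_seconds": 7200, "wall_seconds": 3600, "cores": 4,
}))
```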

Most of the work is completed.

Next steps:

  • Identify sites wishing to integrate their private clouds
  • Extend REBUS reports to include the information

Recommendation: terminate the TF!

Discussion

Simone:

  • validation of the accounting data is hard: shouldn't the TF also look into that?
Laurence:
  • the TF can work with a few sites to validate their initial data
  • in production the validation must be a continuous activity to avoid decay
  • other cloud systems can also be supported, they just need to have their own publisher
Alessandro:
  • the numbers need to be compared against the client-side accounting
  • therefore the experiments need to be involved
Maarten:
  • these activities may best be handled by a new WG under Ops Coordination
Michel:
  • Need a follow-up in the coming month, else the work done so far may be useless
  • Need to identify volunteers not only among sites but also in the experiments: clouds make it more complex to match the VO view (jobs) with the site view (VMs)

Second Price Enquiry - L. Field

CERN cloud procurement activity starting in the context of Helix Nebula to support the CERN scientific programme

  • Evaluate cost and benefit of IaaS
  • First procurement: March 2015
    • 1 VO, simulation jobs, 2K cores
    • See presentation at GDB earlier this year
  • Second procurement just done: October 2015
    • multi-VO, simulation jobs, 4K cores
  • 3rd procurement planned in March 2016 as part of the HNSciCloud PCP project

Experience gained from 1st procurement

  • Client-side accounting is crucial: validation of invoices, Ganga repurposed for accounting of used resources, data analytics
    • Probably need a more tailored solution than the Ganga-based one for accounting of used resources
  • Benchmarking is also crucial to commoditize cloud resources: enable a "cloud exchange", required for procurements
    • Need for a fast benchmark

2nd procurement: a small delay in starting the real work due to GEANT delays; originally planned for October 1st, it should start soon

  • GEANT delays: procedural delays due to the new use case for GEANT
  • Open to CERN member states, 1 or several firms
  • 1K VMs of 4 vCPUs, 8GB memory, 100 GB storage
    • Input/output from/to CERN EOS
  • Several provisioning APIs accepted (all main ones)
  • Benchmarking based on ATLAS KitValidation suite
    • Open-source, easy/quick to run, reproducible (fixed seed for random generators)
    • 2 thresholds: rejection (1.5 evts/s), compensation (1.2 evts/s)
  • Service delivery coordinated by IT WLCG Resource Integration Team, daily meetings
    • Based on a "capacity manager" component, in charge of provisioning VMs for the VOs based on demand (see the sketch after this list)
    • VMs report to monitoring and data analytics that feed the alarming systems
  • Experiment readiness/use cases
    • ATLAS: running 4-core VMs in multicore mode
    • CMS/LHCb: 4-core VMs in single-core mode. For CMS, handled by HTCondor which sees 4 slots and provides 4 single-core jobs. For LHCb, 4 pilots run on the VM.
    • ALICE: traditional single-core VMs
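
A minimal sketch of the capacity manager logic referred to above (the demand numbers and the provider calls are placeholders; the real component's interfaces are not described here):

```python
# Minimal sketch of a "capacity manager" reconciliation loop: compare the
# demand per VO with the VMs currently running and provision or delete the
# difference. The provider API calls are placeholders, not a real interface.

demand = {"ATLAS": 400, "CMS": 300, "LHCb": 200, "ALICE": 100}   # wanted VMs
running = {"ATLAS": 380, "CMS": 310, "LHCb": 200, "ALICE": 90}   # current VMs

def provision(vo, n):   # placeholder for the cloud provider's create call
    print(f"provision {n} VMs for {vo}")

def terminate(vo, n):   # placeholder for the cloud provider's delete call
    print(f"terminate {n} VMs for {vo}")

for vo, wanted in demand.items():
    delta = wanted - running.get(vo, 0)
    if delta > 0:
        provision(vo, delta)
    elif delta < 0:
        terminate(vo, -delta)
```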

Discussion

Jeremy:

  • can you expand on the benefits to the experiments?
Laurence:
  • basically, what did we get for our investment? For example, the number of events per dollar
Luca:
  • can you expand on the network connection costs?
Laurence:
  • the costs are for the provider to sort out before they make their offer
  • the delay was procedural this time, as GEANT needed to assess the new use cases
  • for the 2nd price enquiry we required connectivity to GEANT with public IP addresses
  • for the 3rd, we may specify only the necessary bandwidth
    • we should also be fine with private IP addresses at some point

HNSciCloud PCP Project Status - B. Jones

Follow-up for the early CERN experience with procuring commercial cloud services

  • No longer CERN-specific
  • In the framework of H2020 as Pre-Commercial Procurement: well defined process for procuring services
    • EC funding: 70%
    • Buyers committed 1.6 M€, generating a total budget of 6 M€: manpower, applications and data, IT services, ...
  • Starting January 1st, running 2.5 years
    • Kick-off meeting at EMBL (Heidelberg), January 19-21

First phase (9 months): preparation

  • Tender preparation: requirement analysis, building stakeholder group, develop tender material
  • PCP must organise a dialog with potential suppliers to ensure that what will be requested is feasible: will be done in the context of Helix Nebula
    • Defining the requirements is the responsibility of the procurers only: suppliers have no say

Second phase (15 months): implementation

  • Competitive phase
  • 3 steps:
    • Design phase: >=3 solutions, 4 months, <= 15% tender budget
    • Prototype development: >=3 solutions, 6 months, <= 25% tender budget
    • Deployment: at least 2 solutions, 5 months, >= 60% tender budget

Third phase (6 months): pilot and evaluation

Management structure/bodies in place, most people nominated

  • Evaluation committee: in charge of defining the criteria that will be published with the tender, both administrative (exclusion criteria) and technical (award criteria, i.e. what defines the best technical proposal)
  • User board: will need to be endorsed by Collaboration Board and members will need to declare conflicts of interest

The project required a derogation from the CERN procurement rules to comply with the EC ones: agreed in September by the CERN Council

Cloud resources will be made available to user groups during the pilot phase

  • WLCG, ELIXIR, EGI Fed Cloud, local users at each procurer's site
    • WLCG: can be counted against site pledges

HNSciCloud requires WLCG to define the use cases that we want to run on these resources: avoid doing it per experiment, rather define common, generic use cases

  • Need a small group to work on this

T0 Update - H. Meinhard

Cloud resources increasing and will continue to increase

  • Doubled in the last year
  • Operating system upgrade from SL6 to CC7 on hypervisors started and expected to be done in the next 6 months
  • CPU performance: significant progress in understanding the various issues in the last months. Now have a configuration where the inefficiency is less than 5-6% (target is < 5%)
    • One problem (NUMA awareness) related to OpenStack and fixed only in the version currently being deployed (Kilo)
    • Pre-deployment tests are not sufficient
    • Different optimisations for different HW types: importance of the monitoring infrastructure
    • Details presented at the last HEPiX
  • 130K cores

Started to look at container support integration in OpenStack

  • Currently OpenStack support is mainly for LXC
  • Started to look into Magnum, the new OpenStack component providing container orchestration: it will bring support for Docker through Kubernetes

Data services: CASTOR and EOS

  • CASTOR only in Meyrin, EOS evenly distributed between Meyrin and Wigner
  • EOS: 80% of CERN disk space
  • ATLAS and CMS now writing RAW data directly into EOS and moving it to CASTOR in a second step
    • ALICE and LHCb kept the original, opposite model: LHCb accesses all data from EOS after copying it there, while ALICE does the reconstruction from CASTOR
  • EOS peak: 30 GB/s (CASTOR only half of it)

Networking

  • Reconfiguration of datacenter network planned in February to reduce the number of machines exposed to LHCOPN/LHCONE
    • For security reasons
  • 3rd 100G link between Meyrin and Wigner soon available (December or January), through the Nordic countries
    • The conjunction of traffic sometimes above 100 Gb/s and a good offer: this will allow maintaining a resilient capacity of 200 Gb/s

Batch

  • LSF upgrade from 7 to 9 in progress: master done, WNs to be done (following the usual staged/QA deployment)
  • HTCondor in production since the beginning of the week (Nov. 2): in fact no change to the already existing configuration, except that accounting is reported to the production instance
  • The 2 ARC CEs originally set up are about to be retired in favor of 2 HTCondor CEs, to reduce the number of products run
    • Where HTCondor is already used as the batch system, trading the ARC CE for the HTCondor CE reduces the amount of middleware dependencies
    • Brian Bockelman and Iain Steers decoupled the HTCondor CE from its original OSG packaging
    • Detailed presentation of HTCondor at last HEPiX
  • Progressive transition from LSF to HTCondor
    • Currently: HTCondor 4% of total batch
    • Will move WNs in steps of 500-1000 cores weekly, up to 25% (the percentage of grid submission in the total batch)
    • Further migration after Krb ticket handling is working: a prototype expected by the end of 2015. Will include training and appropriate documentation.
    • Target: no LSF-based service at the end of Run 2

MW: recent issues with FTS and ARGUS

  • Good collaboration with developers
  • ARGUS: mostly not related to ARGUS itself

Volunteer computing: CMS now using it at a high level, ATLAS continuing its already established usage

GitLab service in production, coupled with Continuous Integration: preparing to phase out the current gitolite-based service as well as SVN

  • Very fast take-up: already 1500 projects
  • gitolite: to be phased out in the short term
  • SVN: rundown targeted some time during LS2

Fall HEPiX Summary - H. Meinhard

Real workshop with honest exchange of experiences

  • Status, work done, work in progress, future plans

New web site hosted by DESY

Last workshop at BNL: 110 people, 32 affiliations, 82 contributions

Storage and file systems

  • Ceph increasingly used
  • AFS: CERN moving away from it for home directories in favor of EOS + CERNBOX

Grid & clouds

  • Preliminary work to support Docker at several sites
  • OpenStack everywhere
  • Increasing interest/activities around commercial clouds

IT facilities

  • New datacenter entering production at LBNL
  • BNL proposing a new datacenter

Computing and batch systems

  • HTCondor use increasing: active participation of HTCondor development team
  • NVMe drives: next generation SSD?
    • Bypass the SATA interface for improved performance
  • Discussion on benchmark standard

Basic IT Services

  • ELK (or variants) at many places
  • InfluxDB + Grafana: an alternative to Ganglia?
  • Puppet at many sites, Quattor still has an active community

Next meetings

  • Spring: April 18-22, DESY Zeuthen (Berlin)
  • Fall: LBNL (Berkeley), probably the week after CHEP (Oct. 17-21)
  • Several proposals to host the meeting received from Europe, NA and Asia: may revise the current cycle Spring Europe/Fall North America
    • 2017: Spring probably in Budapest (Wigner), Fall probably at KEK

CPU Accelerator Support in EGI-Engage - M. Verlato

Task part of WP4/JRA2: main partners INFN and Slovak Academy for Informatics

  • Expose the appropriate information into the information system
  • Extend the submission interface to properly handle these resources
  • Do it first for grids, then for clouds
  • Developed a GPU-enabled portal for MoBrain
    • AMBER and GROMACS applications

Work plan until May 2016

  • Identify the relevant batch system parameters related to accelerators and abstract them into JDL
    • Restarted from the work done in EGI-InSPIRE
    • New JDL attributes: GPUMode, GPUNumber (see the sketch after this list)
  • Implement needed changes into CREAM CE and BLAH
    • A first prototype deployed at CIRMMP
  • Write info providers for GLUE 2.1
    • How/whether to report free GPU resources is still being discussed
    • Relying on the current maintainers of each batch system's info provider
  • Accounting: discussion going on with APEL team
    • Torque and LSF logs/accounting don't have the GPU usage information
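
As an illustration of the new attributes, a CREAM JDL using them might look like the sketch below (the executable, sandbox files and the GPUMode value are illustrative; the exact accepted values are defined by the prototype):

```python
# Sketch: generate a CREAM JDL that requests GPU resources with the new
# GPUNumber/GPUMode attributes. Executable, sandbox files and the GPUMode
# value are illustrative placeholders.
jdl = """[
  Executable = "gpu_job.sh";
  InputSandbox = {"gpu_job.sh"};
  StdOutput = "out.txt";
  StdError = "err.txt";
  OutputSandbox = {"out.txt", "err.txt"};
  GPUNumber = 1;
  GPUMode = "cuda";
]
"""

with open("gpu_job.jdl", "w") as f:
    f.write(jdl)

# The file could then be submitted to the prototype CREAM CE with the usual
# CLI (e.g. glite-ce-job-submit); deployment details are site-specific.
```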

Several RPs are interested in participating, but they only cover LSF, SGE and SLURM: an HTCondor site would be very welcome

  • Also interested in getting access to a site providing Xeon Phi rather than GPUs

Cloud work/testbed by the Slovak Academy for Informatics

  • Based on OpenStack Kilo
  • Demonstration planned at EGI CF in Bari next week

Discussion

Michel:

  • it would be good to have HTCondor represented in the set of batch systems
