Summary of GDB meeting, November 4, 2015 (CERN)

Agenda

https://indico.cern.ch/event/319753/

Introduction - M. Jouvin

Next GDBs: see Indico

  • Subject to modifications by the new GDB chair in 2016, depending on GDB evolution
  • The same applies to pre-GDB
  • December pre-GDB = DPM workshop

WLCG workshop in Lisbon (February 1-3): registration open, draft program available

RDA (Research Data Alliance) and DPHEP: (information received from Jamie)

  • See slides for details, in particular the SWOT analysis of RDA: next plenary (conference) at the beginning of March in Tokyo, Fall plenary in Barcelona
  • DPHEP: new Blueprint being finalized to be presented at the February workshop
    • Also discussing the possibility of a DPHEP workshop co-located with CHEP in San Francisco

Actions: see action list

  • Volunteer sites urgently needed for Machine/Job Features

Forthcoming meetings

  • EGI Community Forum, Bari, November 10-13
  • SuperComputing, Austin, November 15-20
  • DPM workshop, CERN, December 7-8
  • WLCG Workshop, Lisbon, Feb. 1-3
    • Followed by DPHEP workshop Feb. 3-4
  • HTCondor European Workshop, Barcelona, Feb. 29-March 4
    • Including ARC CE workshop
    • Indico event should be ready soon
  • ISGC 2016, Taipei, March 13-18: http://event.twgrid.org/isgc2016
    • Call for Papers extended to November 18
  • CHEP, San Francisco, October 10-14
    • WLCG workshop the weekend before (Oct. 8-9)

GDB Chair Election

2 candidates: Ian Collier and Mattias Wadenstein

Low voter participation: a problem to be understood; is it a message?

  • May need to rediscuss who votes in the GDB chair election: a vote by country may no longer be relevant...

Ian Collier elected

  • I. Bird: let I. Collier decide whether he wants Mattias as a deputy; it may make sense

WLCG Network and Transfer Metrics WG - M. Babik

Mandate

  • Ensure all relevant network and transfer metrics are identified, collected, published
  • Ensure sites and experiments better understand and fix network issues

Monthly meetings, good progress

  • Baselining: understanding slow transfers
  • Integrating network metrics with the experiments
  • Good and successful collaboration with OSG and ESnet

perfSONAR deployment has been a core activity

  • 245 active/up-to-date WLCG instances out of 278 registered
  • The WG monitors problems with the instances

Metrics made available

  • The OSG datastore gathers the metrics from all instances and makes them available through esmond (queried as sketched below)
  • CERN ActiveMQ publishes all the metrics too
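
As an illustration, the esmond datastore could be queried along these lines (a minimal sketch assuming the standard esmond perfSONAR archive REST interface; the host name, endpoints and parameters are placeholders, not the actual OSG datastore configuration):

```python
# Minimal sketch: fetch recent throughput measurements between two perfSONAR
# hosts from an esmond datastore. Host names and parameters are placeholders.
import requests

DATASTORE = "https://psds.example.org"  # placeholder for the datastore host
ARCHIVE = DATASTORE + "/esmond/perfsonar/archive/"

query = {
    "source": "ps-src.example.org",        # illustrative measurement source
    "destination": "ps-dst.example.org",   # illustrative measurement destination
    "event-type": "throughput",
    "time-range": 86400,                   # last 24 hours, in seconds
}

for metadata in requests.get(ARCHIVE, params=query, timeout=30).json():
    for event in metadata.get("event-types", []):
        if event.get("event-type") != "throughput":
            continue
        points = requests.get(DATASTORE + event["base-uri"],
                              params={"time-range": 86400}, timeout=30).json()
        for p in points:
            print(metadata["source"], "->", metadata["destination"],
                  p["ts"], p["val"], "bit/s")
```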

A prototype proximity service (based on GeoIP) handles topology-related requests about the metrics; see the sketch below
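
The proximity idea could look roughly as follows (a minimal sketch using the geoip2 library and a haversine distance; the database path, the IP addresses and the selection logic are illustrative, not the actual prototype):

```python
# Minimal sketch of a GeoIP-based proximity lookup: given a client IP and a
# list of candidate perfSONAR endpoints, return the geographically closest one.
# Database path, IP addresses and the haversine shortcut are illustrative.
import math
import geoip2.database

READER = geoip2.database.Reader("GeoLite2-City.mmdb")  # local GeoIP database

def coords(ip):
    loc = READER.city(ip).location
    return loc.latitude, loc.longitude

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def closest_endpoint(client_ip, endpoint_ips):
    client = coords(client_ip)
    return min(endpoint_ips, key=lambda ip: distance_km(client, coords(ip)))

# Example (documentation-range IPs): closest_endpoint("192.0.2.10",
#                                                     ["198.51.100.5", "203.0.113.7"])
```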

Application integration

  • FTS: performance studies
    • Augments FTS metrics with perfSONAR traceroute metrics
    • Identifies links with low transfer rate
    • Demonstrated that because of SRM overhead it was better to multiplex more transfers on a reduced number of streams rather than the opposite
  • Integration with the pilot infrastructure in ATLAS and LHCb
    • DIRAC: in addition to performance, the intent is to use perfSONAR metrics to commission links
    • ATLAS: help make better use of the network to reduce pressure on storage, in particular for dynamic/remote access. The idea is to build a source/destination cost matrix (stored in ElasticSearch) that PanDA could use as an input; see the sketch below
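
A minimal sketch of the cost-matrix idea (the sample measurements, field names and cost formula are illustrative only, not the actual ATLAS implementation, which stores the data in ElasticSearch):

```python
# Minimal sketch of a source/destination "network cost" matrix.
# Measurements, field names and the cost formula are illustrative only.
from collections import defaultdict

# (source site, destination site) -> list of recent perfSONAR samples
samples = {
    ("SITE-A", "SITE-B"): [{"throughput_mbps": 800, "packet_loss": 0.001}],
    ("SITE-A", "SITE-C"): [{"throughput_mbps": 400, "packet_loss": 0.02}],
}

def link_cost(points):
    """Lower cost = better link: penalise low throughput and packet loss."""
    avg_tp = sum(p["throughput_mbps"] for p in points) / len(points)
    avg_loss = sum(p["packet_loss"] for p in points) / len(points)
    return (1.0 / max(avg_tp, 1.0)) * (1.0 + 100.0 * avg_loss)

cost_matrix = defaultdict(dict)
for (src, dst), points in samples.items():
    cost_matrix[src][dst] = link_cost(points)

# A broker (PanDA in the ATLAS case) could then prefer the cheapest link
# when deciding where to read or transfer data from.
for src in cost_matrix:
    best = min(cost_matrix[src], key=cost_matrix[src].get)
    print(src, "-> cheapest destination:", best)
```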

A GGUS support unit was created by the WG to troubleshoot network problems

Meshes

  • 2 main meshes: LHCOPN and LHCONE
  • IPv6 mesh: for dual-stack instances and comparing performance with IPv4
  • Belle2: for Belle2 sites

Next steps

  • Progress with pilot projects in collaboration with experiments
  • Integration with high-level services like MadAlert and PuNDIT
  • Easy integration with analytics platforms

Discussion

Marian:

Simone:

  • to reduce authN/authZ overheads, could we look into using GridFTP directly where today the SRM is still used?
  • it would also allow authenticated sessions to be reused
  • some SE types (e.g. DPM) should support that through GridFTP v2
Michel:
  • dCache got there first
  • DPM 1.8.9 had a problem in that area that was fixed in 1.8.10, recently released
Mattias:
  • sessions can also be reused for the SRM, when https is used instead of httpg
    • the latter protocol includes an automatic delegation that is only relevant for srmCopy (not important to WLCG)
  • not clear which implementations beside dCache support https
Michel:
  • the FTS can choose the best protocol per use case
  • the protocol zoo should be further reduced
Mattias:
  • currently SRM is at least needed for staging from tape
Michel:
  • that need may go away when tape endpoints are separated from disk endpoints everywhere

Clouds

Cloud Resource Reporting - L. Field

TF mandate: ensure that the cloud resources can be reported and thus counted against the site pledges

Proposal made in the document published last spring

  • Build upon APEL framework
    • Publishers for main cloud MW
    • Include an HS06 normalization factor (see the sketch after this list)
    • Tools to verify that information is published and correct
  • EGI accounting portal producing aggregates
  • REBUS extension to include information from the portal
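
The HS06 normalization step could look roughly like this (a minimal sketch; the record fields and the benchmark value are illustrative, not the exact APEL cloud record schema):

```python
# Minimal sketch: normalise raw VM CPU time into HS06-hours before publishing
# to the accounting system. Field names and benchmark value are illustrative.

HS06_PER_CORE = 10.0  # site-measured HEP-SPEC06 score per (virtual) core

def normalised_record(vm_record):
    """Turn a raw cloud usage record into an HS06-weighted one."""
    cpu_hours = vm_record["cpu_seconds"] / 3600.0
    wall_hours = vm_record["wall_seconds"] / 3600.0 * vm_record["cores"]
    return {
        "VMUUID": vm_record["vm_uuid"],
        "SiteName": vm_record["site"],
        "CpuDurationHS06h": cpu_hours * HS06_PER_CORE,
        "WallDurationHS06h": wall_hours * HS06_PER_CORE,
    }

print(normalised_record({
    "vm_uuid": "0000-demo", "site": "EXAMPLE-SITE",
    "cpu_seconds": 7200, "wall_seconds": 3600, "cores": 4,
}))
```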

Most of the work is completed.

Next steps:

  • Identify sites wishing to integrate their private clouds
  • Extend REBUS reports to include the information

Recommendation: terminate the TF!

Discussion

Simone:

  • validation of the accounting data is hard: shouldn't the TF also look into that?
Laurence:
  • the TF can work with a few sites to validate their initial data
  • in production the validation must be a continuous activity to avoid decay
  • other cloud systems can also be supported, they just need to have their own publisher
Alessandro:
  • the numbers need to be compared against the client-side accounting
  • therefore the experiments need to be involved
Maarten:
  • these activities may best be handled by a new WG under Ops Coordination
Michel:
  • Need a follow-up in the coming month, else the work done so far may be useless
  • Need to identify volunteers not only among sites but also in the experiments: clouds make it more complex to match the VO view (jobs) with the site view (VMs)

Second Price Enquiry - L. Field

CERN cloud procurement activity starting in the context of Helix Nebula to support the CERN scientific programme

  • Evaluate cost and benefit of IaaS
  • First procurement: March 2015
    • 1 VO, simulation jobs, 2K cores
    • See presentation at GDB earlier this year
  • Second procurement just done: October 2015
    • multi-VO, simulation jobs, 4K cores
  • 3rd procurement planned in March 2016 as part of the HNSciCloud PCP project

Experience gained from 1st procurement

  • Client-side accounting is crucial: validation of invoices, Ganga repurposed for accounting of used resources, data analytics
    • Probably need a more tailored solution than the Ganga-based one for accounting of used resources
  • Benchmarking is also crucial to commoditize cloud resources: enable a "cloud exchange", required for procurements
    • Need for a fast benchmark

2nd procurement: a small delay in starting the real work due to GEANT delays; originally planned for October 1st, it should start soon

  • GEANT delays: procedural delays due to the new use case for GEANT
  • Open to CERN member states, 1 or several firms
  • 1K VMs of 4 vCPUs, 8GB memory, 100 GB storage
    • Input/output from/to CERN EOS
  • Several provisioning APIs accepted (all main ones)
  • Benchmarking based on ATLAS KitValidation suite
    • Open-source, easy/quick to run, reproducible (fixed seed for random generators)
    • 2 thresholds: rejection (1.5 evts/s), compensation (1.2 evts/s)
  • Service delivery coordinated by IT WLCG Resource Integration Team, daily meetings
    • Based on a "capacity manager" component, in charge of provisioning VMs for the VOs based on demand (see the sketch after this list)
    • VMs report to monitoring and data analytics that feed the alarming systems
  • Experiment readiness/use cases
    • ATLAS: running 4-core VMs in multicore mode
    • CMS/LHCb: 4-core VMs in single-core mode. For CMS, handled by HTCondor which sees 4 slots and provides 4 single-core jobs. For LHCb, 4 pilots run on the VM.
    • ALICE: traditional single-core VMs
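
A minimal sketch of the capacity manager logic referred to above (the demand numbers and the provider calls are placeholders; the real component's interfaces are not described here):

```python
# Minimal sketch of a "capacity manager" reconciliation loop: compare the
# demand per VO with the VMs currently running and provision or delete the
# difference. The provider API calls are placeholders, not a real interface.

demand = {"ATLAS": 400, "CMS": 300, "LHCb": 200, "ALICE": 100}   # wanted VMs
running = {"ATLAS": 380, "CMS": 310, "LHCb": 200, "ALICE": 90}   # current VMs

def provision(vo, n):   # placeholder for the cloud provider's create call
    print(f"provision {n} VMs for {vo}")

def terminate(vo, n):   # placeholder for the cloud provider's delete call
    print(f"terminate {n} VMs for {vo}")

for vo, wanted in demand.items():
    delta = wanted - running.get(vo, 0)
    if delta > 0:
        provision(vo, delta)
    elif delta < 0:
        terminate(vo, -delta)
```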

Discussion

Jeremy:

  • can you expand on the benefits to the experiments?
Laurence:
  • basically, what did we get for our investment? For example, the number of events per dollar
Luca:
  • can you expand on the network connection costs?
Laurence:
  • the costs are for the provider to sort out before they make their offer
  • the delay was procedural this time, as GEANT needed to assess the new use cases
  • for the 2nd price enquiry we required connectivity to GEANT with public IP addresses
  • for the 3rd, we may specify only the necessary bandwidth
    • we should also be fine with private IP addresses at some point

HNSciCloud PCP Project Status - B. Jones

Follow-up for the early CERN experience with procuring commercial cloud services

  • No longer CERN-specific
  • In the framework of H2020 as Pre-Commercial Procurement: well defined process for procuring services
    • EC funding: 70%
    • Buyers committed 1.6 M€, generating a total budget of 6 M€: manpower, applications and data, IT services, ...
  • Starting January 1st, running 2.5 years
    • Kick-off meeting at EMBL (Heidelberg), January 19-21

First phase (9 months): preparation

  • Tender preparation: requirement analysis, building stakeholder group, develop tender material
  • PCP must organise a dialog with potential suppliers to ensure that what will be requested is feasible: will be done in the context of Helix Nebula
    • Defining the requirements is the responsibility of the procurers only: suppliers have no say

Second phase (15 months): implementation

  • Competitive phase
  • 3 steps:
    • Design phase: >=3 solutions, 4 months, <= 15% tender budget
    • Prototype development: >=3 solutions, 6 months, <= 25% tender budget
    • Deployment: at least 2 solutions, 5 months, >= 60% tender budget

Third phase (6 months): pilot and evaluation

Management structure/bodies in place, most people nominated

  • Evaluation committee: in charge of defining the criteria that will be published with the tender, both administrative (exclusion criteria) and technical (award criteria, i.e. what defines the best technical proposal)
  • User board: will need to be endorsed by Collaboration Board and members will need to declare conflicts of interest

The project required a derogation from the CERN procurement rules to comply with the EC ones: agreed in September by the CERN Council

Cloud resources will be made available to user groups during the pilot phase

  • WLCG, ELIXIR, EGI Fed Cloud, local users at each procurer's site
    • WLCG: can be counted against site pledges

HNSciCloud requires WLCG to define the use cases that we want to run on these resources: avoid doing it per experiment, rather define common, generic use cases

  • Need a small group to work on this

T0 Update - H. Meinhard

Cloud resources increasing and will continue to increase

  • Doubled in the last year
  • Operating system upgrade from SL6 to CC7 on hypervisors started and expected to be done in the next 6 months
  • CPU performance: significant progress in understanding the various issues in the last months. Now have a configuration where the inefficiency is less than 5-6% (target is < 5%)
    • One problem (NUMA awareness) related to OpenStack and fixed only in the version currently being deployed (Kilo)
    • Pre-deployment tests are not sufficient
    • Different optimisations for different HW types: importance of the monitoring infrastructure
    • Details presented at the last HEPiX
  • 130K cores

Started to look at container support integration in OpenStack

  • Currently OpenStack support is mainly for LXC
  • Started to look into Magnum, the new OpenStack component providing container orchestration: it will bring support for Docker through Kubernetes

Data services: CASTOR and EOS

  • CASTOR only in Meyrin, EOS evenly distributed between Meyrin and Wigner
  • EOS: 80% of CERN disk space
  • ATLAS and CMS now writing RAW data directly into EOS and moving it to CASTOR in a second step
    • ALICE and LHCb kept the original, opposite model: LHCb accesses all data from EOS after copying it there, while ALICE does the reconstruction from CASTOR
  • EOS peak: 30 GB/s (CASTOR only half of it)

Networking

  • Reconfiguration of datacenter network planned in February to reduce the number of machines exposed to LHCOPN/LHCONE
    • For security reasons
  • 3rd 100G link between Meyrin and Wigner soon available (December or January), through the Nordic countries
    • The conjunction of traffic sometimes above 100 Gb/s and a good offer: this will allow maintaining a resilient capacity of 200 Gb/s

Batch

  • LSF upgrade from 7 to 9 in progress: master done, WNs to be done (following the usual staged/QA deployment)
  • HTCondor in production since the beginning of the week (Nov. 2): in fact no change to the already existing configuration, except that accounting is reported to the production instance
  • The 2 ARC CEs originally set up are about to be retired in favor of 2 HTCondor CEs, to reduce the number of products run
    • Where HTCondor is already used as the batch system, trading the ARC CE for the HTCondor CE reduces the amount of middleware dependencies
    • Brian Bockelman and Iain Steers decoupled the HTCondor CE from its original OSG packaging
    • Detailed presentation of HTCondor at last HEPiX
  • Progressive transition from LSF to HTCondor
    • Currently: HTCondor 4% of total batch
    • Will move WNs in steps of 500-1000 cores weekly, up to 25% (the percentage of grid submission in the total batch)
    • Further migration after Krb ticket handling is working: a prototype expected by the end of 2015. Will include training and appropriate documentation.
    • Target: no LSF-based service at the end of Run 2

MW: recent issues with FTS and ARGUS

  • Good collaboration with developers
  • ARGUS: mostly not related to ARGUS itself

Volunteer computing: CMS now using it at a high level, ATLAS continuing its already established usage

GitLab service in production, coupled with Continuous Integration: preparing to phase out the current gitolite-based service as well as SVN

  • Very fast take-up: already 1500 projects
  • gitolite: to be phased out in the short term
  • SVN: rundown targeted some time during LS2

Fall HEPiX Summary - H. Meinhard

Real workshop with honest exchange of experiences

  • Status, work done, work in progress, future plans

New web site hosted by DESY

Last workshop at BNL: 110 people, 32 affiliations, 82 contributions

Storage and file systems

  • Ceph increasingly used
  • AFS: CERN moving away from it for home directories in favor of EOS + CERNBOX

Grid & clouds

  • Preliminary work to support Docker at several sites
  • OpenStack everywhere
  • Increasing interest/activities around commercial clouds

IT facilities

  • New datacenter entering production at LBNL
  • BNL proposing a new datacenter

Computing and batch systems

  • HTCondor use increasing: active participation of HTCondor development team
  • NVMe drives: next generation SSD?
    • Bypass the SATA interface for improved performance
  • Discussion on benchmark standard

Basic IT Services

  • ELK (or variants) at many places
  • InfluxDB + Grafana: an alternative to Ganglia?
  • Puppet at many sites, Quattor still has an active community

Next meetings

  • Spring: April 18-22, DESY Zeuthen (Berlin)
  • Fall: LBNL (Berkeley), probably the week after CHEP (Oct. 17-21)
  • Several proposals to host the meeting received from Europe, NA and Asia: may revise the current cycle Spring Europe/Fall North America
    • 2017: Spring probably in Budapest (Wigner), Fall probably at KEK

CPU Accelerator Support in EGI-Engage - M. Verlato

Task part of WP4/JRA2: main partners INFN and Slovak Academy for Informatics

  • Expose the appropriate information into the information system
  • Extend the submission interface to properly handle these resources
  • Do it first for grids, then for clouds
  • Developed a GPU-enabled portal for MoBrain
    • AMBER and GROMACS applications

Work plan until May 2016

  • Identify the relevant batch system parameters related to accelerators and abstract them into JDL
    • Restarted from the work done in EGI-InSPIRE
    • New JDL attributes: GPUMode, GPUNumber (see the sketch after this list)
  • Implement needed changes into CREAM CE and BLAH
    • A first prototype deployed at CIRMMP
  • Write info providers for GLUE 2.1
    • How/whether to report free GPU resources is still being discussed
    • Relying on the current maintainers of each batch system's info provider
  • Accounting: discussion going on with APEL team
    • Torque and LSF logs/accounting don't have the GPU usage information
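
As an illustration of the new attributes, a CREAM JDL using them might look like the sketch below (the executable, sandbox files and the GPUMode value are illustrative; the exact accepted values are defined by the prototype):

```python
# Sketch: generate a CREAM JDL that requests GPU resources with the new
# GPUNumber/GPUMode attributes. Executable, sandbox files and the GPUMode
# value are illustrative placeholders.
jdl = """[
  Executable = "gpu_job.sh";
  InputSandbox = {"gpu_job.sh"};
  StdOutput = "out.txt";
  StdError = "err.txt";
  OutputSandbox = {"out.txt", "err.txt"};
  GPUNumber = 1;
  GPUMode = "cuda";
]
"""

with open("gpu_job.jdl", "w") as f:
    f.write(jdl)

# The file could then be submitted to the prototype CREAM CE with the usual
# CLI (e.g. glite-ce-job-submit); deployment details are site-specific.
```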

Several RPs are interested in participating, but they only cover LSF, SGE and SLURM: an HTCondor site would be very welcome

  • Also interested in getting access to a site providing Xeon Phi rather than GPUs

Cloud work/testbed by the Slovak Academy for Informatics

  • Based on OpenStack Kilo
  • Demonstration planned at EGI CF in Bari next week

Discussion

Michel:

  • it would be good to have HTCondor represented in the set of batch systems
