Summary of October 2019 GDB, October 9th, 2019


DRAFT

Agenda

For all slides etc. see: Agenda

Introduction

Speaker: Ian Collier (Science and Technology Facilities Council STFC (GB))

slides

Noted the slide on the SOC workshop at NIKHEF the week after HEPiX.

Kubernetes (K8s) pre-GDB on 10 December – Kubernetes is widely used outside the HEP community; the aim is to see what we are doing and where we might go.

2020 WLCG/HSF Workshop (Lund): Graeme A Stewart (CERN)

From several good offers, Lund was selected.

  • 11-15 May 2020
  • Monday lunchtime to Thursday lunchtime
  • Easy travel from Copenhagen airport.

Benchmarking pre-GDB report: Domenico Giordano (CERN)

slides

Summary: the full HEP Benchmarks chain is in place and is being used by beta testers.

Noted ongoing work for 1Q20: validation, integration of remaining workloads, consolidation of workload reports, HEP Score and benchmark suite implementations, comparison of Docker vs Singularity HEP scores, running at scale, comparing the performance of benchmarks with standard jobs, and comparisons with HS06 and SPEC 2017.

  • Q: How long does it take to run? Hours, but this can be reduced by reducing the number of events. Workloads can also be adjusted to suit, e.g. run once rather than three times.
  • Obs: DUNE is tracking what is being done and may turn up with a workload to be included / be usable.
  • Q: Why HEP in the name? Not fixed; the infrastructure can be used to run any suitable benchmarks.
  • Q: When will the framework be presented as a framework for other uses? The HEP workload layer is the only HEP-specific layer; the rest can be used anywhere. A more generic, simple name is needed. Discussions on the naming of the benchmark (score) and of the framework for wider use without tying it to the HEP brand. (See the illustrative sketch below.)
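
For context on how the suite combines the individual workloads into one number, here is a minimal, purely illustrative sketch assuming a geometric-mean aggregation of per-workload throughputs normalised to a reference machine; the workload names and figures are invented, and the real HEP Benchmarks suite defines its own configuration and reporting.

    # Illustrative only: invented workloads and numbers, assuming a geometric-mean aggregation.
    from math import prod

    reference = {"sim": 0.52, "digi-reco": 1.10, "gen": 3.40}  # events/s on a hypothetical reference host
    measured  = {"sim": 0.81, "digi-reco": 1.65, "gen": 5.10}  # events/s on the machine under test

    ratios = [measured[w] / reference[w] for w in reference]   # per-workload speed-up vs the reference
    score = prod(ratios) ** (1.0 / len(ratios))                # geometric mean of the normalised throughputs
    print(f"Aggregate score relative to reference: {score:.2f}")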

Benchmarking/Cost Modelling Report: Markus Schulz (CERN)

slides

Summary:

  • By measuring, for each HPC system and each workload, an individual “HS06”-equivalent value, pledges can be based on the throughput those systems provide for the experiments.
  • A second metric is desirable to assess the fraction of the potential of the pledged resource that is actually exploited (Realised Potential); see the worked sketch after this list.
  • But overheads…
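
To make the two metrics concrete, here is a minimal worked sketch with entirely hypothetical numbers showing how an HS06-equivalent value for an HPC allocation and a Realised Potential fraction might be computed; the working group's actual definitions may differ.

    # Hypothetical numbers throughout; illustrates the idea, not the working group's exact definitions.
    events_per_hour_on_hpc   = 1.8e6   # measured workload throughput on the HPC allocation
    events_per_hour_per_hs06 = 120.0   # the same workload's throughput per HS06 on reference grid hardware

    hs06_equivalent = events_per_hour_on_hpc / events_per_hour_per_hs06
    print(f"HS06-equivalent of the allocation: {hs06_equivalent:.0f}")

    # Realised Potential: fraction of the potential capacity that is actually exploited.
    delivered_hs06_hours = 9.0e7
    potential_hs06_hours = hs06_equivalent * 24 * 365   # if the allocation were usable all year
    print(f"Realised Potential: {delivered_hs06_hours / potential_hs06_hours:.1%}")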

Cost and Performance Modelling Working Group.

Notes on recent progress. Resource needs estimation – the experiments have code-based machinery to extrapolate their resource needs up to Run 4 scenarios, with almost the same input parameter set. Application performance studies – measuring the effects of memory restrictions and bandwidth, and of network bandwidth and latency. Detailed profiling of CMS workloads using Trident. Compiler options. Cache simulation studies. Data access and popularity studies (PIC).

  • Discussion of the use of the word pledge vs allocation. Markus is working on how to value the pledges; perhaps a concept of credit.
  • Ian: Is much of the technical work moving elsewhere? A synthesis is needed to pull it together. Markus: perhaps we need to reinvent things and do a gap analysis to identify what's missing.

CRIC project evolution: Panos Paparrigopoulos (CERN)

slides

A framework providing a centralised way to describe which resources are provided to the LHC and how they are used. It distinguishes between resources provided by sites and resources used by experiments. Walk-through of the CRIC tools: privilege request/approval (Q: sync with GOCDB? Yes, but quick for non-GOCDB resources), live table editing, an accounting data validation interface, a report generation interface, and REBUS functionality. Update on current status for the experiments: CMS is in full production, with a production-grade WLCG deployment since the summer; for ATLAS the migration from AGIS has started, with completion planned by January 2020; for Rucio/DOMA, testing how to provide Rucio configuration via CRIC. Future plans: replacing REBUS for topology and pledges; finalising accounting validation and report generation; supporting user info for Rucio. Longer-term plans: complete the migration from AGIS to CRIC, and assist CMS with Rucio configuration when they are ready. Notes on the support model: a collaboration between CERN IT, WLCG and the experiments.
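
As a flavour of how downstream tools typically consume CRIC, here is a minimal sketch of querying a CRIC instance's JSON API with Python; the endpoint path below is a placeholder rather than a documented CRIC route, so the instance's own API documentation should be consulted for the real one.

    # Illustrative only: the endpoint path is a placeholder, not a guaranteed CRIC API route.
    import requests

    CRIC_BASE = "https://wlcg-cric.cern.ch"          # example instance
    ENDPOINT  = "/api/core/federation/query/?json"   # hypothetical query path; check the API docs

    resp = requests.get(CRIC_BASE + ENDPOINT, timeout=30)
    resp.raise_for_status()
    data = resp.json()                               # assuming a JSON object keyed by resource name
    print(f"Retrieved {len(data)} entries")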

Lunch

European HTCondor Week report: Helge Meinhard (CERN)

slides

Described the complementary workshops to the HTCondor Weeks in Madison, WI. Noted the various precursors in the past, with the current series culminating in the workshop at the JRC (24-27 September, half a day more than previously), which additionally covered HTCondor-CE this time. Hosted by the EU Joint Research Centre (JRC) in Ispra, Lombardy, Italy, very close to Lago Maggiore; previously a EURATOM research site, it is now a multi-disciplinary centre for policy research supporting EC policy decisions, with a number of interesting HTCondor use cases of its own. 61 attendees from various places, dominated by academia and research – more participation from the commercial side is needed – and not dominated by physics in general or HEP in particular. 38 presentations from 22 different speakers, 19 hours of talks; the HTCondor team gave 11 hours 20 minutes, with three people accounting for about 11 hours. Comprehensive coverage of HTCondor and HTCondor-CE; the talk on the philosophy and architecture of HTCondor was recommended. Summarised the user presentations. Office hours gave booked time for one-on-one discussions with the HTCondor team and more experienced users. There was time for site visits – the ESSOR nuclear reactor, the Smart Grid Interoperability Laboratory, and ELSA, the European Laboratory for Structural Assessment – giving good insight into the work of the JRC. Workshop feedback was very good; the HTCondor team report more and better feedback from the European HTCondor events than from HTCondor Week. The next workshop will probably be in the second half of September 2020, 3.5 days, no site decided yet, offers welcome.

HPC update: Maria Girone (CERN)

slides

Very much work in progress. Motivation: HL-LHC needs resources, both hardware and software, and we are looking at ways to close the resource gap, particularly for storage. Compounding the gap problem is the slowing technology evolution of the standard processors used so far, so alternative technologies such as GPUs are being looked at. The HEP community doesn't drive CPU design: Intel dominates, AMD offers a smaller feature size with higher core counts, and IBM has the POWER series with early adoption of PCIe 4. GPU improvement remains on track to double performance every 18 months. There is an effort to offload parts of reconstruction workflows to GPUs, and ML frameworks benefit greatly from GPUs. Reviewed HPC centre investments – growth by a factor of 20 on the HL-LHC timescale, with 90% of HPC performance coming from GPUs. The two largest machines are Summit and Sierra, based on IBM POWER9 with NVIDIA GPUs. We need to engage: beginning to work with HPC centres is essential for HEP – DOE and NSF in the USA, PRACE and EuroHPC in Europe. Reviewed some regional HPC community coordination and some direct collaborations. Code portability and sustainability are needed, via portability libraries such as Kokkos, Alpaka, SYCL, etc. Use of HPC is challenging for HEP – notes on some of the many challenges, including authN/authZ, data access, policies, and applications not designed to exploit HPC-style resources. A document summarising these issues is available (see link in slides), based on documents prepared by the experiments and aimed at helping engagement with HPC sites.

Security/Trust Follow up on FNAL meetings: Hannah Short (CERN)

slides

Lots of meetings surrounding the GDB in September at Fermilab, including

  • pre-GDB
  • mini-FIM4R
  • AuthZ F2F, etc.

Noted key outcomes on WLCG token handling, various challenges for DUNE at Fermilab, and information sharing between labs and experiments. Listed some topics for continuation – combined assurance, the trust fabric of OIDC/OAuth, token flows, OAuth challenges, and additional attribute collection. Slides give more details on these areas. Discussions will continue in various areas.
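
Since much of the token discussion turns on what claims a WLCG OIDC/OAuth token actually carries, here is a minimal sketch of decoding (without verifying) the payload of a JWT access token to inspect its claims; the claim names in the usage comment are examples only, and real tokens must of course be signature-verified before being trusted.

    # Illustrative only: decodes a JWT payload for inspection; it does NOT verify the signature.
    import base64, json

    def jwt_claims(token: str) -> dict:
        """Return the (unverified) claims of a JWT bearer token."""
        payload = token.split(".")[1]
        payload += "=" * (-len(payload) % 4)          # restore stripped base64 padding
        return json.loads(base64.urlsafe_b64decode(payload))

    # Example usage with a token obtained elsewhere:
    # claims = jwt_claims(access_token)
    # print(claims.get("iss"), claims.get("sub"), claims.get("scope"), claims.get("wlcg.groups"))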

Discussion: one big issue appears to be that the DoE in the US wants access to nationality data to allow them to restrict what certain nationalities may be provided with.

Need to raise the issue with the Management Board.

Site storage survey - results, conclusions, action: Oliver Keeble (CERN)

slides

Interim report on the QoS site survey from the past few months. Results are on the TWiki, with some analyses too. 80 sites responded to 8 questions. No major surprises.

  • Q1 – Underlying media. Most using enterprise media. CERN experience of consumer drives is inconclusive. Isolated experimentation with alternative types.
  • Q2 – Media combinations. RAID6 with 12-16 disks at over 2/3 of sites; the rest do JBOD with an additional layer such as Ceph, EOS, HDFS or GPFS. Where do we go from there? Abandoning redundancy would give back ~15% of capacity (see the arithmetic sketch after this list).
  • Q3 – Storage systems. Few surprises in the grid storage systems. The underlying resources are more diverse – locally mounted file systems, shared file systems (Ceph, Lustre), block storage (Ceph), and deeper integration (HDFS, Ceph). Almost all also offer POSIX access.
  • Q4 - Effort. Inconclusive – no clear mapping of efficiency onto system, difficult to quantify effort.
  • Q5 – Storageless sites. Most T2s who responded are not planning to, and do not want to, move to storageless setups. T1s indicate uncertainty about the impact of storageless sites on their systems.
  • Q6 – Non-WLCG communities. Practically all T1s are already sharing resources – no major problems identified; the LHC experiments tend to be on isolated instances.
  • Q7 – Further directions. Ceph is cited in numerous contexts. No strong signal on site directions regarding cost saving. Many sites have no exploratory activities; others are concerned about breaching the MoU.
  • Q8 – Experiment workflows. Concern about the granularity of matching workflows to resources – QoS classes. Concern about the cost and disruption of WAN access to custodial sites. Desire for better tools provided by the experiments to the sites.
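
The ~15% figure quoted under Q2 follows directly from the parity overhead of RAID6 at those array sizes; a quick arithmetic sketch:

    # RAID6 dedicates 2 drives per array to parity, so the capacity given up is 2/N.
    for n_disks in (12, 14, 16):
        print(f"RAID6 over {n_disks} disks: {2 / n_disks:.1%} of raw capacity used for parity")
    # 12 disks -> 16.7%, 16 disks -> 12.5%; hence abandoning redundancy would give back roughly 15%.
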
What now? Review of site-level activities: procurement, densification and media issues, and software-defined storage. Possible site- and grid-level activities and actions were reviewed. Sites are invited to participate in the actions; grid-level actions will be driven directly by the WG.

GDB Future: Ian Bird (CERN)

Finding the next chair of the GDB.

  • As of this discussion: 3 nominees from Europe and 4 from the US. Pursue the idea of overlapping co-chairs, each serving at least a two-year term, out of phase with each other.
  • Vote at next meeting
  • Send names to the list, with a one-paragraph / three-bullet-point ‘manifesto’; anyone on the list can vote.
  • Need to check up on whether the nominees are ready, willing and able.
  • Mechanics of how co-chairs would work to be discussed.

Helge noted the GDB MoU has details of the voting processes.

-- IanCollier - 2019-11-19
