Summary of June GDB, June 13, 2018


Agenda

https://indico.cern.ch/event/651354/

Introduction - I. Collier

presentation

  • Main topics
  • July GDB is delayed by one week to avoid CHEP – so 18th July.
  • No meeting in August.

Oracle contract 2018-2023

  • Moved to CAMPUS license model in 2013 (non-perpetual 10,000 users)
  • Started the move to perpetual processor licenses in 2018 (about 300), while continuing the CAMPUS license with fewer Oracle products, plus free cloud services.
  • Now: aggressive DB consolidation; use hardware with fewer cores; optimise.
  • Reduce Oracle technology dependency.

Questions:

  • What is a perpetual license? – Pay once for the processor license, and pay further only if maintenance is required.
  • For APEX, many groups have moved. Recommend that people look at other technologies and not rely only on Oracle products.
  • What is the dialogue with T1s and T2s? The Campus License usage will be checked and we will then look at how this can be accommodated.

dCache Workshop report and future developments (Tigran)

presentation

  • Took place at the end of May. T1 & T2 sites plus non-LHC.
  • “Lot of new cool features”. Tigran takes over from Patrick as the lead.
  • 6 people at DESY, 2 at Fermilab, and a few others.
  • Topics: Simple installation; support for non-LHC; event driven; REST-API; Data lakes.
  • A number of development areas: High-availability; scale-out; data protection….
  • Data management “in your pocket”: running services in Docker containers (e.g. FTS and Rucio…)
  • Function as a Service (e.g. OpenWhisk).
  • Storage events – to control experiment workflows (see the minimal sketch after this list).
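
To illustrate the storage-events idea mentioned above, here is a minimal sketch of a client that listens to an event stream from the storage system and triggers a workflow step when a new file appears. The endpoint URL and event fields are assumptions for illustration only, not dCache's actual API.

    # Minimal sketch: react to storage events instead of polling the namespace.
    # The endpoint URL and the event payload fields are hypothetical.
    import json
    import requests

    EVENTS_URL = "https://storage.example.org/api/events"  # hypothetical SSE endpoint

    with requests.get(EVENTS_URL, stream=True, timeout=60) as resp:
        for line in resp.iter_lines(decode_unicode=True):
            if not line or not line.startswith("data:"):
                continue  # ignore SSE comments and keep-alives
            event = json.loads(line[len("data:"):].strip())
            if event.get("action") == "file_created":
                # hand the new file to the experiment workflow (placeholder action)
                print("trigger processing for", event.get("path"))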

Questions:

  • On slide 20: FTS can talk to anything at the sites, not just dCache. Is the main driving issue that the community does not have anyone to run FTS for them? Just note that CERN does run an open FTS instance that any community can request to use.

WLCG Archival Storage Group (Oliver Keeble)

presentation

  • Site survey and “making the most of tape”
  • Survey aimed at establishing best practice.
  • Queue, protocol, ….
  • Draft recommendations: submit recalls as far in advance as possible; keep the queue as full as possible… but there are other factors to consider (see the illustrative sketch after this list).
  • Discussion points: what payoff can we expect; should we make recommendations on writing strategy?; lots of diversity, so what strategy should we adopt?
  • Need experiments to join the conversation.
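
To make the “submit recalls early, keep the queue full” advice concrete, here is a minimal sketch (purely illustrative, not a recommendation of the group) of batching queued recall requests by tape so that each mount serves as many files as possible. File and tape names are made up.

    # Illustrative only: group pending recall requests by tape so a mounted tape
    # is read once for all of its requested files, instead of being remounted per file.
    from collections import defaultdict

    recall_requests = [
        ("file_a.root", "TAPE01"),
        ("file_b.root", "TAPE02"),
        ("file_c.root", "TAPE01"),
        ("file_d.root", "TAPE01"),
    ]

    by_tape = defaultdict(list)
    for filename, tape in recall_requests:
        by_tape[tape].append(filename)

    # Issue one bulk recall per tape, largest batch first, to keep the drives busy.
    for tape, files in sorted(by_tape.items(), key=lambda kv: -len(kv[1])):
        print(f"mount {tape}: recall {len(files)} files {files}")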

Questions:

  • Prioritisation? It should not be there: it is a tape system, a robot etc., so you cannot guarantee what will happen.
  • If there was a proper interface we could do the needed development.
  • The experiments need to understand the costs.

“Tape/Data Carousel” (Vladimir)

presentation

  • ATLAS have a 3 phase plan – testing at BNL.
  • Expectation vs reality: tape seen as 3-4x cheaper than disk. Put more on tape. But requires changes to workload and storage systems.
  • Tape infrastructure – about 1 drive per 400 tapes. Tape recall is inefficient: tape-to-disk transfers run well below native drive rates. Drives are often shared by VOs (see the back-of-the-envelope sketch after this list).
  • Do not wait until the last byte arrives to start processing!
  • Discussion points. Temporal collocation of data does not hold over time, more tape drives needed for sustained recall…
  • Monitoring: http://cern.ch/go/bTt6
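
A rough back-of-the-envelope sketch of why sustained recalls need many drives, as flagged in the list above. All rates and capacities below are assumptions for illustration, not figures from the talk.

    # Back-of-the-envelope: how much data sits behind one drive at ~1 drive per 400 tapes,
    # and how long a single drive would need to stream it out. All numbers are assumed.
    tapes_per_drive = 400            # quoted ratio from the talk
    tape_capacity_tb = 8.0           # assumed cartridge capacity (TB)
    native_rate_mb_s = 300.0         # assumed native drive rate (MB/s)
    effective_rate_mb_s = 0.5 * native_rate_mb_s  # recalls often run well below native rate

    data_behind_drive_tb = tapes_per_drive * tape_capacity_tb
    seconds = data_behind_drive_tb * 1e6 / effective_rate_mb_s  # TB -> MB, then divide by MB/s
    print(f"{data_behind_drive_tb:.0f} TB behind each drive")
    print(f"~{seconds / 86400:.0f} days to stream it all at {effective_rate_mb_s:.0f} MB/s")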

Questions:

  • Temporal collocation. Experiments measure data popularity. Can foresee what will be hot or cold in next month. Would this help?
  • What you are saying is that hot and cold data change. Typical of what is happening now: recalling recent data, written together, gives good throughput, but campaigns that dig into cold data will flatten the system. It takes months to reshuffle data on tapes, so the infrastructure cannot follow.
  • ML: Ideally it should not matter as 5 years ago the data were packed temporally. Something sounds wrong with the 1600 tapes example being discussed (CMS recall).
  • No conclusion. Is it an anomaly or should we plan the infrastructure for this type of read out?
  • It may require some changes to the experiment data management approach that takes account of future uses.
  • We need to discover what causes data to be recalled across a large set of tapes. Or did the campaign try to combine too many things into one? It should be part of the WG mandate to look into this issue.
  • T1 (Matthias): very rare to have these large recalls.
  • ATLAS: This is R&D to understand the possible cost reductions. We know we can store more efficiently. Can learn from other communities. Can do better bulk retrievals. ATLAS is in a good position given the closeness between its workload management and data management systems. Working with various T1s.
  • GS: Awareness of data sets is important – that’s how we process. Want all files in a dataset on same tape.
  • Alessandro: We got results from BNL yesterday. Tried to recall 33 AOD datasets (not RAW); 200k files were stuck. We are working to run the workflow as soon as sufficient files are available and then chase the tails. ATLAS has started writing a document on a strategy in this area. So far we do not see a need to worry about the robots; there are lots of software layers (configuration) to improve.
  • BNL allocated more tape drives than usual for those ATLAS tests. Why? To test scaling with tape drives. Would be useful to also do the tests with the actual number of drives that are typically available. Already learning about optimisations.
  • The way data is written is optimised for writing speed, so one cannot expect a whole dataset on the same tape. It is a trade-off.

Summary pre-GDB – CPU Benchmarking WG

presentation

  • 5 main subjects and 30 people attended.
  • System performance and cost modeling WG. Good progress. Reference workloads shared. Trident tool looks interesting – comparing HS06 and some HEP workloads.
  • Studying SPEC CPU 2017 (SC17). How urgent is it to move from HS06?
  • Update on “passive” benchmarking (precision around 15%) – comparing with production job data.
  • Making progress towards HS17 (compiler flags and configurations)
  • Created script to trigger SC17.
  • Very high correlation between SC17 and HS06 (64-bit): correlation coefficient of mean values 0.950 so far (see the sketch after this list).
  • SC17 is quite complicated so could use a sub-set.
  • Updates on vulnerabilities and mitigation impacts. The Meltdown/Spectre impact on VM performance is less than 2%.
  • GPUs: looking at benchmark candidates. No significant GPU performance difference between VMs and bare metal.
  • HPC: changing conditions leads to changing performance.
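
As a reminder of what the quoted coefficient measures, here is a minimal sketch of computing a Pearson correlation between per-machine mean HS06 and SC17 scores. The arrays are invented placeholders, not the WG's data, which gave ~0.950.

    # Minimal sketch: Pearson correlation of mean benchmark scores per machine type.
    # The values below are invented placeholders.
    import numpy as np

    hs06_means = np.array([9.8, 11.2, 13.5, 15.1, 17.9, 21.4])
    sc17_means = np.array([1.10, 1.27, 1.49, 1.72, 1.98, 2.41])

    r = np.corrcoef(hs06_means, sc17_means)[0, 1]
    print(f"correlation coefficient: {r:.3f}")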

Questions

  • What actions do we take from these conclusions, in particular slide 17? It is interesting to know the correlation between SC17 and HS06. Is your suggestion something we should report to the MB, i.e. that sites are happy to keep using HS06 over the next 2-3 years? Can we define the actions? A suite of HEP-specific workloads would be very useful. There is also a third area about closing the loop with accounting.
  • Agree on the message about SC17 vs HS06. We started using HS06 in 2009, which shows it takes 2-3 years to understand a suite. The message also needs to include a call for new applications (more multi-threaded etc.) to compare with SC17.
  • Is that the conclusion of the WG? That we will need 3 years to come up with a better suite?
  • No, the message is that without the new workloads little will change. Some speedups will happen at times because the actual jobs run on less utilised machines – the systematics of passive benchmarking need to be better understood on a site-by-site basis.
  • LHCb and ALICE have shown there is not a good correlation between experiment workloads and HS06.
  • We need to work on the benchmark suite. But we can only start considering it when the experiment applications are available.
  • The discrepancy – they obtain workloads from all experiments. There are tools like Trident that should be used to look at the figures for ALICE and LHCb.

Security: Romain Wartel (CERN), David Crooks (University of Glasgow)

Romain Wartel: Future Authorisation WG report.

presentation

  • What is the WG doing: define requirements for AuthZ. Why? The evolving identity landscape; how to shield users from X.509 complexities.
  • What input needed: Requirements, solution design, token schema
    • Requirements: areas include VO membership management, … Need input NOW.
    • Solution design: more friendly. Proposed design. Token Translation at heart of system. Various questions outstanding. Two possible solutions: EGI Check-in and COManage, and INDIGO IAM. Additional integration required in some areas.
    • Token schema: hot topic, strong views. Services are increasingly using token-based AuthN/Z (OIDC, OAuth2); we need to get on the train. WLCG should define one or more token schemas to be issued by VOs: ID token, AuthZ token, groups/roles (see the illustrative sketch after this list).
  • Next steps: July pre-GDB.
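
Purely as an illustration of the kind of token schema under discussion, here is a minimal sketch of a VO-issued token payload carrying identity and group/role claims. The claim names beyond the standard JWT ones (iss, sub, aud, exp) are assumptions, not the schema the WG will adopt.

    # Hypothetical token payload; not the WLCG schema, which the WG is still defining.
    import json
    import time

    token_payload = {
        "iss": "https://vo.example.org",          # issuing VO (hypothetical)
        "sub": "user-1234",                       # opaque subject identifier
        "aud": "https://storage.example.org",     # intended service
        "exp": int(time.time()) + 3600,           # one-hour lifetime
        "groups": ["/vo", "/vo/production"],      # group/role memberships (assumed claim name)
    }
    print(json.dumps(token_payload, indent=2))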

Side comment from Romain: very contentious topic, lots of solutions. Need to listen carefully to the various proposals.

Q (Coles): How do we decide between token schemas? It should be one that works for all schemas/services needed by WLCG.

Q: Is the WG the arena to reach a solution? Yes.

Q: Tigran: Consider commercial schemas/solutions? Not aiming to need custom clients. The work is really about converging on a schema that encompasses the field that’s out there.

Ian: Come to the pre-GDB and/or join the WG to contribute if you have any concerns at all.

David Crooks: A word for site security teams

presentation

The early grid was isolated – a new software stack, new hardware etc. – and tended to sit in DMZs. 15 years on, grid communities are very well organised: >100 intrusions in 10 years, none related to grid computing; most were due to stolen credentials. We still face traditional threats. We should be sharing security operations: need to build trust, networks, details on intrusions etc. Integrate the indicators of compromise from grid, clouds etc. directly with the SOC.

Aim to make Grid computation yet another IT service from a security PoV.

Ultimate goal: increased collaboration between grid and site security teams.

Many challenges:

  • sites are different politically, technically, structurally. Need incentives for security teams.
  • Security handled differently at sites.
  • Forums and mechanisms: CDG, trust groups, the CSIRT community, WLCG WGs (e.g. SOC)
  • Grid security infrastructure.

Next steps: seed the discussion: leverage existing trust groups, identify areas to expand, identify existing/upcoming work in this area at particular sites. Any concrete intermediate steps?

SOC WS at CERN 27-29 July – 2.5 days. Encourage involvement from as many sites as possible. It would be helpful to register by the 16th if you intend to attend in person.

Question for the room: How is it handled at your site?

Tigran: Are there sites where the teams are separate? Yes: at CERN it is done as one team, at others it is separate. Romain: we struggle to consume threat indicators – grid and site security teams are not always on the same page. Need to share. The issues that cause problems are different every time.

Ian: Increase trust. A document won’t necessarily solve the problem.

David: Need collaboration for technical options to recommend for sites so that they can pick the one most appropriate. Sharing of threat intelligence more tricky – needs the trust level, sometimes at an interpersonal level. Longer term project. Hoping to use WG as a forum to explore this.

Romain: Sharing >5k threat indicators, but no one site can handle all of them, so some do not get followed up. Few sites are doing much with the data.

Maarten: Sites are ready to take the next steps whereas universities are not? Romain: maybe, but commercial sites and government are much more advanced in consuming and using intelligence. Europe is good at sharing information but bad at taking action; in the US it’s the other way round (perhaps due to federal mandates).

Maarten: Need to broadcast successes to entice sites to participate. David: need to check on the state of current sharing and encourage it over time using success stories etc.

Romain: MISP is directly driving security operations at CERN. Collaboration of this sort is vital, otherwise you’re on your own.

Coles: any MISP instances in other parts of academic community? Romain: yes. But not many other large communities as well organised as HEP.

Crooks: This is what we have and how can we contribute…

Ian: Maybe offer to contribute to campus security because we are better organised to get ahead.

Maarten: Deployment campaigns: start with T1s? Can this be done? Possible, but can be expensive.

Crooks: Need to look at site diversity.

Romain: don’t just look at academia: important to exploit collaboration with non-academic private-sector communities.

Crooks: offer what we have, exploit links.

Ian to Matthias: an unusual T1? Yes. Crooks: the WG wants to help with widely different sites. Discussion on Nordic T1 structures and problems (Ian, Matt, David). Would benefit greatly from collaboration and threat sharing.

Note that some sites have no grid security office – they rely on the campus/site security office, which may not even be aware of WLCG resources.

Romain: Why would any T1 not want to benefit from free threat information that commercial entities pay for?

Security Service Challenge: Sven Gabriel (Nikhef/EGI)

Want to run a challenge against the DIRAC framework, affecting LHCb.

  • Aim to improve collaboration between the various security elements and sites etc., and familiarise people with incident response procedures: trace, contain and investigate; prepare partial and final reports. Assess IR capabilities, in particular user management (banning at the site, VO, CA etc. level).
  • How: valid credentials are used to run policy-violating activities via the VO job management system. Implement sensors for site incident response actions – which have to be hidden from the sites (hard). This is an assessment, not a pentest; warning is being given. Looking for responses and the timing of responses. The exercise will run for about a week.
  • Reviewed structure of test.
  • Schedule: not all blocks ready yet, tentative test date 16-21 July.
  • Infrastructure, checklists, monitoring etc.
  • ToDo list:
    • various IR procedures needed, as are docs for site admin and VO.
    • Reporting: defining performance metrics etc
    • Generating SSC support wiki
  • Next steps: announcements, communication challenge, actual challenge.
    • Communications need sorting out ahead of challenge proper.
    • Ticket…
    • Infect sites: time constraints etc.

Questions:

Maarten: Christophe will not be allowed to be a defender in this game.

Ian: Interesting to see that this is a more involved challenge this time, reflects the increasing complexity of the VO frameworks. We await the results with interest.

HTCondor Week report: Helge Meinhard (CERN), James Adams (STFC RAL)

presentation

Report on HTCondor week.

Quick note on HTCondor: the batch system of choice for a lot of WLCG.

  • Monday: tutorials; Tuesday / Wednesday: Developers, Site admins; Thursday: local science w/HTCondor
  • 102 registered participants from lots of different communities.
  • Highlights that James and Helge found interesting (including but not only)
    • Support for SciTokens (uses Kerberos framework).
    • Support for Docker and Singularity
    • Support for IPv6
    • Direct submission into Slurm coming
    • Jupyter notebooks
    • Better support for GPUs
    • Bypass scheduler to run jobs immediately
    • Pre-emptable jobs while nodes drain
    • Annexes for cloud bursting
    • ‘Facilitators’ to hunt down possible users at UW and teach them how to use it.
  • Next European WS will be at RAL, Tue 4th – 7th Sept.
    • “Office Hours”…
    • Registration open soon.

No questions.

DOMA update: Maria Girone (CERN), Simone Campana (CERN)

presentation

Next steps from Napoli:

Create a Data Organisation, Management and Access (DOMA) evolution project – keep track of developments in all areas, provide a forum for discussion, foster interoperability, act as an umbrella for stakeholders

Stakeholders: experiments, middleware developers

News from LHCC: start a review process of the WLCG strategy for the HL-LHC

Facilities and storage consolidation: different storage consolidation models: geographically distributed instances; fewer storage endpoints and a larger number of processing facilities, including HPCs and clouds. Need to avoid loss of efficiency and hardware.

DOMA activities on Storage services:

  • Define and implement QoS in WLCG storage services
  • Share experiences with HW resiliency solutions
  • Implement standardised tools for storage management
  • GDB and HEPiX obvious places for discussing progress.

DOMA and Protocols:

  • SRM-free data management
  • Xrootd as the primary option for GridFTP replacement and data access
  • HTTP access an important solution for non-HEP apps
  • Proposal: WLCG operations takes the lead for protocol commissioning / decommissioning – presents the plan, organises the work. Agreed with Keeble et al.

DOMA and Data Access: data locality is not guaranteed. Client-level caching, site-level caching. Areas of study: providing hints about file re-use to the cache, pre-filling site caches. Need to leverage all ongoing work in the various WG studies. Caching and latency-hiding is a key element of the strategy at many centres, including CERN. (A minimal caching sketch follows.)
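
A minimal sketch of the “re-use hints” idea above: a cache that evicts least-recently-used files but protects files the workflow system has hinted it will read again soon. The names and the hint interface are assumptions for illustration, not any DOMA deliverable.

    # Illustrative only: an LRU cache that honours re-use hints from the workflow system.
    from collections import OrderedDict

    class HintedCache:
        def __init__(self, capacity):
            self.capacity = capacity      # maximum number of cached files
            self.files = OrderedDict()    # cached files, oldest first
            self.hinted = set()           # files flagged for imminent re-use

        def hint_reuse(self, name):
            self.hinted.add(name)         # workflow system: "keep this one around"

        def access(self, name):
            """Return 'hit' or 'miss', evicting non-hinted files first when full."""
            if name in self.files:
                self.files.move_to_end(name)   # refresh LRU position
                return "hit"
            self.files[name] = True            # cache miss: fetch and store
            while len(self.files) > self.capacity:
                victim = next((f for f in self.files if f not in self.hinted),
                              next(iter(self.files)))  # prefer non-hinted, else oldest
                del self.files[victim]
            return "miss"

    cache = HintedCache(capacity=2)
    cache.hint_reuse("calib.root")
    for f in ["calib.root", "evt1.root", "evt2.root", "calib.root"]:
        print(f, cache.access(f))   # the hinted calibration file survives eviction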

DOMA and Workflows: Integrate the concept of QoS and Caching into workflows

  • QoS
  • Caching strategies
  • High I/O facilities
  • Tightly integrated with the WLCG Performance and Cost Model WG.

Upcoming work:

  • Next meeting end July
  • AAI and DOMA
  • Networks (understand the network implications)
  • Other communities outside HEP
  • Planning for regular monthly meetings, hopefully in conjunction with pre-GDB slots (September is open!)

Tigran: need to use documented features, not reverse-engineer facilities.

Oliver: WRT consuming storage events – they need better treatment than just reporting.

Oliver: It looks as if Xrootd has been chosen as the GridFTP replacement? Will we compare the options?

Ian Bird: No: it is a reflection of reality at the moment. The point is to find the best solution from discussions within the DOMA framework.

Oliver: DPM has caching component, would be good to see DPM on the list.

Wrap up: Ian

  • Evangelise the AuthZ WG!
  • Next meeting July.

-- IanCollier - 2018-07-05
