Summary of May GDB, May 9, 2018


Agenda

https://indico.cern.ch/event/651353/

Introduction - I. Collier

presentation

  • NorduGrid meeting is focused on development direction, aimed towards active direction of ARC
  • Where is the DOMA meeting? At CERN, a room for 40-50 people on site, plus designed for US attendance via scheduling in the afternoon + Vidyo access

Integrating Dynafed into the ATLAS and Belle II data and workload management systems - F. Berghaus, R. Sobie

presentation

  • Issue - machines run on any cloud, but directly connecting to dCache/EOS limits I/O
  • CloudScheduler running in prod for 6 years, 8th birthday in a couple of days
  • Dynafed redirects
  • Belle-2 is "RO" in the sense that endpoint is RW, but jobs only read
  • Dynafed+Belle-2 on behalf of Marcus Ebert at UVic
  • 6 clouds in use can use single Dynafed
  • Minio? Easy to set up, but not all clouds support S3/Swift...
  • DIRAC proxy used for auth
  • rucio supports gfal2
  • use Dynafed as standard SE
  • implement protocol in rucio to not worry about directories
  • Dynafed caches directory creation
  • rucio protocol to directly upload new filename
  • Maarten: MD5 is better than ADLER32, with less chance of collisions (though there are other reasons for using ADLER32); ALICE uses MD5
  • Maarten: Feasible to try either one, see which one works?
  • Rucio, or other frameworks, could use one immediately and the other where needed? (see the checksum sketch after this list)
  • Frank: Have a Google Doc defining what's needed to implement this procedure. Entirely possible, but needs work to implement the logic. The ask is to ignore [checksums] for now to get numbers on whether it is worth implementing
  • Simone: As well as legacy requirements (only ADLER32 for older files), new files have both. Q: Does Dynafed have internal replication QoS?
  • Frank: No, Dynafed is only redirector, not data management. Have next steps...
  • Simone: Checksums: with replication, need to worry about grid-to-grid, cloud-to-grid, etc.
  • Frank: Part of requirements in various implementations also needs to go into gfal2 etc
  • Fabrizio: From my perspective it is not only two kinds of checksum; the difficulty is that the checksum is read in some cases and written in others
  • Simone: If one is up to the clients vs up to the provider...
  • Fabrizio: Idea to stop checking in short term is the best
  • Frank: Oliver Keeble has done tests on how well cloud storage guarantees data consistency; at our scale we do need to worry about it
  • Automated replication: a dCache issue; when that was disabled, this worked
  • Note on deployment strategy: a small patch takes a month to deploy through the different stages
  • Can make things tricky (an observation)
  • Monitoring: ELK stack at TRIUMF
  • next MC: 85GB in total before, 9TB for next one
  • more tricky to replicate across storage
  • manage "by popularity"
  • was 5GB/file, now 2GB/file
  • ATLAS functional tests after checksum patch
  • Streaming data results - does the https/davs requirement for auth have an effect because of encryption?
  • Q: Dynafed auto-directs to the closest storage for write - what if it is full?
  • Frank: Use python plugin, in design
  • Frank: query each endpoint for free space
  • Maarten: obviously running out of space is a reason for failure. Need retry mechanism in case of other problems?
  • Frank: One thing Dynafed does is status checks on endpoints. That ought to show up functional problems. Retry? Dynafed is only a redirector
  • Maarten: Often not presented with a list of endpoints. If the first fails, retry. Better to have a retry?
  • Fabrizio: Dynafed probes endpoints every 10-15 seconds. If one is nonfunctional, it looks for the [next] one that is up? There are also features of the copy program to try again?
  • Maarten: If everything is behind Dynafed, can you ask for next best?
  • Frank: Metalink, get ordered links for read for next best. Could use something similar for write?
  • Ian: [More discussion] beyond scope for here, more than just your activity, good to see how this develops.
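
As a rough illustration of the ADLER32 vs MD5 point in the checksum discussion above, a minimal standard-library Python sketch that computes both digests in a single pass over a file; the function name and block size are arbitrary choices, and the zero-padded hex form for ADLER32 is only the convention commonly used by grid tools, not anything mandated by the discussion:

    import hashlib
    import zlib

    def checksums(path, blocksize=1 << 20):
        """Compute ADLER32 and MD5 of a file in a single pass."""
        adler = 1                          # zlib's ADLER32 seed value
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(blocksize), b""):
                adler = zlib.adler32(chunk, adler)
                md5.update(chunk)
        return "%08x" % (adler & 0xffffffff), md5.hexdigest()

Computing both digests in one pass adds no extra I/O, so the choice discussed above is mainly collision resistance (MD5) versus compatibility with catalogues that only hold ADLER32 for older files.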

A framework for setting up Lightweight Sites in WLCG - M. Sharma

presentation

  • BOINC shows potential of small sites
  • [slide 9] Centrally configured means at site level
  • [slide 17] Sample implementations of configuration management via link
  • [slide 20/21] Step 5: final deployment
  • Other lightweight ideas on roadmap following v1.0

Lightweight Sites WG proposal - A. McNab

presentation

  • See discussion below

Lightweight Sites Discussion

  • Alessandra: have WLCG Ops Coordination for this purpose?
  • Andrew: This is more focussed on a particular set of models
  • Maarten: Ops Coord is mostly concerned with big sites
  • Alessandra: Not intended that way
  • Maarten: But it is in effect. Entities for development, add an interest group, no-brainer
  • IanC: In a sense, Ops smooths what we've agreed to do; an opportunity to change agreements
  • Andrew: (strategic)
  • Alessandra: GridPP a few weeks ago proposed everything in the previous talk; it was rejected (hackathon)
  • Maarten: More completely, nudge in better direction. Ops coord: near future. This has bit of R&D. Hangs off GDB. All activities interconnected, lightweight initiatives under single umbrella
  • Ian: It does occur: we've had the beginnings of other related activities - HPC? Slightly floundered during broadening into provisioning. Remain separate?
  • Andrew: Different kinds of sites
  • Alessandra: HPC different kind of site
  • Matthias: HPC relevant for lightweight sites
  • Ian: How do we... one problem with HPC is that every machine is different. Possibilities: 1) HPCs sign up to provide consistency, 2) it is up to us to provide a consistent approach
    • pull all use cases together
    • not necessarily the most lightweight, but more applicable?
  • Maarten: Orthogonal. Cannot have homogeneous grid. One site vs another. Matrix, pick one that applies. HPC try for more homogeneity. Unlikely to be universal.
  • Q: CREAM CE in a container? Rids me of a classic service
  • Frank: By design common technologies (containers) specific challenges. HPC different network context.
  • Andrew: HPC restrictive, large resources. Lightweight half a person, more freedom
  • Randall: Belle2 would benefit from this
  • Alessandra: Requests for such sites come very often, sites labelled T1s
  • Simone: First problem: how do you respond if a site shows up with spare capacity? Large sites with reduced manpower will have continuity. For new sites we don't have a recommendation. We have recommendations for "I have this"
  • Alessandra: By now have some recs
  • Maarten: Avoid saying running a grid site is a piece of cake.
    • Walk fine line between things
    • Takes a lot of expertise to solve all the problems they bring
    • ALICE, promise of growth but doesn't happen
    • Our central experts have to worry less about debugging a particular site
  • Simone: A full spectrum of possibilities
  • Andrew: Quality of Service
  • Maarten: ATLAS has accumulated [recommendations]: if your site is in this category, aim for a particular kind of service. In ALICE it is not that kind of service.
  • BOINC in ALICE is a no-no because of security considerations
  • Documented enough to make educated choices
  • Alessandra: Not a straightforward thing to be added; even if they use BOINC, they still have to go through a few days of configuration - T3s, T2s
  • IanC: Recognise that this has distinct use, should be set up, Andrew to pursue. Think about how it relates to other provisioning later

ACTION: Andrew McNab to pursue the setting up of Lightweight Sites WG

CNAF Incident - G. Bagliesi, I. Glushkov, L. dell'Agnello, M. Litmaath, S. Roiser

presentations: INFN, LHCb, ATLAS, ALICE, CMS

  • CNAF: the each-Friday communication could have involved central operations
  • Ian Bird: Reiterate that the CNAF team were open and clear, explaining at every level - [this] was very good. Don't overreact: serious service since 2003, 2 major incidents - this one plus the fire at ASGC. More difficult to handle during data taking? Tape at CERN, plus a delay of 3 months. Almost no risk. Very little point in investing to avoid 10% loss of access for a couple of months. Careful not to overreact, but some lessons to be learnt.

  • Latchezar: Handled beautifully, not a single aspect to comment on. Data distribution? Losses compensated. Consolidated? If you have a few big sites and one goes down, it's a bigger problem. What if you lost 25-30%?
  • Ian B: Don't imagine that with consolidation there would be fewer data stores. For CMS, Fermilab would be a problem; CERN for everyone.

  • Latchezar: Imagine Italy consolidated to one site, which then went down
  • Ian B: But still only temporary
  • Luca: CNAF: Italian consolidation for smaller problems, issues because they rely on CNAF
  • Maarten: Find middle of the road. Maybe not optimal so far, don't overcorrect. Data management landscape easier through consolidation in various ways. Impressive, endorses some design decisions of the past. Look back at this in a period of time.

OSG Update

presentation

  • IU Grid Operations Center (GOC). Cybersecurity team not affected
  • grid.iu.edu -> opensciencegrid.org
  • CILogon 2 allows use of COManage, c.f. EGI Check In
  • Retirement of OSG CA
  • Brian: ... clear indication that funding agencies don't want to support independent project CA
  • Maarten: not quite done deal, still have discussions, what to change and where
  • Romain: As discussed with other additional key stakeholders, the deployment of the Let’s Encrypt (LE) CA on WLCG resources brings significant risks to our trust model as a global community.
  • Brian: Splitting between WLCG/non-WLCG resources
  • Romain: How do you plan to support non-WLCG OSG users at non-OSG sites (for example in Europe), do they need to deploy the Letsencrypt CA?
  • Brian: Most OSG VOs that overlap with the EU are at universities.
  • Stefan: I'm astonished, IGTF are deeply involved. What are the implications for CERN, other sites? Would like to see this discussed. Don't know all the implications. Representing CERN Security am not endorsing this.
  • Brian: Through travel it is hard to get to email chains; F2F, Vidyo - would like it if we can have some telecons. Aggressively contacting sites weekly, soon biweekly, about host certificates, to renew as soon as possible, which gives a 13-month window (a minimal expiry-check sketch follows this list)
  • Romain: I recognise that the OSG is in a tough area, I suggest you approach InCommon urgently to find a solution.
  • Brian: Not just InCommon, other CA might be willing to offer a different model
  • Romain: InCommon is an important IGTF partner, surprised they have not been more involved, might like Jim to make a statement. Why is InCommon not stepping forward? They are in the same community.
  • Brian: Wide majority of sites will be using InCommon. Talking about small handful of T3 sites.
  • Romain: All users directed to use CERN, has been discussed?
  • Maarten: Already Vos can have CERN user certs, so no news there. Online CA, probably easiest for users.
  • Brian: The number of OSG user certs has dwindled recently. Non-WLCG VOs tend not to use X.509/PKI; see the ALICE token-based approach
  • Maarten: The slide says that OSG CA is going to disappear, is that reversible?
  • Brian: CA itself is 2 pieces. CILogon + Registration Authority. CILogon is not going anywhere. Cost of moving RA to different site, only extending funding for that? More expensive than solving IGTF problems at individual sites.
  • Jeff Templon: Cost?
  • Brian: InCommon - the website says cost depends on the university. We pay 15k/yr. All US T2 universities are already paying the fee. That leaves T3s, and again most sites are already covered, which leaves a small handful.
  • Romain: Did you get a quote from InCommon?
  • Brian: Yes. Numbers for T1s came back much nicer than 2014.
  • Romain: Cost for a solution tailored to OSG, not based on previous quotes for other organizations?
  • Brian: No. The fact that we have coverage for WLCG sites not sure they'd want to cover the others [I think]
  • Stefan: 20k/year an official quote? Maximum? Peak at 5,10,20k?
  • Brian: InCommon CA website, depending on the classification of the university. R7/R8 down to 2k, R1 15k (members) - 20k (non-members)
  • Jeff: Analysis about tier layers, problem for T3 sites, sets the scale of the problem. Guess $10k/yr, probably better than that. Recommend not accepting developments costing more than $10k/yr to save $10k/yr. The costs of a Let's Encrypt change will be more than that. Security leaks everywhere; one weak point troubles the entire infrastructure. Threats leak. Before using Let's Encrypt, get all the security guys in one place.
  • Brian: As community, discuss our use of host certificates. Level of assurance are set more appropriately for user certs.
  • Romain: You're pursuing a cost based approach without having a quote from InCommon
  • Ian Bird: Brian implied relaxing the level of assurance for users?
  • Brian: No, feel that user assurance is appropriate, discuss host assurance.
  • IanC: As opposed to assurance for hosts at centre of services?
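
As a rough illustration of the host-certificate renewal point referenced above, a minimal Python sketch for checking the issuer and remaining lifetime of a site's host certificate; the CA directory is an assumption (the usual grid trust-anchor location) and the host name below is a placeholder:

    import socket
    import ssl
    from datetime import datetime, timezone

    def cert_days_left(host, port=443, capath="/etc/grid-security/certificates"):
        """Return (issuer CN, days until expiry) for a host certificate."""
        ctx = ssl.create_default_context()
        ctx.load_verify_locations(capath=capath)   # also trust the grid CAs
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        issuer = dict(x[0] for x in cert["issuer"]).get("commonName", "?")
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
        return issuer, (expires - datetime.now(timezone.utc)).days

    # e.g. cert_days_left("se01.example.org")  - hypothetical host name

A few lines like this per endpoint are enough to track remaining certificate lifetimes across a site's services during a migration such as the one described above.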

ACTION: Romain to organise VC with relevant parties, report back next month.

  • Romain: Suggest Brian Bockelman, David Groep, Dave Kelsey, Jim Basney - anyone else?
  • Brian: Open meeting? Will press crowd, solicit opinions. Stirring pot at GridPMA meeting?

Security - next steps in evolving trust frameworks

  • Postponed for lack of time to June GDB

-- DavidCrooks - 2018-05-09
