Summary of May GDB, May 9, 2018


Agenda

https://indico.cern.ch/event/651353/

Introduction - I. Collier

presentation

  • NorduGrid meeting is focused on development direction, aimed towards active direction of ARC
  • Where is the DOMA meeting? At CERN, a room for 40-50 people on site, plus designed for US attendance via scheduling in the afternoon + Vidyo access

Integrating Dynafed into the ATLAS and Belle II data and workload management systems - F. Berghaus, R. Sobie

presentation

  • Issue - machines run on any cloud, but directly connecting to dCache/EOS limits I/O
  • CloudScheduler running in prod for 6 years, 8th birthday in a couple of days
  • Dynafed redirects
  • Belle-2 is "RO" in the sense that endpoint is RW, but jobs only read
  • Dynafed+Belle-2 on behalf of Marcus Ebert at UVic
  • 6 clouds in use can use single Dynafed
  • Minio? Easy to set up, but not all clouds support S3/Swift...
  • DIRAC proxy used for auth
  • rucio supports gfal2
  • use Dynafed as standard SE
  • implement protocol in rucio to not worry about directories
  • Dynafed caches directory creation
  • rucio protocol to directly upload new filename
  • Maarten: MD5 is better than ADLER32, with less chance of collisions (though there are other reasons for using ADLER32); ALICE uses MD5
  • Maarten: Feasible to try either one, see which one works?
  • Rucio, or other frameworks, could use one immediately and the other where needed? (see the checksum sketch after this list)
  • Frank: Have a Google Doc defining what's needed to implement this procedure. Entirely possible, but needs work to implement the logic. The ask is to ignore [checksums] for now to get numbers on whether it is worth implementing
  • Simone: As well as legacy requirements (only ADLER32 for older files), new files have both. Q: Does Dynafed have internal replication QoS?
  • Frank: No, Dynafed is only redirector, not data management. Have next steps...
  • Simone: Checksums: with replication, need to worry about grid-to-grid, cloud-to-grid, etc.
  • Frank: Part of requirements in various implementations also needs to go into gfal2 etc
  • Fabrizio: From my perspective it is not only two kinds of checksum; the difficulty is that the checksum is read in some cases and written in others
  • Simone: If one is up to the clients vs up to the provider...
  • Fabrizio: Idea to stop checking in short term is the best
  • Frank: Oliver Keeble has done tests on how well cloud storage guarantees data consistency; at our scale we do need to worry about it
  • Automated replication: a dCache issue; when that was disabled, this worked
  • Note on deployment strategy: a small patch takes a month to deploy through the different stages
  • Can make things tricky (an observation)
  • Monitoring: ELK stack at TRIUMF
  • next MC: 85GB in total before, 9TB for next one
  • more tricky to replicate across storage
  • manage "by popularity"
  • was 5GB/file, now 2GB/file
  • ATLAS functional tests after checksum patch
  • Streaming data results - does the https/davs requirement for auth have an effect because of encryption?
  • Q: Dynafed auto-directs to the closest storage for write - what if it is full?
  • Frank: Use python plugin, in design
  • Frank: query each endpoint for free space
  • Maarten: obviously running out of space is a reason for failure. Need retry mechanism in case of other problems?
  • Frank: One thing Dynafed does is status checks on endpoints. That ought to show up functional problems. Retry? Dynafed is only a redirector
  • Maarten: Often not presented with a list of endpoints. If the first fails, retry. Better to have a retry?
  • Fabrizio: Dynafed probes endpoints every 10-15 seconds. If one is nonfunctional, it looks for the [next] one that is up? There are also features of the copy program to try again?
  • Maarten: If everything is behind Dynafed, can you ask for next best?
  • Frank: Metalink, get ordered links for read for next best. Could use something similar for write?
  • Ian: [More discussion] beyond scope for here, more than just your activity, good to see how this develops.
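
As a rough illustration of the ADLER32 vs MD5 point in the checksum discussion above, a minimal standard-library Python sketch that computes both digests in a single pass over a file; the function name and block size are arbitrary choices, and the zero-padded hex form for ADLER32 is only the convention commonly used by grid tools, not anything mandated by the discussion:

    import hashlib
    import zlib

    def checksums(path, blocksize=1 << 20):
        """Compute ADLER32 and MD5 of a file in a single pass."""
        adler = 1                          # zlib's ADLER32 seed value
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(blocksize), b""):
                adler = zlib.adler32(chunk, adler)
                md5.update(chunk)
        return "%08x" % (adler & 0xffffffff), md5.hexdigest()

Computing both digests in one pass adds no extra I/O, so the choice discussed above is mainly collision resistance (MD5) versus compatibility with catalogues that only hold ADLER32 for older files.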

A framework for setting up Lightweight Sites in WLCG - M. Sharma

presentation

  • BOINC shows potential of small sites
  • [slide 9] Centrally configured means at site level
  • [slide 17] Sample implementations of configuration management via link
  • [slide 20/21] Step 5: final deployment
  • Other lightweight ideas on roadmap following v1.0

Lightweight Sites WG proposal - A. McNab

presentation

  • See discussion below

Lightweight Sites Discussion

  • Alessandra: have WLCG Ops Coordination for this purpose?
  • Andrew: This is more focussed on a particular set of models
  • Maarten: Ops Coord is mostly concerned with big sites
  • Alessandra: Not intended that way
  • Maarten: But it is in effect. Entities for development, add an interest group, no-brainer
  • IanC: In a sense, Ops smooths what we've agreed to do; an opportunity to change agreements
  • Andrew: (strategic)
  • Alessandra: GridPP a few weeks ago proposed everything in the previous talk; it was rejected (hackathon)
  • Maarten: More completely, nudge in better direction. Ops coord: near future. This has bit of R&D. Hangs off GDB. All activities interconnected, lightweight initiatives under single umbrella
  • Ian: It does occur: we've had the beginnings of other related activities - HPC? Slightly floundered during broadening into provisioning. Remain separate?
  • Andrew: Different kinds of sites
  • Alessandra: HPC different kind of site
  • Matthias: HPC relevant for lightweight sites
  • Ian: How do we... one problem with HPC is that every machine is different. Possibilities: 1) HPCs sign up to provide consistency, 2) it is up to us to provide a consistent approach
    • pull all use cases together
    • not necessarily the most lightweight, but more applicable?
  • Maarten: Orthogonal. Cannot have homogeneous grid. One site vs another. Matrix, pick one that applies. HPC try for more homogeneity. Unlikely to be universal.
  • Q: CREAM CE in a container? Rids me of a classic service
  • Frank: By design common technologies (containers) specific challenges. HPC different network context.
  • Andrew: HPC restrictive, large resources. Lightweight half a person, more freedom
  • Randall: Belle2 would benefit from this
  • Alessandra: Requests for such sites come very often, sites labelled T1s
  • Simone: First problem: how do you respond if a site shows up with spare capacity? Large sites with reduced manpower will have continuity. For new sites we don't have a recommendation. We have recommendations for "I have this"
  • Alessandra: By now have some recs
  • Maarten: Avoid saying running a grid site is a piece of cake.
    • Walk fine line between things
    • Takes a lot of expertise to solve all the problems they bring
    • ALICE, promise of growth but doesn't happen
    • Our central experts have to worry less about debugging a particular site
  • Simone: A full spectrum of possibilities
  • Andrew: Quality of Service
  • Maarten: ATLAS has accumulated [recommendations]: if your site is in this category, aim for a particular kind of service. In ALICE it is not that kind of service.
  • BOINC in ALICE is a no-no because of security considerations
  • Documented enough to make educated choices
  • Alessandra: Not a straightforward thing to be added; even if they use BOINC, they still have to go through a few days of configuration - T3s, T2s
  • IanC: Recognise that this has distinct use, should be set up, Andrew to pursue. Think about how it relates to other provisioning later

ACTION: Andrew McNab to pursue the setting up of Lightweight Sites WG

CNAF Incident - G. Bagliesi, I. Glushkov, L. dell'Agnello, M. Litmaath, S. Roiser

presentations: INFN, LHCb, ATLAS, ALICE, CMS

  • CNAF: the each-Friday communication could have involved central operations
  • Ian Bird: Reiterate that the CNAF team were open and clear, explaining at every level - [this] was very good. Don't overreact: serious service since 2003, 2 major incidents - this one plus the fire at ASGC. More difficult to handle during data taking? Tape at CERN, plus a delay of 3 months. Almost no risk. Very little point in investing to avoid 10% loss of access for a couple of months. Careful not to overreact, but some lessons to be learnt.

  • Latchezar: Handled beautifully, not a single aspect to comment on. Data distribution? Losses compensated. Consolidated? If you have a few big sites and one goes down, it's a bigger problem. What if you lost 25-30%?
  • Ian B: Don't imagine that with consolidation there would be fewer data stores. For CMS, Fermilab would be a problem; CERN for everyone.

  • Latchezar: Imagine Italy consolidated to one site, which then went down
  • Ian B: But still only temporary
  • Luca: CNAF: Italian consolidation for smaller problems, issues because they rely on CNAF
  • Maarten: Find middle of the road. Maybe not optimal so far, don't overcorrect. Data management landscape easier through consolidation in various ways. Impressive, endorses some design decisions of the past. Look back at this in a period of time.

OSG Update

presentation

  • IU Grid Operations Center (GOC). Cybersecurity team not affected
  • grid.iu.edu -> opensciencegrid.org
  • CILogon 2 allows use of COManage, c.f. EGI Check In
  • Retirement of OSG CA
  • Brian: ... clear indication that funding agencies don't want to support independent project CA
  • Maarten: not quite done deal, still have discussions, what to change and where
  • Romain: As discussed with other additional key stakeholders, the deployment of the Let’s Encrypt (LE) CA on WLCG resources brings significant risks to our trust model as a global community.
  • Brian: Splitting between WLCG/non-WLCG resources
  • Romain: How do you plan to support non-WLCG OSG users at non-OSG sites (for example in Europe), do they need to deploy the Letsencrypt CA?
  • Brian: Most OSG VOs that overlap with the EU are at universities.
  • Stefan: I'm astonished, IGTF are deeply involved. What are the implications for CERN, other sites? Would like to see this discussed. Don't know all the implications. Representing CERN Security am not endorsing this.
  • Brian: Through travel it is hard to get to email chains; F2F, Vidyo - would like it if we can have some telecons. Aggressively contacting sites weekly, soon biweekly, about host certificates, to renew as soon as possible, which gives a 13-month window (a minimal expiry-check sketch follows this list)
  • Romain: I recognise that the OSG is in a tough area, I suggest you approach InCommon urgently to find a solution.
  • Brian: Not just InCommon, other CA might be willing to offer a different model
  • Romain: InCommon is an important IGTF partner, surprised they have not been more involved, might like Jim to make a statement. Why is InCommon not stepping forward? They are in the same community.
  • Brian: Wide majority of sites will be using InCommon. Talking about small handful of T3 sites.
  • Romain: All users directed to use CERN, has been discussed?
  • Maarten: Already Vos can have CERN user certs, so no news there. Online CA, probably easiest for users.
  • Brian: The number of OSG user certs has dwindled recently. Non-WLCG VOs tend not to use X.509/PKI; see the ALICE token-based approach
  • Maarten: The slide says that OSG CA is going to disappear, is that reversible?
  • Brian: CA itself is 2 pieces. CILogon + Registration Authority. CILogon is not going anywhere. Cost of moving RA to different site, only extending funding for that? More expensive than solving IGTF problems at individual sites.
  • Jeff Templon: Cost?
  • Brian: InCommon - the website says cost depends on the university. We pay 15k/yr. All US T2 universities are already paying the fee. That leaves T3s, and again most sites are already covered, which leaves a small handful.
  • Romain: Did you get a quote from InCommon?
  • Brian: Yes. Numbers for T1s came back much nicer than 2014.
  • Romain: Cost for a solution tailored to OSG, not based on previous quotes for other organizations?
  • Brian: No. The fact that we have coverage for WLCG sites not sure they'd want to cover the others [I think]
  • Stefan: 20k/year an official quote? Maximum? Peak at 5,10,20k?
  • Brian: InCommon CA website, depending on the classification of the university. R7/R8 down to 2k, R1 15k (members) - 20k (non-members)
  • Jeff: Analysis about tier layers, problem for T3 sites, sets the scale of the problem. Guess $10k/yr, probably better than that. Recommend not accepting developments costing more than $10k/yr to save $10k/yr. The costs of a Let's Encrypt change will be more than that. Security leaks everywhere; one weak point troubles the entire infrastructure. Threats leak. Before using Let's Encrypt, get all the security guys in one place.
  • Brian: As community, discuss our use of host certificates. Level of assurance are set more appropriately for user certs.
  • Romain: You're pursuing a cost based approach without having a quote from InCommon
  • Ian Bird: Brian implied relaxing the level of assurance for users?
  • Brian: No, feel that user assurance is appropriate, discuss host assurance.
  • IanC: As opposed to assurance for hosts at centre of services?
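
As a rough illustration of the host-certificate renewal point referenced above, a minimal Python sketch for checking the issuer and remaining lifetime of a site's host certificate; the CA directory is an assumption (the usual grid trust-anchor location) and the host name below is a placeholder:

    import socket
    import ssl
    from datetime import datetime, timezone

    def cert_days_left(host, port=443, capath="/etc/grid-security/certificates"):
        """Return (issuer CN, days until expiry) for a host certificate."""
        ctx = ssl.create_default_context()
        ctx.load_verify_locations(capath=capath)   # also trust the grid CAs
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        issuer = dict(x[0] for x in cert["issuer"]).get("commonName", "?")
        expires = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
        return issuer, (expires - datetime.now(timezone.utc)).days

    # e.g. cert_days_left("se01.example.org")  - hypothetical host name

A few lines like this per endpoint are enough to track remaining certificate lifetimes across a site's services during a migration such as the one described above.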

ACTION: Romain to organise VC with relevant parties, report back next month.

  • Romain: Suggest Brian Bockelman, David Groep, Dave Kelsey, Jim Basney - anyone else?
  • Brian: Open meeting? Will press crowd, solicit opinions. Stirring pot at GridPMA meeting?

Security - next steps in evolving trust frameworks

  • Postponed for lack of time to June GDB

-- DavidCrooks - 2018-05-09
