Notes of GDB meeting, June 13th 2012

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=155069

Welcome, Michel

Default is to hold GDB meetings on the 2nd Wednesday of each month
The August meeting is cancelled
Non-CERN meetings: October in Annecy, March 2013 most probably in Germany
Topics for July GDB: m/w support in the post-EMI era, T1SCM report, CVMFS deployment status, RFC proxies, DPM

Post-TEG Working Groups, Ian Bird

Proposal for a list of areas where to focus effort NOW. Create WGs or hold specific meetings (eg pre-GDB):
- Data management
  - Federation: May be a pre-GDB meeting. Show what people are already doing (information flow)
  - Benchmarking: small WG to perform a "market survey" for storage performance analysis tools
  - Networking: Michael Ernst agreed to propose a mandate and members of this WG
- Workload Management
  - Definition of extensions to CEs
  - Information System: need somebody to lead this and continue the work started 1 year ago
- Security
  - Romain: would like to propose 4 working groups (see security presentation later)
- Database: No special issues seen. Pre-GDB for sharing experiences, but maybe nothing more
- Operations
  - Software process post-EMI: short-lifetime WG to define how we manage this one year from now
  - Monitoring: integration & coordination of the many monitoring activities. We need an overview
- Other things:
  - Setting up the Operations Coordination Team (Operations TEG recommendation)
  - Various fora or discussions on the pre-GDB day to discuss interesting technologies (NoSQL, Hadoop, ...)

Discussion:
- Jeff: missing 3 topics: 1) the RFC/SHA2 proxy issue, 2) batch systems, 3) cloud
  - Ian Bird: Ok. RFC plus glexec deployment are 2 issues that need followup. Maybe in the "Operations Coordination Team"
  - Michel: Next GDB there will be a report by Maarten on where we are wrt RFC proxies
  - Ian Bird: put cloud as a pre-GDB discussion forum for the moment
  - Michel: For the batch systems, there is a group in HEPIX. Should benefit from their work and not duplicate
- Wahid: also missing SRM. In the TEG report, the SRM functionalities required were identified. What is not in the report are the alternatives people are working on now
  - Ian Bird: yes, most probably need a group for this
  - Philippe: bottom line: do we keep SRM? do we replace it?
  - Ian Bird: maybe SRM will be used more for archive and not for data analysis systems (more towards xrootd there).
  - Markus: for the Managed transfers, we could work with gridftp only, for instance
  - Philippe: reporting space used/available is something we currently get from the SRM, and we need to keep this
  - Ian Bird: ACTION: Markus and Wahid will write down what the Storage Interface (avoid calling it SRM) Group should do.
- Philippe: a more fundamental problem with these WGs is the lack of a managerial body that can enforce what sites and m/w are doing/deploying.
  - Ian Bird: Indeed. This body does not exist. WLCG does not enforce, it needs to look for consensus
  - Jeff: how do we get from recommendations to making it happen? We do not have an enforcement body
  - Ian Bird: The Operations Coordination Team proposed by the OPS TEG should be empowered by the MB to follow up on the deployment of all these

Storage Accounting Update, John Gordon

StAR: storage accounting record, proposed by EMI. Scope is limited to consumption of storage
- published version, link URL
- being implemented by EMI storage providers
Implementations
- All the EMI providers have an implementation roadmap (see slides for timeline. Accounting sensors released in EMI3)
- Need to talk to other providers. John talked to Castor&EOS developers and they were not aware.
- Fallback solution: if native publishing is missing from storage providers, records can be published through Dashboard, Gstat or similar from bdii info
John presents a PROPOSAL: try and cover the WLCG use case in the Installed Capacities document (http://cern.ch/go/6JPK). See slides for proposed fields.
- Note that both allocated and used storage capacity will be accounted for. Also, we should make sure that SITE info is included in the record
- Need to decide what to publish. Eg: total allocated and used, per site, per VO, per month
- Need to decide what to view. Eg: min/max/average over time period
- Gstat is already collecting information from the BDII. We could do some reports to try and push the improvement in the quality of the data
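
A minimal sketch of the publish/view choices above (totals per site, per VO, per month; min/max/average over the period). The record fields, site names and values are hypothetical and are not taken from the StAR schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (site, vo, month, allocated_tb, used_tb).
# Field names and values are illustrative only, not the StAR record format.
SAMPLES = [
    ("SITE-A", "atlas", "2012-05", 100.0, 60.0),
    ("SITE-A", "atlas", "2012-05", 100.0, 80.0),
    ("SITE-A", "cms", "2012-05", 50.0, 10.0),
    ("SITE-B", "atlas", "2012-05", 200.0, 150.0),
]


def summarise(samples):
    """Group samples per (site, VO, month) and report min/max/average used space."""
    groups = defaultdict(list)
    for site, vo, month, allocated, used in samples:
        groups[(site, vo, month)].append((allocated, used))

    summary = {}
    for key, values in groups.items():
        used = [u for _, u in values]
        summary[key] = {
            # Report the largest allocation seen in the period.
            "allocated_tb": max(a for a, _ in values),
            "used_min_tb": min(used),
            "used_max_tb": max(used),
            "used_avg_tb": mean(used),
        }
    return summary


report = summarise(SAMPLES)
```

Keeping allocated and used capacity side by side in each summary row matches the note that both will be accounted for; per-DN/user detail is left out on purpose.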

Discussion:
- Ian Fisk: We use the CPU accounting to justify the usage of the CPU, not for decision making. We should scope for the storage accounting in the same way. For instance: do not invest time in per-DN/user accounting, since it is not used
- Michel: This service looks a bit far away in time. Should we think of some interim solution?
- Ian Bird: are the interfaces to the storage systems sufficient for the experiments to know the usage of the storage systems?
  - Philippe: We have an agent polling SRM every 10min. Use this for monitoring&accounting. BUT the only info that SRM is not providing is for tape usage. We have only the disk information
- Claudio: in CMS trying to use the bdii information and aggregate it in the dashboard. Unfortunately, most of the sites publish unreliable information
- Michel: what do we do?
  - John: we should have prototypes and reports from a subset of sites and implementations
  - Maarten: we have to ask EMI about the time scales when we can start testing some of these implementations
  - ACTION: John Gordon agreed to ask EMI about these time scales

Information system, Maria Alandes

Caching mode in the bdii is available. It is the default config from the EMI1 update in March
- Deployment status: around 80 top-bdii (19 EMI, 17 gLite, 29 publishing wrong version information). We have no way of knowing whether caching is enabled
  - NOTE: it would be good to reduce the number of top-bdiis. Will check the status of the old WLCG proposal along these lines
- Feedback from the CERN site and experiments is that the bdii seems to work better than before
- For sites: how to configure the caching mode is well documented: https://tomtools.cern.ch/confluence/display/IS
Remember a failover mechanism exists in clients (lcg_utils, GFAL ...): 3 top-level bdiis should be configured
glue-validator: new package in EMI1&2. Checks if the info published by a service is Glue compliant
- Site managers can start using it to check their info. Documentation: https://tomtools.cern.ch/confluence/display/IS
- Plan for the future: automate the run of glue-validator
Issue: glue-info-provider-service is used by most services. Its future support is not clear. Need to come to a conclusion soon.
Glue 2.0 status:
- Benefits: is a simplified model, adds m/w version, support for multicore jobs
- Deployment status: now at 53%. Progressing very slowly
  - EGI has set a deadline for non-Glue2 site-bdiis: end Sep 2012
  - From EMI2, all services must publish in Glue2
  - Clients (ie GFAL) can not migrate until all info is published in Glue2
Evolution of the Information System
- Recommendations from OPS and WM TEGs (see slides). Long term plan: refactor the IS into "three pillars": 1) service data (static); 2) state data (dynamic) and 3) metadata (quasi-static)
- EMIR, The EMI Registry: will address the service discovery problem. Distributed in EMI2. See main features in slides.
- "ginfo" is a client developed at CERN. Will be able to query EMIR and also the BDII. Candidate for replacing lcg-info and lcg-infosites
- IS future panorama (see diagram in slides): get rid of the site and top bdiis
  - EMIR: candidate for static information
  - For the state data: message based Resource Information System. Still a lot of open questions

Discussion
- Maarten: where is OSG? Now they are in the bdii; we need to make sure we can do the aggregation at the WLCG level as well. Maybe we need to pass the requirement to EMI that they need to be compliant with OSG.
  - Markus: This is not that urgent. We still need to evaluate EMIR. If we are not convinced it is robust enough to be used as a core service, we will not go away from the caching bdii
  - Alberto: EMI and OSG are organizing some meetings to discuss issues like this. The issue is not being ignored.
- Davide: The WM TEG recommendation was to have an IS that is as simple as possible. We would like to implement multicore support quite early. Do we need to wait for Glue2.0?
  - Markus: sure it will take time to move to Glue2.0. But we can not patch the old glue. Maxcores, etc, need to go into Glue2.0
  - A. Girolamo: from the experiment we would like to see the maxcores attribute associated with a CE+queue (GlueCE), or maybe with a SubCluster (discussion)
  - Michel: This is something we will need to discuss in the technical groups dedicated to this (FOLLOW)

WN Security pre-gdb, Romain

Authentication (CRLs, etc) cannot be used for banning, because a user can be malicious without being compromised. Authorization is the correct way: removing a user from VOMS is not enough, due to long-lived proxies. Central banning is needed to ensure appropriate incident response
- Central banning deployment PROPOSAL: "All WLCG sites must implement necessary mechanisms to pull central banning lists from the central Argus instance, for example by deploying Argus locally when applicable. The deployment of these solutions should be followed up in the GDB"
Traceability: necessary for VOs and sites to collaborate to share information on this.
- Concern with traceability of data: e.g. storage. This needs to be followed up.
Virtualization on the WN: A WG should be appointed to make recommendations for sites to fulfill the logging and traceability policy on the WN (FOLLOW)
Using external clouds: There are significant security concerns in using external cloud providers. A WG should be appointed to understand the policy & operational issues it raises (FOLLOW)
- Discussion:
  - Philippe: Why do you care about WLCG policies at sites that have nothing to do with WLCG?
  - Romain: The WLCG policy says that if you provide resources to the Grid, you should be compliant with the WLCG security policies
  - Ian Fisk: Need to get this clarified. We have lots of examples of sites that "burst" to 3rd sites. Today these may be opportunistic resources, but it will be public clouds at some point. What is the worry?
  - Romain: The entire security model has been based on the fact that we control the resources.
  - Ian Fisk: Key issue is to realise who is taking the risk. A site decides to make their pledge through amazon: the contract is between the site and Amazon. The site needs to be aware
  - Conclusion: Some discussion is needed. Agreement that it is all about risk. Make clear which are the risks and who is taking them (FOLLOW)

Proxy lifetime: can we reduce the VOMS proxy lifetimes? consensus is that this is a good idea, but followup at the GDB is needed at the technical level (FOLLOW)
- Philippe: why do you want to enforce this? it makes life more complicated!
- Romain: it is always a compromise between security and ease of use

Pool account recycling: PROPOSAL to recycle pool accounts only after they have been unused for 6 months.
- Michel: how do we do this? ask EMI to configure this by default, or chase the sites?
  - Maarten: both. Maarten: ACTION to pass this requirement to EMI.
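
A minimal sketch of the proposed recycling rule. The account names and last-use dates are hypothetical, and "6 months" is approximated here as 183 days; how a site actually records last use is not covered in these notes:

```python
from datetime import datetime, timedelta

# Assumption: "6 months" approximated as 183 days.
RECYCLE_AFTER = timedelta(days=183)


def recyclable_accounts(last_used, now):
    """Return the pool accounts whose last use is older than RECYCLE_AFTER."""
    return sorted(
        account
        for account, when in last_used.items()
        if now - when > RECYCLE_AFTER
    )


# Hypothetical account names and timestamps, for illustration only.
last_used = {
    "pool001": datetime(2011, 11, 1),
    "pool002": datetime(2012, 5, 20),
}
candidates = recyclable_accounts(last_used, datetime(2012, 6, 13))
```

Only accounts idle for longer than the threshold are returned, so a recently used account is never handed to a different user.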

Very long term future: Identity Federation, i.e. having a model where x509 would be hidden from users: No conclusion, no proposal, just highlighting that it will be worth participating in these discussions in the future

EMI News, Cristina

See details of EMI updates and timelines in the slides
EMI2 released 21st May. Full support for SL5 and SL6 64bit. Partial Debian6 support.
- 5 new products: CANL, EMIR, EMI-Nagios, Pseudonymity, WNoDeS
- Full integration of ARGUS authz in the CEs and Storage.
- Better support of Glue2 in all relevant services.
- Initial implementation of the EMI Execution Service interface in all CEs
- Support for NFS4.1, http, webdav in the SEs
- Messaging-based SE-FileCatalog sync service (SEMsg)
EMI updates schedule: continue providing continuous delivery of EMI1 and EMI2 updates. 1 update cycle/month.
Planned product updates: top bdii, blah, dpm and lfc, gfal/lcg_utils, WMS (see version numbers and details in slides)
Plans for Y3: UI/WN tarballs, Debian 6 porting (aim is to have it by the end of October), Java APIs on Maven Central, release into EPEL/Debian repositories
Discussion
- Michel: is the WN in SL6 ready? Cristina: yes, in EMI2, but 32bit binaries are missing
- Markus: proposes that the 5/6 sites which have deployed special queues to test the EMI1 WNs, should move now to test the EMI2 WN.
  - Maarten: the tests from ATLAS and CMS have not reached a conclusion yet. So propose to keep the test EMI1 resources as they are.
  - Markus: better to put the effort in testing EMI2.
  - Maarten: do we want to send the message to sites: do not upgrade ever to EMI1, go directly to EMI2?
  - Markus: maybe yes. Otherwise there is the risk we will never have time to certify EMI2 before the end of the project
- Cristina: confirms that the EMI2 WN works with the EMI1 CREAM CE

Globus s/w support at OSG, Brian

The grant providing globus support dedicated for OSG expires this summer
Organizations with interest (i.e. WLCG) should push to get commit access
- Ian Bird: we did this 10 years ago. Maarten was allowed to commit. It did not work very well. Then we collaborated with VDT to do this. Not sure if this will work
Technical dependencies:
- GSI and GridFTP: widely used. For gridftp alternate implementations and alternate solutions exist => Not concerned
- GRAM: not widely used at our scale. Bugs discovered weekly => we are concerned! Alternatives exist; need some time to investigate them.
  - OSG CE: OSG working to investigate the possibility of switching the OSG-CE to using CREAM or Condor-CE technologies. The OSG Executive Team will make a decision. Will know about 1 or 2 weeks after July GDB.
    - Note there are real impacts of adopting one of these. E.g. could be that if we adopt CREAM, direct gliteWMS submission to OSG will be dropped.
      - Maarten: moving to CREAM does not break the gliteWMS, otherwise EMI sites would not work
      - Brian: OSG has no concrete plan (yet?) to provide support for Glue2.0 (Maarten asks for clarification)
      - ACTION: to clarify with OSG what are their plans regarding Glue2.0

EMI Sustainability Plans, Alberto di Meglio

Software products
- Clarification: EMI is a collaboration project (collaborating product teams). The end of EMI is not the end of the product teams.
- The list of sw products has been circulated: contains all the products, with partners committing effort to each product. This list can now be made public, and will be circulated.
- A few gaps have been identified:
  - Yaim: there is no plan in EMI to move away from yaim. However, there is not much enthusiasm for supporting yaim beyond the end of EMI. This is an open issue.
  - RAL-SAGA-SD: will not be supported. But from this morning's presentation it seems that there is an alternative: ginfo
  - Also mentioned as minor problems: delegation java, torque config
Global Tasks
- Build: ETICS is not required anymore for building, from EMI2. Instead, standard tools are used (mock, pbuilder)
- Certification: now it is done by PTs, but testbed not really large-scale. This is an open issue
Coordination: being discussed currently. The plan is to converge by September. One idea is to do it similar to ScienceSoft. Still under discussion

Collaborations:
- WLCG: proposal to have a meeting among WLCG, EGI, EMI. A date that suits everyone who needs to attend still has to be found.
- EGI: same as above, plus discussion regarding non-HEP
- OSG: meeting being planned for end June to discuss common activities, etc.

Discussion
- Michel: for WLCG it is very important to draw up the list of products outside EMI that matter to us (dashboard, etc) and that are related to the EGI HUC effort, which will also end
- Ian Bird: the outcome of the above WLCG-EMI-EGI meeting needs to be how we manage software in the future. Also to discuss: how we do certification, staged rollout and deployment in general.

Communicating Machine features to batch jobs, Tony Cass

This talk was already presented in the GDB 14 months ago.
The HEPIX VWG proposed the creation of 3 files in /etc/machinefeatures for communicating VM features to jobs (see slides for details). Felt to be useful for real machines as well
Implementation status
- LSF: done and implemented at CERN
- PBS: NIKHEF has started work to deliver the scripts
- GridEngine: First contact done with IN2P3
Discussion: how to make progress?
- Philippe: Ok. Let's try and test it first at CERN.
- Claudio: can we add new features like the multicore ones, since this mechanism is already agreed? Tony: the proposal was meant to be extensible.
- Discussion on whether hs06 file should contain the HS06 per core or per job slot.
  - MJouvin: cuts the discussion short. Proposes to stick with what we have and to return to the discussion once something is deployed (FOLLOW)
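
A minimal sketch of how a job could read the proposed files. Only the directory /etc/machinefeatures and the hs06 file name appear in these notes; the one-value-per-file layout is an assumption, and the exact file list is in the slides:

```python
import os


def read_machine_features(directory="/etc/machinefeatures"):
    """Return {file name: stripped content} for each regular file in the directory.

    Assumes one value per file. An empty dict means the mechanism
    is not deployed on this node.
    """
    features = {}
    if not os.path.isdir(directory):
        return features
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as handle:
                features[name] = handle.read().strip()
    return features


# A job would look up the value it needs and fall back when it is absent.
hs06 = read_machine_features().get("hs06")
```

Reading every file in the directory, rather than a fixed list, keeps the job-side code unchanged if new features (such as the multicore ones mentioned above) are added later.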

gLexec update, Maarten Litmaath

A few more sites now claim glexec support in the bdii (MyWLCG monitoring available at http://cern.ch/go/PX7p)
Documentation on glexec deployment: https://twiki.cern.ch/twiki/bin/view/LCG/GlexecDeployment
Tier1/Tier0 status for glexec tests:
- OPS: BNL not yet setup
- CMS and LHCb: all sites ok
- ATLAS: configuration issues for BNL and IN2P3 being addressed

Experiment plans:
- CMS: started opening tickets where glexec does not work
  - Claudio: main reason for CMS pushing this is the security challenge in July. In the long term, it should be WLCG who asks the site to support glexec
- ATLAS: working on glideinWMS backend for Panda, running glexec tests for a few weeks
- ALICE: will look into implementing special proxies with critical extension only understood by glexec. This will require some development
- LHCb: will check again if glexec support in DIRAC needs adjustments

Michel: suggests to send e-mail to the CB list for the site representatives to pass the message to the technical people that glexec should be installed and configured at all sites.

Federated Identity pilot, Romain

Federated Identity Management document (vision, requirements and recommendations) endorsed by MB on 5th June: https://cdsweb.cern.ch/record/1442597
Need to decide on a pilot project: non-browser based. Services enabling WLCG resources using home-issued generated credentials (see slides for diagrams)
- Proposed plan: 1) proof of concept; 2) architecture design; 3) pilot service
CALL for interested experts, sites, VOs to join this effort.
Discussion
- Alberto: STS (Security Token Service) is a translation service between different credential systems. The implementation of STS in EMI is late. Asks if there is room for collaboration, and in particular if the effort in EMI can be stopped
- David Kelsey: we should add STS to the list of possible solutions, and EMI STS experts are more than welcome to join this effort

-- GonzaloMerino - 14-Jun-2012

Topic revision: r4 - 2012-06-26 - MaartenLitmaath
