Summary of GDB meeting, June 12, 2013 (CERN)

Agenda

https://indico.cern.ch/conferenceDisplay.py?confId=197804

Welcome - M. Jouvin

Upcoming GDB meetings:

  • No GDB in August
  • No changes in the meetings agenda till the end of the year
  • pre-GDB to be confirmed for the upcoming meetings
  • GDB outside CERN: in general this seems to be a good idea. It would be interesting to make meetings outside CERN more attractive (keynotes, site demonstrations...). It is an opportunity to visit T1s and T2s. The next meeting outside CERN is very likely to take place in November in Copenhagen, to be confirmed at the next GDB.

GDB topics:

  • Contact Michel to propose topics to be presented in the next GDBs.
  • See slides for the topic list of the July GDB meeting.
  • See slides for the topics of upcoming pre-GDBs (cloud, ops coordination, maybe batch system support?)

Actions in progress:

  • Tracked in twiki: https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress
  • StAR Implementation: T. Ferrari mentions that a deployment plan for storage accounting and usage record publishing will be discussed with sites. More than 50 are already publishing StAR records. There is already a production accounting DB that can collect accounting information. There will be a report about this particular action at the July GDB.

See slides about forthcoming meetings (LHCOPN/ONE, CHEP, EGI TF, HEPiX...)

Security Update - D. Kelsey

IGTF news

SHA2 time line agreed

  • Now: IGTF CAs issue SHA1 certificates by default. They may issue SHA2 certificates on request
  • October 2013: IGTF CAs should issue SHA2 certificates by default and should begin to phase out SHA1
  • April 2014: New CA certificates SHA2 only. Existing CA certificates re-issued with SHA2. Existing root CA certificates continue SHA1
  • October 2014: CAs start to publish SHA2 CRLs
  • December 2014: SHA1 certificates expired or revoked
  • Discussion:
    • T. Ferrari asks whether we want to make the replacement of middleware that is not SHA2 compliant mandatory, or whether we leave sites free to decide whether to upgrade to SHA2 or not. There are middleware product versions still supported (until April 2014) but not SHA2 compliant. D. Kelsey suggests that SHA2 compliance should be mandatory if there are SHA1 vulnerabilities that can be exploited. M. Litmaath adds that vulnerabilities in SHA1 are already published but difficult to exploit, and he suggests applying a pragmatic approach, understanding which services already work with SHA2 (an illustrative certificate check is sketched below).
    • M. Jouvin asks whether SHA2 is supported in EMI 2 or only in EMI 3. M. Litmaath replies that support in EMI 2 is available in most services. T. Ferrari adds that dCache is not compliant yet. If we want SHA2 compliance before October 1st, we need feedback on what to do by the end of June.
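
For illustration only (not part of the presentation): a minimal sketch, in Python with the third-party cryptography package, of how a site could check which signature hash algorithm its host or CA certificates use, to spot certificates that will have to be reissued under the SHA2 timeline above. The file paths are just examples.

    # Illustration only: report the signature hash algorithm of PEM certificates,
    # e.g. to spot remaining SHA1-signed host or CA certificates.
    # Assumes a recent version of the third-party 'cryptography' package.
    import sys
    from cryptography import x509

    def signature_hash(path):
        with open(path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read())
        # 'sha1' means the certificate falls under the phase-out above;
        # 'sha256'/'sha512' belong to the SHA2 family and are fine.
        return cert.signature_hash_algorithm.name

    if __name__ == "__main__":
        for path in sys.argv[1:]:   # e.g. /etc/grid-security/hostcert.pem (example path)
            print(path, signature_hash(path))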

IGTF IOTA profile: T. Cass asks how the common name will be chosen (e.g. the passport name?). D. Kelsey answers that it is under discussion but that there are clear requirements for the common name (see slide 8).

Revised Security Policies

  • Site Operations Security Policy renamed Service Operations Security Policy to recognize that not all services are run by sites
    • Now requires that every institution running services run an automated procedure to download and implement user suspensions
    • Now addresses the end of security support for software and the retirement of a service
    • Already adopted by EGI, would like to see adoption by WLCG
  • Grid AUP: EGI council decided to require its users to acknowledge support and resources used, requested that this should be in the AUP
    • Says "all papers, preprints, talks...": may be a bit excessive... to be refined...
    • Difficulty: need support from VOs, not only from EGI... Users don't register with EGI but with a VO. Agreement is between EGI and the VO.
    • Also need endorsement by WLCG: in fact experiment responsibility...
    • D. Kelsey asks for feedback on the policy changes. I. Bird asks how users will actually acknowledge in practice all the support and resources used. I. Fisk asks whether all papers and presentations will need to include this acknowledgement; it could be a potentially long list. D. Kelsey explains that this text has been requested by the EGI council and has been reused from the one used by the XSEDE project. M. Schulz asks whether this is different from the acknowledgement that is included in EC project grants. I. Bird asks whether acknowledging WLCG should include all the rest.
  • Security for Collaborating Infrastructures (SCI):
    • Trust framework involving EGI, OSG, PRACE, EUDAT, WLCG, XSEDE. http://www.eugridpma.org/sci/
    • R. Wartel mentions that it'd be useful to schedule a more detailed presentation of SCI at a future GDB and/or MB.
    • Security Operations after the end of EGI-INSPIRE: T. Ferrari explains that in recent conversations with Kostas Glinos (European Commission), he expressed the EC's interest in supporting integration activities across DCIs. Security is one of the integration activities that has been identified, so it seems there could be possibilities for funding in Horizon 2020. EGI has started to think of a plan B, i.e. asking for a membership fee plus external funding to deal with time-consuming tasks like incident response.

HEPiX report - H. Meinhard

Last meeting: April 15-19, Bologna

  • 83 attendees
  • 70 presentations from 40 countries
  • Many informal contacts in addition
  • Many active tracks

Batch system reviews almost everywhere

  • Univa emerging as the major actor in the GE world
  • Several sites mentioning SLURM evaluations

New solutions emerged with probably a major role in the future

  • Cloud: OpenStack
  • Configuration management: Puppet

Topics to be monitored, early presentations

  • Distributed file systems: Ceph
  • Non x86 platforms

Several focused WGs

  • IPv6: need to speed up support; support for IPv6-only WNs seen as needed by mid-2014 (in particular for CERN)
    • Some problematic products: all batch schedulers, OpenAFS (see slides of the session on OpenAFS/IPv6), ActiveMQ
    • Most testing currently happens on SL5
    • 2013 goals: increased participation, agreement on dual-stack services
  • Storage: now completed, final report at next meeting
  • Benchmarking: on hold, waiting for new SPEC benchmarks
    • Increasing reports that current benchmark doesn't reflect properly new HW power
  • Configuration Management (Puppet): just started, led by Ben Jones (CERN) and Yves Kemp (DESY)

AFS: still a core service at many sites but no foreseeable IPv6 support

  • Not seen as a burning issue so far but not clear how to deal with it, follow-up decided

Next meetings

  • Fall: Ann Arbor, October 28 - November 1
  • Spring: Annecy, May 19-24


Ops Coordination Report - A. Forti

CVMFS:

  • v 2.1.10 released on 23.05.2013. I. Collier adds that it was retired immediately from the repository due to a bug (unfortunately some sites had already installed this version)
  • v 2.1.11 under test to be completed in 1 or 2 weeks.
    • Release to be used by SL6 sites
    • Provides the RW fix needed by ATLAS
    • If other sites are ready to join they are welcome but should coordinate with the TF
  • New SAM/Nagios probe, currently used in pre-prod by LHCb and CMS
  • LHCb and ATLAS: deployment completed
  • CMS and ALICE set a deadline of April 2014
  • ALICE currently testing, initial deployment in July, use in production during Fall

gLexec

perfSONAR

  • v3.3 under testing with positive results. It will be the recommended version when it is released
  • New dashboard for WLCG supported by OSG to be discussed
    • TF is working on how to publish meshes (use the mesh configuration) in it
  • Nagios tests under discussion with EGI

FTS 3

  • Demo held May 22, first release expected in one month
  • Deployment model under discussion with developers to support shared configuration and scheduling information
  • CERN and RAL investigating FTS 3 as a production quality service
  • No deployment schedule ready yet
  • Discussion:
    • M. Jouvin asks whether the deployment model is now clear: only some sites running FTS 3, rather than deployed everywhere?
    • M. Girone explains that this is an experiment decision. What has been asked of the FTS 3 TF is to prepare a deployment plan that is simple and easy for all sites.
    • I. Fisk says the deployment model is pretty simple: one FTS 3 per T1. This has to be tried in production to see how it behaves. Ideally we would like to shut down the FTS 2 services using Oracle.
    • I. Collier mentions that at RAL, FTS 3 is running on MySQL. It will be upgraded to production state very soon.
    • S. Roiser mentions that LHCb prefers to run as few instances as possible: a production one and a backup.

SL6 migration

  • T1s Done: 6/15 (ALICE 4/9, ATLAS 4/12, CMS 2/9, LHCb 3/8)
    • CERN: lxbatch = 8K slots, lxplus = doubled capacity (66 4-core/8GB VMs)
      • Intermittent login problems at lxplus
    • BNL and NIKHEF: big bang migration done at the end of May
    • NDGF almost complete
    • CCIN2P3: will migrate 25% of their resources next week
    • RAL: 10% of the CPUs in a new CE (ARC) with a new batch system (Condor), used for evaluation
    • V. Sapunenko says that CNAF is planning to move to SL6 in July, after new HW delivery.
  • T2s done: 20/131 (ALICE 8/41, ATLAS 7/91, CMS 11/65, LHCb 4/45)... need to ramp up
    • Updated migration procedure with extra links for ATLAS: in particular a CVMFS issue due to the unusual handling of RO file systems
    • New HEP_OSlibs (1.0.10) needed by ALICE

SHA-2: testing instructions are now available

  • All experiments have some activity on this topic
  • EGI also involved in testing
    • All EGI core services are SHA-2 ready
    • Migration of ops VO from VOMRS to VOMS-Admin urgently needed but no timeline defined yet
  • No strong deadlines set yet but goal is during the autumn
  • Most EMI-2 services require an update

UMD 3.0.0 released: an issue with getting proxies from BNL is being investigated

Monitoring

  • Upcoming SAM Update 22 with more metrics affecting WLCG critical tests

Tracking tools

  • Savannah to Jira migration: migration steps have been updated
  • TEAM/ALARM tickets belonging to more than one experiment: too big an effort to implement; a workaround is in place

Deployment Machine/Job Features - S. Roiser

Original ideas contributed by HEPiX Virtualization WG

  • Uses two environment variables pointing to directories containing 1 file/feature: $MACHINEFEATURES and $JOBFEATURES
  • See Twiki page for details

LHCb is proposing one interface usable by all VOs to provide individual information about the actual WN to a job

  • Can be used by a VO pilot to customize/select the payload appropriately
  • Should work both in a regular batch system and in virtualized environments
  • Ideas taken from the Poncho system developed at Argonne for communication between a VO and an IaaS infrastructure
    • Bidirectional: the VO pushes requirements, the infrastructure asks a user to shut down an instance
    • Probably no need for the bidirectional mechanism for grid or our usage
  • Use cases: get info on cores allocated, get the HS06 power, "backfilling" of a VM where a new normal payload cannot run with some useful work (MC for example), notify VMs to shut down (e.g. in HLT farms), rebalance VO shares...
  • Proposal: provide a standard API (or script) to the VOs that returns the information in a standard format, independently of the actual format used to store it (a minimal reader sketch follows this list)
    • Use a JSON format for a machinefeatures file
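
For illustration only (not from the talk): a minimal sketch, in Python, of a reader for the mechanism described above, where $MACHINEFEATURES and $JOBFEATURES point to directories with one file per feature; the proposed JSON variant for the machinefeatures file is also handled. The feature names used at the end (hs06, jobslots) are illustrative, not normative.

    # Illustration only: read machine/job features as described above.
    import json
    import os

    def read_features(env_var):
        """Return {feature: value} from the directory (or JSON file) that the
        given environment variable points to; empty dict if unset."""
        path = os.environ.get(env_var)
        if not path:
            return {}
        if os.path.isfile(path):          # proposed JSON-format machinefeatures file
            with open(path) as f:
                return json.load(f)
        features = {}
        for name in os.listdir(path):     # classic one-file-per-feature layout
            with open(os.path.join(path, name)) as f:
                features[name] = f.read().strip()
        return features

    machine = read_features("MACHINEFEATURES")
    job = read_features("JOBFEATURES")
    # A pilot could, for example, use the advertised HS06 power and slot count
    # (if present) to select a matching payload.
    print("hs06:", machine.get("hs06"), "jobslots:", machine.get("jobslots"))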

What next

  • Specify syntax and possible values
  • Interface with IaaS systems
  • Provide missing implementations
  • Push for deployment: an Ops Coord TF?
    • TF will not take care of the development
    • Concentrate on the multi-core support

Discussion

  • J. Templon mentions that upgrades in PBS and TORQUE are trivial for SARA and NIKHEF. S. Roiser mentions that in his experience upgrades are not always easy.
  • I. Fisk mentions that from the VO perspective it is nice to have this information, like the number of cores. Maybe it is worth investigating how this is done in other contexts, like Amazon EC2. He is skeptical about the usefulness of SpecInt information, which could put an additional load on sites for little added value to experiments: experiments are not necessarily interested in a generic benchmark, and it is easier to run a short program representing the VO benchmark. M. Jouvin says that having a "standard value" per worker node published by the site is not necessarily very difficult as long as high precision is not required: the site can publish it and an experiment can ignore it and use its own benchmark.
  • M. Girone asks who will be driving this effort. It seems it is interesting to LHCb. What about the other experiments? S. Roiser answers that the interest from other VOs needs to be understood to decide whether this can be started as an Operations TF.
  • C. Walker asks about the multicore support use case and why we would need this new mechanism to learn about it. S. Roiser says it is not clear how to address this use case yet.
  • I. Fisk explains that CMS is planning to address the multicore use case by sending a pilot, learning what is available in the machine and then choosing the type of jobs that best match the available cores. He adds that when we move to pilots it is difficult to know what is needed at job submission time, so it is better to discover at runtime what resources are available (late binding).
  • M. Jouvin asks whether there is interest in other VOs to try this.
    • I. Fisk says there is interest as long as this is a coordinated effort and sites are also involved.
    • Ueda says ATLAS is also interested in this feature if widely available
  • J. Templon says that he would like to keep the site part of this as lightweight as possible. At the same time, sites need to be involved to understand the experiment requirements.
  • M. Girone explains that it should be clarified who has to coordinate this, should it be at Ops coordination? T. Cass adds that this was discussed in the MB. It was agreed that Batch system representatives will take care of the implementation details.
  • S. Roiser asks whether any LSF sites have been involved since LSF already provides this implementation. U. Schwickerath confirms this and adds that there are site-specific implementation details; it is not as simple as installing a few packages.

T0 Update - H. Meinhard

Agile Infrastructure project

  • Moving to production, putting effort on stability and scalability
    • Now run by OIS
    • Configuration management run by PES
    • Monitoring in facility group
  • VM provisioning using 'Ibex', based on OpenStack Folsom (Grizzly upgrade ongoing); cattle vs. pet approach, this covers the cattle use case
  • Active contributors to the OpenStack community, with impact on some development decisions of the project
  • Ceph and ownCloud under investigation

Wigner CC

  • Official inauguration on 13.06.2013 with CERN management and Hungarian authorities
  • Operational 2x100 Gbps links but less stable than expected
  • Equipment installed and running: 80x4 dual-CPU, 80 SAS boxes (24x3 TB)

CERN CC

  • Barn in building 513 hosting critical equipment officially inaugurated on 07.05.2013
  • Servers and storage installed, services moving over
  • Fire in building 513 caused significant smoke damage and left physics equipment without UPS for several weeks

SL6 Migration

  • Alias migration took place on 06.05.2013
  • Batch capacity level in SL6: 15 % (after alias migration, jobs that do not specify platform end up on SL6)
  • Technical issues:
    • sssd crashes preventing logins
    • SL6 VMs not perfectly stable
  • Comments:
    • C. Walker asks whether it is a good idea, when WNs are moved to SL6, to also upgrade to EMI 3. M. Litmaath replies that there should first be an exercise to validate EMI 3 WNs (CERN migrated to SL6 with EMI-2).

Services

  • WMS: successfully migrated to EMI-3 but still SL5
  • Grid services either in EMI 3 or in the process of upgrading to EMI 3 (see slides for all the details)
  • Batch services: security vulnerabilities addressed, SLURM under evaluation
  • VOMS: preparing to upgrade to 3.0.1, validating that it will allow phasing out VOMRS
  • FTS3: preparing to move from development to production
  • Many services very up to date with the latest available releases
  • Version control, issue tracking: Git established and popular (231 projects), Jira well received (117 projects)
  • CERN CA: SHA2 being tested
  • Databases: the Oracle license and support offer has been approved by the CERN Finance Committee for 2013-2018. Significant cost. May need to consider migration to another RDBMS for 2018, but no urgency and no need to think about it right now.
    • All WLCG sites can use a bundle of Oracle packages but a significant cost for CERN: something to be revisited before next renewal
    • "Lost write" issue finally solved and tracked down to a NetApp firmware bug

Site Monitoring and Alarming: New monitoring and alarming needs related to VO Nagios tests

Site Notifications with SAM and Dashboards - M. Babik

WLCG reporting proposal:

  • Currently:
    • Weekly reports to MB: T0/T1
    • Monthly reports to MB, CB and GDB: T0/T1 (OPS and VO specific), T2 (OPS)
    • Quarterly: compiled manually from monthly reports
  • Proposal:
    • Monthly: Joint T1/T2 reports (VO specific, remove OPS)
    • New reports on T2 performance (critical tests defined by experiments: http://cern.ch/go/H9hj)
      • CE, SRM, LFC and WN
      • Operational tests (developed by PTs) and experiment tests
  • Discussion about Sites vs Experiment tests
    • J. Templon mentions that it is not clear what site tests and experiment tests mean. We have to be clear with the wording.
    • C. Walker mentions that sites need to understand what a test is doing so that in case of failure they know where to look and can try to reproduce it. M. Babik explains that the experiment tests should be very well documented. M. Litmaath mentions that it is generally difficult for a site manager to reproduce a test as it often involves experiment credentials; in that case, the debugging should be done on the experiment side.

Impact of the move to VO tests on site notifications: discussion needed to better understand what is really needed by sites

  • Currently tests run on experiment testing infrastructures and are aggregated at the SAM central per-VO server to feed, in particular, the WLCG dashboards
    • SSB is the key dashboard piece to have the site view, with some notification capabilities
    • Experiments do manual notification with their computing shifts or operations team, often relying on SSB and sometimes cross-checking with their own tools (e.g. DIRAC for LHCb)
  • Notifications go through ROD/COD
    • ROD: Regional Operator on Duty, receives notifications from the regional NAGIOS and processes them in the EGI Operations Portal, contacting sites using GGUS
    • COD: Central Operator on Duty, oversees RODs and gets notified when alarms are not handled within 72h or GGUS tickets have been open for more than 1 month
  • With Nagios, an extension is available to integrate data from regional Nagios in the site instance: Site Nagios
    • Provided/supported on a best-effort basis
    • Native Nagios notification can be used to fine-tune notification per service
    • A message bus can be used to achieve the same goal
  • Filtering at experiment Nagios remains complex and generally results in many notifications being sent
    • Notifications accessible through the messaging infrastructure (a subscription sketch follows after this list)
  • SSB (Site Status Board): monitoring and notification at the experiment level, site admins can subscribe to notifications
    • Also provides detection of missing results that are not available with Nagios
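
For illustration only (not part of the presentation): a sketch of how a site could subscribe to notifications delivered over the messaging infrastructure mentioned above, in Python with the third-party stomp.py package. The broker host, port and topic name below are hypothetical placeholders, and the single-frame listener callback assumes a recent stomp.py version.

    # Illustration only: consume site notifications from a STOMP message bus.
    # Broker host/port and the destination name are hypothetical placeholders.
    import time
    import stomp

    class NotificationListener(stomp.ConnectionListener):
        def on_message(self, frame):          # recent stomp.py: one frame argument
            print("notification:", frame.body)

    conn = stomp.Connection([("broker.example.org", 61613)])  # placeholder broker
    conn.set_listener("", NotificationListener())
    conn.connect(wait=True)
    conn.subscribe(destination="/topic/site.notifications",   # placeholder topic
                   id=1, ack="auto")
    time.sleep(60)      # keep the connection open while notifications arrive
    conn.disconnect()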

WLCG monitoring project started with the aim of simplifying the current monitoring infrastructure

Discussion

  • J. Templon explains that site NAGIOS is used at Nikhef.
  • J. Flix mentions that site NAGIOS is also used by PIC and it is very convenient since everything is integrated in one tool.
  • V. Sapunenko reports that site monitoring in CNAF is done using site NAGIOS integrated with SAM results.
  • R. Vernet explains that for IN2P3-CC, Nagios is very useful for notifications. Only site tests are used for the time being; it would be good to get notifications from the experiment NAGIOS, but it is not clear how to activate this. M. Babik gives some insight on how to do this.
  • J. Templon shows the dashboard used at Nikhef. He also mentions that when the message bus is down there is a problem for many things, not only Site Nagios, so we should not worry too much about this. It is in fact a very reliable service.
  • E. Imamagic explains that the deployment is very simple installing a few packages that allow you to plug into an existing system: regional, experiment, etc.
  • J. Coles asks what happens with the computation of availability and reliability if the experiment Nagios is down. M. Babik replies that nothing is measured, no data. There is a data loss. It is a rare case though.
  • M. Jouvin asks who is using SSB notifications. J. Flix explains that the CMS people at PIC get notifications from both NAGIOS and SSB, especially SMS notifications.
  • M. Jouvin mentions that site NAGIOS support is on a best-effort basis: would the site NAGIOS team be happy to see its usage increase, and be ready to put more effort into support? M. Babik says that yes, the team would welcome increased usage of Site Nagios and that it may be something for the common monitoring project.

Site Nagios documentation

Conclusion: sites interested are encouraged to test Site Nagios and to report feedback

  • NIKHEF and PIC will be happy to share their configuration/experience
