WLCG Operations Coordination Minutes, November 19th 2015

Highlights

All sites must patch their hosts for the NSS vulnerability as soon as possible, if they have not done so already.

Agenda

https://indico.cern.ch/event/393620/

Attendance

local: Maria Alandes (chair), Andrea Sciabà (minutes), Maarten Litmaath, Maite Barroso Lopez, Andrea Manzi, Marian Babik, Alessandro Di Girolamo
remote: Alessandra Doria, Michael Ernst, Jeremy Coles, Christoph Wissing, David Cameron, Raja Nandakumar, Renaud Vernet, Thomas Hartmann, Dave Mason, Vincenzo Spinoso, Alberto Aimar, Peter Gronbech

Operations News

Middleware News

Useful Links:

Baselines:
- The dCache 2.6.x decommissioning deadline was the end of September. 11 instances are still running, 6 of them used in WLCG. I have discussed with EGI opening tickets against the sites still running old versions. https://wiki.egi.eu/wiki/Software_Calendars#Decommissioning_Calendar_dCache_2.6.X

Issues:
- Critical vulnerability affecting NSS, broadcast by SVG on Friday 6 November (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183). All software whose SSL handshake is based on Mozilla Network Security Services is affected; this includes Red Hat 6 and 7 and their derivatives (for instance, libcurl uses NSS). EGI CSIRT set 2015-11-13 as the deadline for patching hosts. Sites failing to act and/or failing to respond to requests from the EGI CSIRT team risk site suspension.
- This problem affects more than grid services: the CERN Security team also sent an email this week asking all service admins to patch their hosts.

T0 and T1 services
- KIT
  - dCache upgraded to v 2.13.9
- CERN
  - All EOS deployments upgraded to EOS 0.3.135-aquamarine
- JINR
  - dCache upgraded to v 2.10.44

Tier 0 News

The LSF 9 upgrade of the WNs is in QA testing. The ATLAS Tier-0 LSF instance has been upgraded to v9 and the clients are also in QA; ATLAS will decide when to upgrade them.
The HTCondor capacity represents some 5% of the total batch capacity at CERN; we plan to rather quickly move more resources from LSF to HTCondor to reach some 20…25%. The two ARC CEs are declared obsolete.
The Kilo-1 configuration that resulted from performance optimization work jointly done with the cloud team is now running on some 100 lxbatch hosts, so far with very satisfactory results and no indication of any unwanted effect. It will be extended to all hosts when the OpenStack Kilo release is deployed, estimated at the end of November.
IPv6 has been enabled in MyProxy and VOMS for testing purposes, in dual-stack mode (IPv4 and IPv6).

Andrea S. reports that VOMS does not work over IPv6 and he got confirmation from the main developer. He will open a GGUS ticket for this problem.

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

generally normal to high activity
preparations for heavy ion reco jobs:
- important changes in the code and workflow have been implemented to reduce the memory usage
- they were tested with 2011 heavy ion reference data
- if all goes well, for this year's heavy ion data the reco jobs will only need ~2.5 GB RAM
- to be on the safe side, special arrangements were made with the sites that will receive heavy ion raw data
- CNAF, KISTI, KIT and SARA have set up dedicated high memory queues
- at CERN the jobs can request 2 cores and hence have twice the memory
- all setups have been tested with normal jobs
- we thank the sites for the good support!

Maarten adds that only real data taking will show how often events requiring a lot of memory will appear. It is understood that requesting two cores per slot will heavily affect the job CPU efficiency, but this is the price to pay. It might be that at CERN this will not be required, but it is too early to say.

ATLAS

Activity as usual
- new record in parallel running slots: 250k, thanks to opportunistic resources like Sim@P1 and NERSC_Edison (together they contributed more than 50k slots)
Frontier and Squid: during the past few days we observed that some of the jobs we are running now (mc15b campaign) are requesting an excessive amount of conditions data. This is causing trouble for some Squids and Frontier servers. The problem is understood and fixed, and no new tasks like this will be launched. The existing ones are almost over, so we will let them finish.
Heavy Ion data taking: we are ready for it. Since the processing time of HI is huge, we are ready to use the Tier1s/Tier2s for reconstruction as well.
Deletion agents: deletion agents were switched off between Sunday night and Wednesday, to allow time to recover data which was scheduled for deletion but was actually needed by some people. The deletion agents have now been restarted, but they are struggling to keep up with the large number of deletions.
PRODDISK has been decommissioned on all the Tier2s (and on the Tier3s that wanted it).

CMS

Preparations for Heavy Ion running continuing
- No issues so far from the Computing side
Very high load in the system
- Last week sustained ~120k parallel jobs
- Multi-billion events MC RECO campaign ahead
- Situation expected to stay like this for weeks

LHCb

Operations
- Very high activities on distributed computing resources with user and simulation workflows
- Some low levels of Data processing activities ongoing
- LHCb will participate and take data in lead-ion runs until mid December
Issues
- Several days of failures at SARA when the SRM was overloaded by a local user (GGUS:117413, GGUS:117483)
- Issues with tape movers at RRCKI (GGUS:117444, GGUS:117267)
- Security vulnerability reported with LHCb setup script in CVMFS which is sourced before every workflow. Under investigation.
Development / Outlook
- Working on interface to HTCondor-CE

Raja and Maarten clarify that ALICE and LHCb are both in the same situation: their HTCondor-CE plugins are basically ready but not yet in production. Alessandro adds that ATLAS already submitted jobs to the CERN HTCondor-CEs and they are ready to be put in production.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

HTTP Deployment TF

The 5th TF meeting took place on 11th Nov - https://indico.cern.ch/event/459419

Minutes are attached to the agenda.

The TF now has a working Nagios probe, endpoint lists from the experiments, regular monitoring of the infrastructure (see links on agenda) and a GGUS support unit. The TF is thus ready to do a "dry run" of its principal activity, helping sites to get their HTTP storage in shape. In the next couple of weeks we will run with a small group of volunteer sites to test/optimise the process which will then be used to ticket and support all remaining sites.

Information System Evolution

The first draft of the Future Use Cases document is now available for comments. Deadline to provide input is on 24.11. The document will be presented at the December GDB.
There was a TF meeting on 12.11 (Minutes). All the experiments presented their plans to move to GLUE 2.0 and proposals to simplify the interactions with the IS. Several action items were defined after the meeting:
- Define a roadmap to stop publishing GLUE 1.3 in coordination with EGI and OSG.
- Information validation:
  - Document existing validation mechanisms (this is now documented in the TF wiki)
  - Actively validate information that is important for WLCG. Feedback from experiments is needed (especially ATLAS). In particular, validation of the Waiting Jobs GLUE attribute for ALICE has been implemented (SSB).
  - It was agreed that after the feedback collected so far, it doesn't make sense to define a GLUE 2.0 profile for WLCG.
  - There are ongoing discussions with MW officer to integrate glue-validator within the different services running a resource BDII and improve information quality before it gets published. This will be proposed at the URT meeting on 14th December.
- Study the proposal of publishing a subset of the current GLUE schema that is useful for WLCG in JSON/HTTPS. Andrew McNab presented his work on publishing Vac/Vcycle resources using this approach.
Next meeting is on 26.11 (Agenda)

IPv6 Validation and Deployment TF

Middleware Readiness WG

DPM 1.8.10 installed and verified @ Edinburgh for ATLAS
dCache 2.10.44 installed and verified @ TRIUMF for ATLAS
EOS testing @ CERN is paused. The new version Citrine has been installed in pre-prod, but is not yet ready for testing.
BDII verification on CentOS 7 will start next week @ Brunel
MW readiness app v0.3 deployed in prod https://wlcg-mw-readiness.cern.ch/releasenotes/
Next meeting Wednesday 2nd December at 4pm CET.

Multicore Deployment

Network and Transfer Metrics WG

perfSONAR collector, datastore, publisher and dashboard in production (stable operations)
Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
perfSONAR 3.5: 205 sonars were updated, ALL sites are encouraged to enable auto-updates for perfSONAR
Pilot projects: ATLAS PanDA, perfSONAR stream now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics), several Kibana dashboards available - Site link stats. Jorge and Ilija are working on a cost matrix that uses the round-trip time and packet loss in the Mathis formula to infer bandwidth (predictions based on this model will follow).
Pilot projects: LHCb DIRAC bridge is now functional, processing perfSONAR stream and inserting packet loss metrics in DIRAC, includes mapping to LHCb sites. Henryk, Federico and Stefan are working on this.
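
For reference, the Mathis formula mentioned in the ATLAS pilot bounds steady-state TCP throughput by (MSS / RTT) * (C / sqrt(p)), where p is the packet loss rate and C is a constant close to sqrt(3/2). The sketch below only illustrates that relation; the function name, the default MSS of 1460 bytes and the sample values are assumptions made for this example and are not taken from the ATLAS Network Analytics code.

```python
import math


def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(1.5)) -> float:
    """Upper bound on TCP throughput in bit/s from the Mathis formula.

    mss_bytes: maximum segment size in bytes
    rtt_s:     round-trip time in seconds (e.g. from perfSONAR latency tests)
    loss_rate: packet loss probability, 0 < loss_rate <= 1
    c:         model constant, sqrt(3/2) in the basic derivation
    """
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Hypothetical link: 100 ms round-trip time, 0.01% packet loss.
    estimate = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.100, loss_rate=1e-4)
    print(f"estimated upper bound: {estimate / 1e6:.1f} Mbit/s")
```

The bound is undefined for zero measured loss, so a link without observed loss needs separate handling (for example a minimum loss floor) before it can enter a cost matrix.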

RFC proxies

Squid Monitoring and HTTP Proxy Discovery TFs

Nothing to report

Action list

Creation date	Description	Responsible	Status	Comments
2015-06-04	Status of fix for Globus library (`globus-gssapi-gsi-11.16-1`) released in EPEL testing	Andrea Manzi	ONGOING	GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets were opened for incorrect SRM and MyProxy certificates, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well)
2015-10-01	Follow up on reporting of number of processors with PBS	John Gordon	ONGOING
2015-10-01	Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites	SCOD team	ONGOING	A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting
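
Until a shared calendar exists, the manual check described in the last action can be approximated with a small script. The sketch below assumes the OUTAGE downtime records have already been collected from GOCDB and OIM by some other means; it does not call either service, and the `Downtime` record, the function name and the site names are hypothetical.

```python
from datetime import datetime
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Downtime(NamedTuple):
    site: str
    start: datetime
    end: datetime


def concurrent_outages(downtimes: List[Downtime]) -> List[Tuple[Downtime, Downtime]]:
    """Return every pair of downtimes at different sites whose intervals overlap."""
    clashes = []
    for a, b in combinations(downtimes, 2):
        # Two intervals overlap when each one starts before the other ends.
        if a.site != b.site and a.start < b.end and b.start < a.end:
            clashes.append((a, b))
    return clashes


if __name__ == "__main__":
    # Hypothetical downtime records for three Tier-1 sites.
    planned = [
        Downtime("SITE-A", datetime(2015, 12, 1, 8), datetime(2015, 12, 1, 18)),
        Downtime("SITE-B", datetime(2015, 12, 1, 12), datetime(2015, 12, 2, 12)),
        Downtime("SITE-C", datetime(2015, 12, 3, 8), datetime(2015, 12, 3, 10)),
    ]
    for first, second in concurrent_outages(planned):
        print(f"{first.site} overlaps with {second.site}")
```

The pairwise comparison is quadratic in the number of downtimes, which is fine for the handful of Tier-1 sites involved.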

Maarten adds that the host certificate issue is under control or already solved for all sites, except that he is still waiting for feedback from France and will ping them again.

Specific actions for experiments

Creation date	Description	Affected VO	Affected TF	Comments	Deadline	Completion

Specific actions for sites

Creation date	Description	Affected VO	Affected TF	Comments	Deadline	Completion
2015-11-05	ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details	ATLAS	-	-	None	-

AOB

Andrew McNab will take over the coordination of the Machine/Job Features task force.

-- AndreaSciaba - 2015-11-17

Topic revision: r13 - 2018-02-28 - MaartenLitmaath
