WLCG Operations Coordination Minutes, November 19th 2015

Highlights

  • All sites must patch their hosts for the NSS vulnerability as soon as possible, if they have not done so already.

Agenda

Attendance

  • local: Maria Alandes (chair), Andrea Sciabà (minutes), Maarten Litmaath, Maite Barroso Lopez, Andrea Manzi, Marian Babik, Alessandro Di Girolamo
  • remote: Alessandra Doria, Michael Ernst, Jeremy Coles, Christoph Wissing, David Cameron, Raja Nandakumar, Renaud Vernet, Thomas Hartmann, Dave Mason, Vincenzo Spinoso, Alberto Aimar, Peter Gronbech

Operations News

Middleware News

  • Issues:
    • Critical vulnerability affecting NSS, broadcast by SVG on Friday 6 November (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183). All software whose SSL handshake is based on Mozilla Network Security Services is affected; this includes Red Hat 6 and 7 and their derivatives (for instance, libcurl uses NSS). EGI CSIRT set 2015-11-13 as the deadline for patching the hosts. Sites failing to act and/or failing to respond to requests from the EGI CSIRT team risk site suspension.
    • This problem does not affect only grid services: the CERN Security team also sent an email this week asking all service admins to patch their hosts.

  • T0 and T1 services
    • KIT
      • dCache upgraded to v 2.13.9
    • CERN
      • All EOS deployments upgraded to EOS 0.3.135-aquamarine
    • JINR
      • dCache upgraded to v 2.10.44

Tier 0 News

  • The LSF 9 upgrade of the WNs is in QA testing. The ATLAS Tier-0 LSF instance has been upgraded to v9 and the clients are also in QA; ATLAS will decide when to upgrade them.
  • The HTCondor capacity represents some 5% of the total batch capacity at CERN; we plan to move more resources from LSF to HTCondor rather quickly, to reach some 20…25%. The two ARC CEs are declared obsolete.
  • The Kilo-1 configuration that resulted from performance optimization work done jointly with the cloud team is now running on some 100 lxbatch hosts, so far with very satisfactory results and no indication of any unwanted effect. It will be extended to all hosts when the OpenStack Kilo release is deployed, estimated for the end of November.
  • IPv6 enabled in MyProxy and VOMS for testing purposes, in dual-stack mode (IPv4 and IPv6).

Andrea S. reports that VOMS does not work over IPv6 and he got confirmation from the main developer. He will open a GGUS ticket for this problem.
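
As a rough illustration of how a dual-stack endpoint can be checked from a client, the sketch below lists the address families a host name resolves to and attempts a plain TCP connection over IPv6 only. The host name and port in the usage note are placeholders, not the real CERN endpoints, and a successful TCP connection only shows network reachability: it would not detect an application-level problem such as the one reported for VOMS.

```python
import socket


def address_families(host, port, resolver=socket.getaddrinfo):
    """Return the set of IP families ('IPv4', 'IPv6') that host resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # Name does not resolve at all.
        return set()
    return {names[info[0]] for info in infos if info[0] in names}


def can_connect_ipv6(host, port, timeout=5.0):
    """Try a TCP connection using IPv6 addresses only; True if one succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        # No AAAA record (or no IPv6 resolution available).
        return False
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            continue
    return False
```

For example, `address_families("voms.example.org", 8443)` (hypothetical name and port) would be expected to contain both "IPv4" and "IPv6" for a dual-stack service, and `can_connect_ipv6` can then be run from an IPv6-capable client.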

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • generally normal to high activity
  • preparations for heavy ion reco jobs:
    • important changes in the code and workflow have been implemented to reduce the memory usage
    • they were tested with 2011 heavy ion reference data
    • if all goes well, for this year's heavy ion data the reco jobs will only need ~2.5 GB RAM
    • to be on the safe side, special arrangements were made with the sites that will receive heavy ion raw data
    • CNAF, KISTI, KIT and SARA have set up dedicated high memory queues
    • at CERN the jobs can request 2 cores and hence have twice the memory
    • all setups have been tested with normal jobs
    • we thank the sites for the good support!

Maarten adds that only real data taking will show how often events requiring a lot of memory will appear. It is understood that requesting two cores per slot will heavily affect the job CPU efficiency, but this is the price to pay. It might be that this will not be required at CERN, but it is too early to say.

ATLAS

  • Activity as usual
    • New record in parallel running slots: 250k, thanks to the impact of opportunistic resources like Sim@P1 and NERSC_Edison (together they contributed more than 50k slots).
  • Frontier and Squid: during the past few days we observed that some of the jobs we are running now (mc15b campaign) are requesting an excessive amount of conditions data. This is creating trouble for some Squids and Frontier servers. The problem is understood and fixed, and no new tasks like this will be launched. The existing ones are almost over, so we will let them finish.
  • Heavy Ion data taking: we are ready for it. Since the processing time of HI is huge, we are ready to use the Tier1s/Tier2s for reconstruction as well.
  • Deletion agents: deletion agents were switched off between Sunday night and Wednesday, to allow time to recover data which was scheduled for deletion but was actually needed by some people. The deletion agents have now been restarted, but they are struggling to keep up with the high number of deletions.
  • PRODDISK has been decommissioned on all the Tier2s (and on the Tier3s that wanted it).

CMS

  • Preparations for Heavy Ion running continuing
    • No issues so far from the Computing side
  • Very high load in the system
    • Last week sustained ~120k parallel jobs
    • Multi-billion events MC RECO campaign ahead
    • Situation expected to stay like this for weeks

LHCb

  • Operations
    • Very high activities on distributed computing resources with user and simulation workflows
    • Some low levels of Data processing activities ongoing
    • LHCb will participate and take data in lead-ion runs until mid December
  • Issues
    • Several days of failures at SARA when the SRM was overloaded by a local user (GGUS:117413, GGUS:117483)
    • Issues with tape movers at RRCKI (GGUS:117444, GGUS:117267)
    • Security vulnerability reported with LHCb setup script in CVMFS which is sourced before every workflow. Under investigation.
  • Development / Outlook
    • Working on interface to HTCondor-CE

Raja and Maarten clarify that ALICE and LHCb are both in the same situation: their HTCondor-CE plugins are basically ready but not yet in production. Alessandro adds that ATLAS already submitted jobs to the CERN HTCondor-CEs and they are ready to be put in production.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

HTTP Deployment TF

The 5th TF meeting took place on 11th Nov - https://indico.cern.ch/event/459419

Minutes are attached to the agenda.

The TF now has a working Nagios probe, endpoint lists from the experiments, regular monitoring of the infrastructure (see links on agenda) and a GGUS support unit. The TF is thus ready to do a "dry run" of its principal activity, helping sites to get their HTTP storage in shape. In the next couple of weeks we will run with a small group of volunteer sites to test/optimise the process which will then be used to ticket and support all remaining sites.

Information System Evolution


  • The first draft of the Future Use Cases document is now available for comments. Deadline to provide input is on 24.11. The document will be presented at the December GDB.
  • There was a TF meeting on 12.11 (Minutes). All the experiments presented their plans to move to GLUE 2.0 and proposals to simplify the interactions with the IS. Several action items were defined after the meeting:
    • Define a roadmap to stop publishing GLUE 1.3 in coordination with EGI and OSG.
    • Information validation:
      • Document existing validation mechanisms (this is now documented in the TF wiki)
      • Actively validate information that is important for WLCG. Feedback from experiments is needed (especially ATLAS). In particular, validation of the Waiting Jobs GLUE attribute for ALICE has been implemented (SSB).
      • It was agreed that, given the feedback collected so far, it does not make sense to define a GLUE 2.0 profile for WLCG.
      • There are ongoing discussions with the MW officer to integrate glue-validator within the different services running a resource BDII and improve information quality before it gets published. This will be proposed at the URT meeting on 14th December.
    • Study the proposal of publishing a subset of the current GLUE schema that is useful for WLCG in JSON/HTTPS. Andrew McNab presented his work on publishing Vac/Vcycle resources using this approach.
  • Next meeting is on 26.11 (Agenda)

IPv6 Validation and Deployment TF


Middleware Readiness WG


  • DPM 1.8.10 installed and verified @ Edinburgh for ATLAS
  • dCache 2.10.44 installed and verified @ TRIUMF for ATLAS
  • EOS testing @ CERN is paused. The new version Citrine has been installed in pre-prod, but is not yet ready for testing.
  • BDII verification on CENTOS7 will start next week @ Brunel
  • MW readiness app v0.3 deployed in prod https://wlcg-mw-readiness.cern.ch/releasenotes/
  • Next meeting Wednesday 2nd December at 4pm CET.

Multicore Deployment

Network and Transfer Metrics WG


  • perfSONAR collector, datastore, publisher and dashboard in production (stable operations)
  • Additional monitoring metrics will be added to psomd.grid.iu.edu to capture collector's efficiency and report on freshness of the metadata in the OSG Datastore (for each sonar).
  • perfSONAR 3.5: 205 sonars were updated; ALL sites are encouraged to enable auto-updates for perfSONAR.
  • Pilot projects: ATLAS PanDA. The perfSONAR stream is now in ATLAS Network Analytics (https://twiki.cern.ch/twiki/bin/view/AtlasComputing/ATLASAnalytics) and several Kibana dashboards are available (site link stats). Jorge and Ilija are working on a cost matrix that uses the round-trip time and packet loss in Mathis's formula to infer bandwidth (predictions based on this model will follow).
  • Pilot projects: the LHCb DIRAC bridge is now functional. It processes the perfSONAR stream and inserts packet loss metrics in DIRAC, including a mapping to LHCb sites. Henryk, Federico and Stefan are working on this.
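
For reference, the textbook form of the Mathis formula bounds the throughput of a single TCP stream by (MSS / RTT) · (C / √p), where p is the packet loss rate and C ≈ √(3/2). The sketch below only evaluates that bound; it is not the cost-matrix code being developed, and the MSS, RTT and loss values in the usage note are illustrative assumptions.

```python
import math

# Constant of the Mathis et al. model for periodic loss, about 1.22.
MATHIS_C = math.sqrt(3.0 / 2.0)


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Upper bound on single-stream TCP throughput, in bits per second.

    mss_bytes: maximum segment size in bytes
    rtt_s:     round-trip time in seconds
    loss_rate: packet loss probability, 0 < loss_rate <= 1
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss_rate))
```

With an assumed MSS of 1460 bytes, an RTT of 100 ms and a loss rate of 10^-4, `mathis_throughput_bps(1460, 0.1, 1e-4)` gives roughly 14 Mbit/s. Because the bound scales as 1/√p, a fourfold increase in packet loss halves the estimate.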

RFC proxies

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets were opened for incorrect SRM and MyProxy certificates, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well) |
| 2015-10-01 | Follow up on reporting of number of processors with PBS | John Gordon | ONGOING | |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |
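
Until a shared calendar exists, the check for concurrent OUTAGE downtimes amounts to an interval-overlap test on the downtimes read from GOCDB and OIM. The sketch below assumes the downtimes have already been collected by hand as (site, start, end) tuples; it does not query GOCDB or OIM, and the site names in the usage note are placeholders.

```python
from datetime import datetime
from itertools import combinations


def concurrent_outages(downtimes):
    """Return overlapping downtime pairs between different sites.

    downtimes: list of (site, start, end) tuples with datetime bounds.
    Each result is (site_a, site_b, overlap_start, overlap_end).
    """
    clashes = []
    for (site_a, start_a, end_a), (site_b, start_b, end_b) in combinations(downtimes, 2):
        # Two intervals overlap when each one starts before the other ends.
        if site_a != site_b and start_a < end_b and start_b < end_a:
            clashes.append((site_a, site_b, max(start_a, start_b), min(end_a, end_b)))
    return clashes
```

For example, with hypothetical entries for "T1_A" (08:00 to 12:00) and "T1_B" (10:00 to 14:00) on the same day, the function reports a single clash from 10:00 to 12:00.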

Maarten adds that the host certificate issue is under control or already solved for all sites, except that he is still awaiting feedback from France and will ping them again.

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details | ATLAS | - | - | None | - |

AOB

Andrew McNab will take over the coordination of the Machine/Job Features task force.

-- AndreaSciaba - 2015-11-17

Topic revision: r13 - 2018-02-28 - MaartenLitmaath
 