WLCG Operations Coordination Minutes, May 21st 2015


Agenda

Attendance

  • local: Oliver Keeble, Marian Babik, Maarten Litmaath, Andrea Sciaba, Maite Barroso, Maria Alandes, Alberto Aimar, David Cameron, Maria Dimou
  • remote: Christoph Wissing, Jeremy Coles, Pepe Flix, Massimo Sgaravatto, Renaud Vernet, Thomas Hartmann, Antonio Maria Perez Calero Yzquierdo, Alessandra Forti, Gareth Smith, Catherine Biscarat

Operations News

Middleware News

  • Baselines:
    • EMI update released today containing StoRM 1.11.8. This version has already been verified by MW Readiness and will be set as the baseline as soon as it gets into UMD (end of May)
    • dCache 2.10.28/2.12.8 verified by MW Readiness and set as the baseline (fixes for a DB leak)
    • As discussed in the previous meeting, Torque 2.5.13 has been added to the baselines table

  • MW Issues:
    • NTR

  • T0 and T1 services
    • CERN
      • CASTOR for LHC has been updated to 2.1.15. Small delta releases (2.1.15-8) are being rolled out
      • SRM validation is ongoing. xroot is now the main access protocol (RFIO is obsolete; its possible decommissioning will be discussed at the end of 2015, to take place in 2016 or later)
    • KIT
      • Updated all dCache setups to 2.11.19 last week; a very urgent update due to a leak in the Chimera database
    • IN2P3
      • plan to upgrade dCache to 2.10.30+ on core servers (16/06/2015)
    • RRC-KI-T1
      • dCache upgrade to 2.10.29

Tier 0 News

Pepe asks for a clarification about what functional tests mean. Maite confirms these refer to CMS SAM tests.

Maarten asks whether there is any problem in EGI with LFC decommissioning at CERN. Maite confirms EGI doesn't rely on CERN LFC. It was checked with them months ago.

Tier 1 Feedback

None

Tier 2 Feedback

None

Experiments Reports

ALICE

  • normal to high activity
  • no major operational issues

ATLAS

  • Very high activity (> 200k running job slots today)
  • Only possible through mix of single and multi-core jobs (see later)
  • Running final stress tests of data transfer this week (inside CERN and outside) before data taking
  • Request T1 sites to avoid major downtimes from now until late summer

CMS

  • Production overview
    • Finished with Upgrade DIGI-RECO (for now)
      • More memory-intensive production compared to usual workloads
      • Needed to run in multi-core pilots without using all cores
    • Finally started Run2 DIGI-RECO campaign
      • Will keep resources rather busy for next weeks/months
      • Successfully extended to stronger Tier-2 sites

  • Global Xrootd re-director at CERN
    • Is an important component for CMS
    • Increased usage and higher dependency on the service
      • Many users requesting files via Global redirector
      • Production jobs occasionally sent to sites not hosting data
    • CMS requests that the impact be increased to 8 (from 5) and the urgency to 6 (from 5) for this WLCG critical service
    • A rather long iteration was recently needed to get a configuration settled: GGUS:113032

Pepe asks for a clarification on the usage of multicore pilots not using all of the cores. Christoph explains that there was indeed a mix of single core and multi core jobs.

Maarten suggests that the requested change to the global xrootd redirector should be recorded in the Critical Services Table. Maite will follow up but she explains this is already in the list of critical services.

Pepe asks how changes to the critical services table should be communicated. There was a meeting in December to define this; Maite adds that it shouldn't change very often in any case.
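As an aside on the access pattern behind this increased usage, the following is a minimal sketch of opening a file through a global xrootd redirector with the XRootD Python bindings; the redirector host name and the /store path are illustrative assumptions, not values taken from these minutes.

    # Minimal sketch: read a file via a global xrootd redirector using the
    # XRootD Python bindings. Host and file name below are placeholders.
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    REDIRECTOR = "cms-xrd-global.cern.ch"      # assumed global redirector alias
    LFN = "/store/example/path/file.root"      # hypothetical file name

    f = client.File()
    status, _ = f.open("root://%s/%s" % (REDIRECTOR, LFN), OpenFlags.READ)
    if not status.ok:
        raise RuntimeError("open via redirector failed: %s" % status.message)

    # Read the first kilobyte to confirm that data actually comes back from
    # whichever site the redirector resolved the file to.
    status, data = f.read(offset=0, size=1024)
    print(len(data), "bytes read")
    f.close()

This is the same mechanism used both by users requesting files and by production jobs occasionally sent to sites not hosting the data.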

LHCb

  • LHCb Computing workshop is going on, so not much operational follow-up
  • Downtime today, though, due to the Oracle DB upgrade
  • T1
    • Problem contacting the SARA SRM; seems to be back now. They claim that the fetch-crl problem could now be on the CERN side. An upgrade of the FTS servers and our VOBOX is planned.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • SAM-Nagios refresh_proxy probe should handle RFC proxies transparently when UMD-3 is used
    • preprod already has UMD-3, prod still has UMD-2
    • we will try this out on ALICE preprod
  • the readiness of sites can then be observed per experiment
    • no failures are expected to be due to the change of proxy type
  • when the sites are checked OK, the central services of each experiment are next
    • plus pilot factories at T1 etc.
  • experiment status
    • ALICE: done
    • ATLAS:
      • check central services
      • ensure all pilot factories run a recent Condor-G with a UMD-3 CREAM client
    • CMS: ditto
    • LHCb: check DIRAC

Pepe asks whether the experiments have given any feedback on the timeline to implement this. Maarten explains there is no timeline yet, but in any case it is not critical. He adds that in the future a new version of the VOMS clients will be needed, which will require coordination with EGI.

David Cameron asks whether MW services are certified to work with RFC proxies. Maarten confirms this is the case. Pepe asks whether that means that all MW services deployed at sites are able to support RFC proxies. Maarten explains that this should be the case if the MW version is the one currently supported and not an old version. But normally EGI follows up on sites running obsolete versions.
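As a practical aside on the proxy types discussed above, the sketch below classifies a proxy certificate as RFC 3820 or legacy by checking for the proxyCertInfo extension; it is an illustration only, using the cryptography Python package, and the proxy file path is a placeholder (voms-proxy-info -type reports the same information from the command line).

    # Illustration only: classify a Grid proxy as RFC 3820 vs. legacy by
    # inspecting the first certificate in the proxy file (path is a placeholder).
    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    RFC3820_OID = x509.ObjectIdentifier("1.3.6.1.5.5.7.1.14")      # id-pe-proxyCertInfo
    DRAFT_OID = x509.ObjectIdentifier("1.3.6.1.4.1.3536.1.222")    # pre-RFC "draft" proxies

    with open("/tmp/x509up_u1000", "rb") as f:                     # placeholder proxy path
        pem = f.read()

    cert = x509.load_pem_x509_certificate(pem, default_backend())  # first cert is the proxy
    oids = {ext.oid for ext in cert.extensions}

    if RFC3820_OID in oids:
        print("RFC 3820 proxy")
    elif DRAFT_OID in oids:
        print("draft (pre-RFC) proxy")
    else:
        print("legacy Globus proxy (or not a proxy at all)")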

Machine/Job Features

Middleware Readiness WG


  • Very good work on the Readiness verification of dCache v2.10.28 by TRIUMF; as a result, this version is now part of the Baseline.
  • Progress is being made on the StoRM verification for ATLAS at CNAF. The workflow documentation is at hand. All relevant experts are in sync and the follow-up is in JIRA ticket MWREADY-61.
  • The MW Officer records progress on the MW Readiness App development in JIRA ticket MWREADY-59. See the presentation from the last WG meeting on May 6th. This tool will eventually replace today's manually maintained Baseline table.

Multicore Deployment

  • ATLAS deployment:
    • Goal: 80% of production resources usable by multicore. This corresponds roughly to 150k slots. The max achieved has been 110k slots. It is not a problem of sites not having a queue as only 9 smaller sites are still missing. It's a matter of tuning the system.
    • ATLAS is looking at increasing the job length to about 10-15 h for 8-core slots to improve efficiency, and is also looking at a global fairshare to avoid sending too many low-priority single-core jobs when there are high-priority multicore jobs in the system.
    • Sites should instead look at their setup and try to prioritize multicore over single core. The base TF solutions can be found here. In particular Torque, HTCondor and SGE all have a solution.
      • Sites that are already dynamically configured but still cap the multicore slots should also revise this number to respect the 80% share. As a reminder, the shares for ATLAS jobs are as follows: T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2s (see the worked example at the end of this section).
  • CMS status:
    • Consolidating deployment at T1s: working together with CNAF and CCIN2P3 to improve multicore resource allocation results
    • Analysis of multicore pilots internal (in)efficiencies ongoing

Pepe asks how many sites are offering static multicore resources for ATLAS. Alessandra says that concrete numbers could be checked by looking at the monitoring tools, but it is not the case for most sites.
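For completeness, a small worked example of the share targets quoted in the ATLAS bullet points above (the percentages come from the minutes; the script itself is only an illustration):

    # Worked example of the ATLAS multicore share targets quoted above:
    # multicore should get 80% of the production share at each tier.
    MULTICORE_FRACTION = 0.80
    production_share = {"T1": 0.95, "T2": 0.50}   # remainder is analysis (5% / 50%)

    for tier, prod in production_share.items():
        target = MULTICORE_FRACTION * prod
        print("%s: multicore target = %.0f%% of the ATLAS share" % (tier, 100 * target))

    # Prints: T1: 76% of the ATLAS share, T2: 40% of the ATLAS share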

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

  • No news; the main developers haven't been able to give it any time lately because of other priorities

Network and Transfer Metrics WG


  • perfSONAR status
    • Security: New SSL vulnerability dubbed Logjam: https://weakdh.org/sysadmin.html. WLCG perfSONAR hosts should NOT be vulnerable to this attack. The Apache configuration installed by the Toolkit disables the cipher suites in question by default.
  • Network performance incidents process: a new GGUS SU (WLCG Network Throughput) will become available on the 24th of June.
  • Next meeting on the 3rd of June (https://indico.cern.ch/event/382624/). The plan is to focus on the latency ramp-up and the proximity service.

Pepe asks whether this new GGUS SU is the preferred channel to report issues. Marian confirms this is the case.

Pepe asks how many incidents have been discussed so far. Marian replies that only one has been discussed, the AGLT2 to SARA incident.
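Related to the Logjam item above, a quick way to see which cipher suite a given host negotiates with a modern client is sketched below; the host name is a placeholder and this is only a sanity check, not a vulnerability scan.

    # Report the cipher suite negotiated by a host with a default (modern)
    # client context. This does NOT probe for export-grade DHE; it only shows
    # what a well-behaved client obtains. Host name is a placeholder.
    import socket
    import ssl

    HOST, PORT = "perfsonar.example.org", 443

    ctx = ssl.create_default_context()
    with socket.create_connection((HOST, PORT), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            name, proto, bits = tls.cipher()
            print("negotiated %s over %s (%d-bit key)" % (name, proto, bits))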

HTTP Deployment TF

The first TF meeting has taken place; the minutes are attached to the agenda, all reachable from the TF home page: https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment

Some first steps were agreed, in particular identification of an appropriate monitoring solution and the compilation of a full list of functionality that the experiments would like to see delivered via HTTP. The next meeting will focus on the latter topic.

Maria Alandes asks how the monitoring is going to be organised and where the HTTP endpoints are going to be taken from. Oliver explains this hasn't been decided yet. At least LHCb and ATLAS have these defined in their configuration DBs. Maria adds that Alessandro has been in touch with her to check whether HTTP endpoints could be taken from the BDII instead of being manually inserted in AGIS by sys admins. She can report on her findings to the TF. Oliver explains this could indeed be interesting, but the configuration DBs show the endpoints that are actually used by the experiments and the TF should focus on those.
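As an illustration of the BDII option Maria mentions, the sketch below queries a top-level BDII over LDAP for GLUE 1.3 storage access-protocol entries of type https; the BDII host, the filter and the attribute names are assumptions to be checked against the published schema, and the ldap3 Python package is used.

    # Sketch only: list https access-protocol endpoints published in a
    # top-level BDII, assuming GLUE 1.3 GlueSEAccessProtocol objects.
    from ldap3 import Server, Connection, ALL

    server = Server("ldap://lcg-bdii.cern.ch:2170", get_info=ALL)   # assumed top-level BDII alias
    conn = Connection(server, auto_bind=True)                       # anonymous, read-only

    conn.search(
        search_base="o=grid",
        search_filter="(&(objectClass=GlueSEAccessProtocol)"
                      "(GlueSEAccessProtocolType=https))",
        attributes=["GlueSEAccessProtocolEndpoint", "GlueChunkKey"],
    )

    for entry in conn.entries:
        attrs = entry.entry_attributes_as_dict
        print(entry.entry_dn, attrs.get("GlueSEAccessProtocolEndpoint", []))

Whether such a query or the experiments' configuration DBs should feed the monitoring is exactly the point Oliver notes is still to be decided.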

Action list

Description: CMS instructions to shifters to be changed so that tickets are not opened if just one CE is red.
Responsible: C. Wissing
Status: CLOSED
Comments: Christoph explains this is now understood among the different parties involved and the action can be closed.

AOB

-- MariaALANDESPRADILLO - 2015-05-06

-- MariaALANDESPRADILLO - 2015-05-20
