WLCG Operations Coordination Minutes - August 21st, 2014

Agenda

Attendance

Local: Andrea Sciabà (secretary)

Operations News

  • CERN-IT plans to terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 September 2014. These services have been replaced by SLC6-based services (lxplus6 and lxbatch6). If you have serious concerns about the termination of these SLC5-based services by this target date, please contact the CERN Service Desk (http://cern.ch/service-portal or service-desk@cern.ch) by 14 September 2014 with details of the use cases affected.
  • A study to assess how operational effort in WLCG is used, and how it could be optimised, will probably be launched in the coming weeks. It will cover the management of sites and site services. It will not cover experiment computing operations, apart from the aspects that are strongly affected by the WLCG infrastructure and how it is operated. More details will be announced as they become available; let us know if you would like to be involved from the start.

Middleware News

  • Baselines:
    • No changes with respect to the previous meeting
    • MW Issues:
      • StoRM and Argus integration issues: StoRM has memory leaks and lacks GridFTP banning via Argus. The memory leak is under investigation, while GridFTP banning is being implemented.
      • APEL fails to parse accounting records (affects APEL 1.2.1): the developers have published a new APEL version, 1.2.2, that fixes the problem (https://github.com/apel/apel/releases/tag/1.2.2-1). Affected sites should install it ASAP.
  • T0 and T1 services
    • NDGF is planning to upgrade to dCache 2.10.3 once it is released.
    • FTS2 decommissioning
      • Done at CERN
      • Decommissioning in progress at PIC and TRIUMF
    • Recent changes:
      • CNAF: xrootd upgrade for ALICE
      • NL_T1 (Nikhef): FAX installation
      • PIC: FAX installation
  • CVMFS upgrade to 2.1.19
    • Almost completed; only 4 GGUS tickets are still open.
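
A site checking whether an installed client meets the 2.1.19 target needs to compare versions numerically, since a plain string comparison would rank "2.1.9" above "2.1.19". The following is a minimal sketch, assuming plain dotted numeric version strings; the function names are illustrative and not part of CVMFS.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_baseline(installed, baseline="2.1.19"):
    """Return True if the installed version is at least the baseline."""
    return parse_version(installed) >= parse_version(baseline)


if __name__ == "__main__":
    # Illustrative version strings only; not taken from any site report.
    for candidate in ("2.1.9", "2.1.19", "2.1.20"):
        print(candidate, meets_baseline(candidate))
```

Version strings with non-numeric suffixes would need extra handling that this sketch does not attempt.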

Oracle Deployment

Tier 0 News

  • Argus: The latest version of Argus deployed in production fixes a bug related to CAs that become unresponsive. It was installed from an RPM taken from GitHub and is not available in any of the official repositories. We have not seen any issues with it in the last month and would like to hear from other sites running this version whether they have, so that it can become a proper release in a proper repository.
  • FTS3: It would be useful to have Nagios probes for FTS3. The EGI Nagios team is preparing a new release, and it would be good to have the probes included in it, since CERN runs a Nagios NGI instance that is released and maintained by EGI.
  • AFS UI: it is old and unmaintained, so we are looking at decommissioning it. Our understanding is that there are alternatives: any SLC6 system can use the native UI, as on lxplus, and the CVMFS WN and UI are being worked on. The proposal is to decommission it at the end of the month. If there are use cases that cannot be changed before that date, please let us know through GGUS/SNOW tickets.
  • lxplus5, lxbatch5: it is proposed to terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 September 2014. These services have been replaced by SLC6-based services (lxplus6 and lxbatch6). If you have serious concerns about the termination of these SLC5-based services by this target date, please contact the CERN Service Desk (http://cern.ch/service-portal or service-desk@cern.ch) by 14 September 2014 with details of the use cases affected.
    • CERN Status Board Information: https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&&n=OTG0013168.
    • The date of end September is not cast in stone, but it would be really inconvenient if we had to run lxplus5 and lxbatch5 beyond the end of Quattor services (currently foreseen for end October), as significant work would be needed to migrate these services to Puppet.
    • Note that the announcement only concerns the lxplus5 and lxbatch5 services; support for Scientific Linux CERN 5 is not affected and continues until March 2017.
    • A few users have already contacted us to point out that they need SLC5 to build their software, as they have not yet completed the port to SLC6. If these needs continue beyond the lxplus5 closure, we will provide users with detailed instructions on how to set up a private virtual machine under OpenStack, on which software can be built.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • steady production and analysis activities throughout the past weeks

ATLAS

CMS

  • Processing overview:
    • Finishing samples for CSA14
    • Mainly waiting for new requests within coming weeks
  • CSA14 (Computing, Software and Analysis challenge 2014)
    • Extended by two weeks until mid-September
    • CRAB3
      • Gained experience through user feedback
      • Load through actual user submission small, but backfill from HammerCloud
    • AAA
      • Users rather happy with remote access possibility
      • Increase scale over time
    • Exercise Dynamic Data Placement and Cache Release
    • miniAOD
      • Well received by user community
      • Planning re-running of all miniAOD production in September
  • SL5 lxplus
    • Proposed date for end of lxplus5 (Sep 30th) might be too early
      • Still required for older releases to build libraries
    • In discussion with CMS physics community
  • FTS3
    • All relevant sites migrated
  • VO card update process
    • Requires verification - who does it?
  • Reminder for sites:
    • Need to change xrootd redirectors, see this hn post
    • Need to adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> (e.g. value="T1_DE_KIT_Disk") in the <local-stage-out> section, and the same element (but with the PhEDEx name of the fallback endpoint) in <fallback-stage-out>
    • Need to upgrade to CVMFS >= 2.1.19 immediately
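
The site-local-config.xml reminder above can be checked mechanically. The following is a minimal sketch that reports which stage-out sections lack a phedex-node element. The sample document is hypothetical and heavily reduced (the site name and the overall layout around the two stage-out sections are assumptions), and the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced sample; a real site-local-config.xml
# contains more elements than are shown here.
SAMPLE = """<site-local-config>
  <site name="T1_DE_KIT">
    <local-stage-out>
      <phedex-node value="T1_DE_KIT_Disk"/>
    </local-stage-out>
    <fallback-stage-out>
    </fallback-stage-out>
  </site>
</site-local-config>
"""

SECTIONS = ("local-stage-out", "fallback-stage-out")


def missing_phedex_nodes(xml_text):
    """Return the names of stage-out sections without a non-empty phedex-node value.

    Only sections that are present in the document are checked; a section
    that is absent altogether is not reported.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for name in SECTIONS:
        for section in root.iter(name):
            node = section.find("phedex-node")
            if node is None or not node.get("value"):
                missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_phedex_nodes(SAMPLE))
```

With the sample above, the fallback section is reported because it has no phedex-node child.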

LHCb

  • Operations
    • Low activity, mainly Monte Carlo simulation and user jobs
  • WMS decommissioning
    • To probe the ARC CEs at several UK sites, the SAM/Nagios probes are now submitted via a WMS instance at RAL-LCG2. It was confirmed that this WMS instance will be kept in production, also for this purpose, at least until 2015. In the long term, the possibility of using a probe from the NorduGrid team for direct submission and execution of the different WN payloads will be checked.
  • IPv6
    • First basic tests, e.g. for setting up the runtime environment, successfully completed.

Ongoing Task Forces and Working Groups

Tracking Tools Evolution TF

FTS3 Deployment TF

gLExec Deployment TF

  • NTR

Machine/Job Features

  • Unfortunately for the TF, the developer and maintainer of the Condor implementation will leave the HEP community. OSG was asked whether somebody could join the TF to continue the work.

Middleware Readiness WG

  • The pilot component for MW readiness (DPM) was successfully verified with ATLAS and CMS workflows for both installations, at GRIF and Edinburgh.
  • A new version of the WLCG Package Reporter was released. We kindly ask that the T0 pre-production machines be equipped with the Package Reporter, in order to test its scalability and to enlarge the package database.
  • Discussion has started with the Legnaro site manager about installing the new BDII update 9 and CREAM CE 1.6.3 for CMS verification. The site manager is going to upgrade to the latest version of the software next week, while A. Sciabà is working on the definition of the CMS workflow for the verification.
  • The design of the MW readiness software has started. It will use package information from either the WLCG Package Reporter or Pakiti.
  • Mark your diaries! Next meeting on October 1st at CERN, with audioconf and Vidyo. Provisional Agenda here

Multicore Deployment

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • compliance of the WLCG infrastructure was to be tested with the SAM preprod instances
      • so far done only for ALICE, since the early hours of July 23
      • LHCb next?
    • latest results for ALICE show:
      • OK tests for half the sites
      • many failures due to simple configuration issues of CREAM and/or Argus
      • no new showstopper so far
    • proposal:
      • next broadcast on Monday
      • hard deadline: Mon Sep 15
        • normal jobs and SAM tests will start using the new VOMS servers as of that date
        • the old VOMS servers will continue to be used in parallel
      • sites that fail the SAM preprod tests by the end of Aug will be ticketed
      • we need to verify all experiment workflows with the new servers ASAP!

WMS Decommissioning TF

  • Condor validation
    • All issues (34 in total) were resolved and both ATLAS and CMS are ready for production
    • Deployment to production is planned on Wed 1st of October 2014

IPv6 Validation and Deployment TF

  • Ran some tests on a pure IPv6 EMI-3 UI in Oxford (thanks to Ewan MacMahon). The following was observed:
    • CVMFS works fine
    • voms-proxy-init (from the new Java-based client) doesn't seem to work with a dual-stack VOMS server; to be investigated
    • lcg-info and lcg-infosites work over IPv6 if they are patched to add "use Net::INET6Glue;"; to be fixed in a future release if possible
    • lcg-cr tested to work (as expected) over IPv6 provided that GLOBUS_FTP_CLIENT_IPV6=true is set
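
The last observation depends on an environment variable being set for the transfer client. The following is a minimal sketch of preparing such an environment for a child process; only the variable name and value come from the test above, while the helper name is illustrative and the actual lcg-cr command line is deliberately left out.

```python
import os


def ipv6_transfer_env(base_env=None):
    """Return a copy of the environment with GLOBUS_FTP_CLIENT_IPV6 set to true."""
    env = dict(os.environ if base_env is None else base_env)
    env["GLOBUS_FTP_CLIENT_IPV6"] = "true"
    return env


if __name__ == "__main__":
    env = ipv6_transfer_env()
    print(env["GLOBUS_FTP_CLIENT_IPV6"])
    # The resulting mapping could be passed to a child process through the
    # env argument of subprocess.run; the command itself is not shown here.
```

Working on a copy leaves the environment of the calling process unchanged, so other tools started from the same session are not affected.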

Squid Monitoring and HTTP Proxy Discovery TFs

  • Reactivated the Squid Monitoring TF to track its task list, which was quite far from being completed.
    • Two new members added: Costin Grigoras (ALICE) and David Crooks (NGI_UK)
    • Basic MRTG monitoring via registered squids in GOCDB/OIM is now almost completed
    • Soon need to start a campaign to get sites to register their squids -- how to do that?
    • Meeting scheduled for 28 August to discuss Costin's new squid monitor based on MonALISA
  • Proxy Discovery TF will be able to make some progress on its task list after sufficient squids are registered -- estimated dates were pushed out one month
    • Costin Grigoras added as a member to this TF too

Network and Transfer Metrics WG

  • Updated WG page with list of members, task tracking, coming events and reports (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics)
  • Kick-off meeting will take place on Mon 8th of Sept at 3PM CEST
  • perfSONAR Toolkit 3.4rc2 became available for testing on July 21st. Version 3.4 is a major milestone for the WG, as it enables access via a REST API and introduces several important performance improvements; a deployment campaign will therefore follow once a stable release is available.
  • Work is progressing on the WLCG perfSONAR configuration interface (finalized design, work is ongoing on a prototype implementation)
  • OSG perfSONAR datastore plan has been agreed and testing of the store based on esmond is ongoing

WLCG perfSONAR service level report on 2014-08-20 16:59:32.876708

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 6
   3.3.1 : 3
   3.3.2 : 174
   Unknown: 27
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 30
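
The figures in the report above can be turned into structured data for tracking over time. The following is a minimal sketch, assuming the report keeps this exact "label: number" layout with indented per-version lines; the function name and the parsing rules are illustrative and not part of any perfSONAR or WLCG tool.

```python
import re

# Report body copied from the minutes.
REPORT = """\
perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
   3.2.2 : 6
   3.3.1 : 3
   3.3.2 : 174
   Unknown: 27
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 30
"""

LINE = re.compile(r"^(\s*)(.+?)\s*:\s*(\d+)\s*$")


def parse_report(text):
    """Split the report into top-level totals and per-version counts."""
    totals, versions = {}, {}
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # lines without a trailing number, e.g. the versions header
        indent, key, value = match.groups()
        # Indented lines are treated as per-version counts.
        (versions if indent else totals)[key] = int(value)
    return totals, versions


if __name__ == "__main__":
    totals, versions = parse_report(REPORT)
    monitored = totals["perfSONAR instances monitored"]
    for version, count in versions.items():
        print(f"{version}: {count} ({100.0 * count / monitored:.1f}% of monitored)")
```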

Action list

AOB

-- MariaALANDESPRADILLO - 20 Jun 2014

Topic revision: r18 - 2014-08-22 - AndreaSciaba
 