WLCG Operations Coordination Minutes - August 21st, 2014

Agenda

Attendance

Local: Andrea Sciabà (secretary), Stefan Roiser, Marian Babik, Maite Barroso, Andrea Manzi, Marcin Blaszczyk

Remote: Josep Flix (chair), Yury Lazin (RRC-KI-T1), Dave Dykstra, Christoph Wissing, Frederique Chollet, Hung-Te Lee, Alexey Sedov, Di Qing, Burt Holzman, Alessandro Cavalli (INFN-T1)

Operations News

  • CERN-IT will terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 September 2014. These services have been replaced by SLC6-based services (lxplus6 and lxbatch6). If there are serious concerns about the termination of these SLC5-based services by this target date, please contact the CERN Service Desk (http://cern.ch/service-portal or service-desk@cern.ch) by 14 September 2014 with details of the use cases affected.
  • A study to assess how operational effort in WLCG is used and could be optimised will probably be launched in the coming weeks. It will cover the management of sites and site services. It will not cover the experiment computing operations, apart from the aspects which are strongly affected by the WLCG infrastructure and how it is operated. More details will be announced as they become available; let us know if you would like to be involved from the start.

Middleware News

  • Baselines:
    • No changes with respect to the previous meeting
    • MW Issues:
      • StoRM and Argus integration issues: there are memory leaks in StoRM, and GridFTP banning via Argus is missing. The memory leak is under investigation, while GridFTP banning is being implemented.
      • APEL fails to parse accounting records (affects APEL 1.2.1): a new APEL version (1.2.2) with a fix for the problem has been published by the developers (https://github.com/apel/apel/releases/tag/1.2.2-1), to be installed ASAP by the sites affected by the issue.
  • T0 and T1 services
    • NDGF is planning to upgrade to dCache 2.10.3 once it is released.
    • FTS2 decommissioning
      • Done at CERN
      • Decommissioning in progress at PIC and TRIUMF
    • Recent changes:
      • CNAF: xrootd upgrade for ALICE
      • NL_T1 (Nikhef): FAX installation
      • PIC: FAX installation
  • CVMFS upgrade to 2.1.19
    • Almost completed, only 4 GGUS tickets still open.

Maite asks which sites still have FTS2. The answer is in the twiki. Concerning FTS3, only a few Tier-1s will install it and, apart from production instances, there will be instances to be used in emergencies or by other VOs.

Andrea M. adds that, concerning the CVMFS client deployment, only 2 CMS sites, 1 ATLAS and 1 ALICE site are still missing.

Concerning the APEL problem, only a few sites have been affected because only a few sites installed 1.2.1, which was released only in mid-August. It is not clear whether it affects all batch systems or only some.

Oracle Deployment

Marcin announces that IT-DB will perform a new round of hardware installations both in the Meyrin computer centre and at Wigner, migrating a number of databases to new locations, as was done in the spring. This concerns in particular Data Guard and Active Data Guard services. The target is Q4 2014. More details will be available by the next meeting.

Tier 0 News

  • Argus: the latest version of Argus deployed in production, which fixes a bug with respect to CAs that become unresponsive, comes as an RPM taken from GitHub and is not stored in any of the official repositories. We would like to check with other sites running this version whether they have seen any issues with it in the last month (we did not), so that it can become a properly released version in a proper repository.
  • FTS3: it would be useful to get the Nagios FTS3 probes. The Nagios EGI team is preparing a new release and it would be great to have the probes included there, as CERN is running a Nagios NGI instance, released and maintained by EGI.
  • AFS UI: it is old and unmaintained, so we are looking at decommissioning it. Our understanding is that there are alternatives to it: any SLC6 system can use the native UI like we have on lxplus, and the CVMFS WN and UI are being worked on. The proposal is to decommission it at the end of September. If there are use cases that cannot be changed before that date, please let us know through GGUS/SNOW tickets.
  • lxplus5, lxbatch5: it is proposed to terminate the SLC5-based interactive and batch services (lxplus5 and lxbatch5) soon. The current target date is 30 September 2014. These services have been replaced by SLC6-based services (lxplus6 and lxbatch6). If there are serious concerns about the termination of these SLC5-based services by this target date, please contact the CERN Service Desk (http://cern.ch/service-portal or service-desk@cern.ch) by 14 September 2014 with details of the use cases affected.
    • CERN Status Board Information: https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&&n=OTG0013168.
    • The date of end September is not cast in stone, but it would be really inconvenient if we had to run lxplus5 and lxbatch5 beyond the end of Quattor services (currently foreseen for end October), as significant work would be needed to migrate these services to Puppet.
    • Note that the announcement only concerns the lxplus5 and lxbatch5 services; the support for Scientific Linux CERN 5 is not affected and continues until March 2017.
    • A few users have already contacted us pointing out that they need SLC5 to build their software, as they haven't completed the porting to SLC6 yet. If these needs continue beyond the lxplus5 closure, we will provide users with detailed instructions on how to set up a private virtual machine under OpenStack, on which software can be built.

About the AFS UI, Christoph adds that CMS still uses it for LXPLUS5. He asks about the status of the tarball distribution: it needs to be clarified if it is still maintained by somebody.

Maite adds that, if needed, the LXPLUS5/LXBATCH5 decommissioning might be postponed to the end of October; any later date would be very difficult due to the Quattor decommissioning. CMS is collecting use cases for LXPLUS5/LXBATCH5. Andrea S. mentions that in a private communication ATLAS people said that it is not an issue for distributed computing but they need to check with the software people.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • steady production and analysis activities throughout the past weeks

ATLAS

CMS

  • Processing overview:
    • Finishing samples for CSA14
    • Mainly waiting for new requests within coming weeks
  • CSA14 (Computing, Software and Analysis challenge 2014)
    • Extended by two weeks until mid-September
    • CRAB3
      • Gained experience through user feedback
      • Load through actual user submission small, but backfill from HammerCloud
    • AAA
      • Users rather happy with remote access possibility
      • Increase scale over time
    • Exercise Dynamic Data Placement and Cache Release
    • miniAOD
      • Well received by user community
      • Planning re-running of all miniAOD production in September
  • SL5 lxplus
    • Proposed date for end of lxplus5 (Sep 30th) might be too early
      • Still required for older releases to build libraries
    • In discussion with CMS physics community
  • FTS3
    • All relevant sites migrated
  • VO card update process
    • Requires verification - who does it?
  • Reminder for sites:
    • Need to change xrootd redirectors, see this hn post
    • Need to adapt site-local-config.xml to include <phedex-node value="Tx_CO_Site{_type}"/> (e.g. value="T1_DE_KIT_Disk") in the <local-stage-out> section and the same format (but the PhEDEx name for the fallback endpoint) in <fallback-stage-out>
    • Need to upgrade to CVMFS >= 2.1.19 immediately
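
As an illustration of the site-local-config.xml change requested above, here is a minimal Python sketch that adds or updates the <phedex-node> element in both sections. The sample XML is a heavily trimmed, hypothetical file (real configurations contain more elements), the helper name set_phedex_node is invented for this example, and the fallback node name T2_DE_DESY is only a placeholder for whatever PhEDEx name applies to a site's fallback endpoint.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed site-local-config.xml (assumption: real files
# contain many more elements and attributes than shown here).
SAMPLE = """<site-local-config>
  <site name="T1_DE_KIT">
    <local-stage-out>
      <command value="srmv2"/>
    </local-stage-out>
    <fallback-stage-out>
      <command value="srmv2"/>
    </fallback-stage-out>
  </site>
</site-local-config>"""


def set_phedex_node(root, section, node_name):
    """Add or update <phedex-node value="..."/> in every <section> element.

    Returns the number of sections that were touched.
    """
    touched = 0
    for parent in root.iter(section):
        node = parent.find("phedex-node")
        if node is None:
            node = ET.SubElement(parent, "phedex-node")
        node.set("value", node_name)
        touched += 1
    return touched


root = ET.fromstring(SAMPLE)
# Node name taken from the example in the minutes.
set_phedex_node(root, "local-stage-out", "T1_DE_KIT_Disk")
# Placeholder: use the PhEDEx name of the site's actual fallback endpoint.
set_phedex_node(root, "fallback-stage-out", "T2_DE_DESY")
print(ET.tostring(root, encoding="unicode"))
```

The helper is idempotent: running it again updates the existing element instead of adding a second one.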

Andrea S. will contact EGI to understand whether it is possible to avoid waiting for changes to the VO cards of LHC VOs to be validated by the EGI operations portal people. Update: Peter Solagna agreed that WLCG VOs should be able to auto-approve their changes and he will submit a feature request to the operations portal team.

LHCb

  • Operations
    • Low activity, mainly Monte Carlo simulation and user jobs
  • WMS decommissioning
    • For SAM/Nagios, the probes for the ARC CEs at several UK sites are now submitted via a WMS instance at RAL-LCG2. It was confirmed that this WMS instance will be kept in production, also for this purpose, at least until 2015. In the long term, the possibility of using a probe from the NorduGrid team for direct submission and execution of the different WN payloads will be checked.
  • IPv6
    • First basic tests, e.g. for setting up the runtime environment, were successfully completed. It was found that MyProxy is not accessible from an IPv6-only node.

Ongoing Task Forces and Working Groups

Tracking Tools Evolution TF

FTS3 Deployment TF

gLExec Deployment TF

  • NTR

Machine/Job Features

  • Unfortunately for the TF, the developer and maintainer of the Condor implementation will leave the HEP community. OSG was asked whether somebody could join the TF to continue the work.

Middleware Readiness WG

  • The pilot component for MW readiness (DPM) was successfully verified by ATLAS and CMS workflows for both installations, at GRIF and Edinburgh.
  • A new version of the WLCG Package Reporter was released. We kindly ask that T0 pre-production machines be equipped with the Package Reporter, in order to test scalability and to enlarge the package database.
  • Discussion started with the Legnaro site manager in order to install the new BDII update 9 and CREAM CE 1.6.3 for CMS verification. The site manager is going to upgrade to the latest version of the software next week, while A. Sciabà is working on the definition of the CMS workflow for the verification.
  • The design of the MW readiness software has started. It will use package information coming from either the WLCG Package Reporter or Pakiti.
  • Mark your diaries! Next meeting on October 1st at CERN with audioconf and Vidyo. Provisional Agenda here

Multicore Deployment

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • compliance of the WLCG infrastructure was to be tested with the SAM preprod instances
      • so far done only for ALICE, since the early hours of July 23
      • LHCb next?
    • latest results for ALICE show:
      • OK tests for half the sites
      • many failures due to simple configuration issues of CREAM and/or Argus
      • no new showstopper so far
    • proposal:
      • next broadcast on Monday
      • hard deadline: Mon Sep 15
        • normal jobs and SAM tests will start using the new VOMS servers as of that date
        • the old VOMS servers will continue to be used in parallel
      • sites that fail the SAM preprod tests by the end of Aug will be ticketed
      • we need to verify all experiment workflows with the new servers ASAP !

WMS Decommissioning TF

  • Condor validation
    • All issues (34 in total) were resolved and both ATLAS and CMS are ready for production
    • Deployment to production is planned on Wed 1st of October 2014

IPv6 Validation and Deployment TF

  • Ran some tests on a pure IPv6 EMI-3 UI in Oxford (thanks to Ewan MacMahon). The following was observed:
    • CVMFS works fine
    • voms-proxy-init (from the new Java-based client) doesn't seem to work with a dual-stack VOMS server - to be investigated
    • lcg-info(sites) work over IPv6 if they are hacked to add "use Net::INET6Glue;"; to be fixed in a future release if possible
    • lcg-cr tested to work (as expected) over IPv6 provided that GLOBUS_FTP_CLIENT_IPV6=true is set
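
The last observation depends on an environment variable, so a small Python sketch of how a wrapper script might set it for a child process may be useful. Only the variable name and value GLOBUS_FTP_CLIENT_IPV6=true come from the test report; the helper names are invented for this example, and a harmless stand-in command replaces lcg-cr so that the sketch runs anywhere.

```python
import os
import subprocess
import sys


def ipv6_env(base=None):
    """Return a copy of the environment with GLOBUS_FTP_CLIENT_IPV6=true set."""
    env = dict(os.environ if base is None else base)
    env["GLOBUS_FTP_CLIENT_IPV6"] = "true"
    return env


def run_with_ipv6(cmd):
    """Run cmd with the IPv6 flag exported; raises if the command fails."""
    return subprocess.run(cmd, env=ipv6_env(), capture_output=True,
                          text=True, check=True)


# Stand-in for the real transfer command (assumption: on a UI this list would
# hold the lcg-cr invocation and its arguments instead).
probe = [sys.executable, "-c",
         "import os; print(os.environ['GLOBUS_FTP_CLIENT_IPV6'])"]
result = run_with_ipv6(probe)
print(result.stdout.strip())  # prints "true"
```

Setting the variable only in the child's environment keeps the change local to the transfer command and leaves the rest of the session untouched.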

Squid Monitoring and HTTP Proxy Discovery TFs

  • Reactivated the Squid Monitoring TF to track its task list, which was still quite far from being completed.
    • Two new members added: Costin Grigoras (ALICE) and David Crooks (NGI_UK)
    • Basic MRTG monitoring via registered squids in GOCDB/OIM is now almost completed
    • Soon need to start a campaign to get sites to register their squids -- how to do that?
    • Meeting scheduled for 28 August to discuss Costin's new squid monitor based on MonALISA
  • Proxy Discovery TF will be able to make some progress on its task list after sufficient squids are registered -- estimated dates were pushed out one month
    • Costin Grigoras added as a member to this TF too

Network and Transfer Metrics WG

  • Updated WG page with list of members, task tracking, coming events and reports (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics)
  • Kick-off meeting will take place on Mon 8th of Sept at 3PM CEST
  • On July 21st perfSONAR Toolkit 3.4rc2 became available for testing. Version 3.4 is a major milestone for the WG, as it enables access via a REST API and introduces several important performance improvements; a deployment campaign will therefore follow once a stable release is available.
  • Work is progressing on the WLCG perfSONAR configuration interface (finalized design, work is ongoing on a prototype implementation)
  • OSG perfSONAR datastore plan has been agreed and testing of the store based on esmond is ongoing

WLCG perfSONAR service level report on 2014-08-20 16:59:32.876708

perfSONAR instances monitored: 214
perfSONAR-PS versions deployed: 
   3.2.2 : 6
   3.3.1 : 3
   3.3.2 : 174
   Unknown: 27
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 30
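
To put the counts above in proportion, the following Python sketch parses the report text and expresses each deployed version as a share of the monitored instances. The parsing helper is an assumption made for this example, not part of any perfSONAR tool. Note that the listed version counts add up to 210, slightly fewer than the 214 monitored instances; the report does not say how the remaining instances are accounted for.

```python
# Report text copied from the minutes.
REPORT = """perfSONAR instances monitored: 214
perfSONAR-PS versions deployed:
   3.2.2 : 6
   3.3.1 : 3
   3.3.2 : 174
   Unknown: 27
GOCDB registered total: 170
OIM registered total: 53
Unreachable instances (not monitored): 8
Incorrectly configured (failing >4 metrics): 30
"""

VERSIONS = ("3.2.2", "3.3.1", "3.3.2", "Unknown")


def parse_report(text):
    """Map each 'label: number' line of the report to an integer."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            fields[key.strip()] = int(value)
    return fields


fields = parse_report(REPORT)
monitored = fields["perfSONAR instances monitored"]

for version in VERSIONS:
    share = 100.0 * fields[version] / monitored
    print(f"{version:>8}: {fields[version]:4d} ({share:.1f}% of monitored)")

misconfigured = fields["Incorrectly configured (failing >4 metrics)"]
print(f"Incorrectly configured: {100.0 * misconfigured / monitored:.1f}% of monitored")
```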

Action list

  1. NEW on the WLCG middleware officer and the experiment representative: for the experiments to report their usage of the AFS UI and for the middleware officer to take the steps needed to enable the CVMFS UI distribution as a viable replacement for the AFS UI.
  2. NEW on Andrea S.: to understand with EGI if it is possible to bypass the validation of VO card changes in the case of the LHC VOs.
  3. ONGOING on the WLCG monitoring team: status of the CondorG probes for SAM to be able to decommission SAM WMS. Status: see these minutes.
  4. ONGOING on the middleware officer: report about progress in CVMFS 2.1.19 client deployment. Status: see these minutes.
  5. ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Report about showstoppers. Status: the SAM team made a proposal on the steps to be taken to enable SAM. ATLAS is following up to make sure that the new CEs are correctly visible in AGIS, while for the CMS VO feed they will be taken directly from OIM. The plan is at first to test HTCondor-CEs in preproduction and later switch to production. It is not foreseen to monitor GT5 and HTCondor endpoints on the same host at the same time.
  6. CLOSED on the Operations Coordinators: Follow up with LHCb performance issues with voms-clients v3. Status: LHCb is evaluating the impact of the new client and from the first tests it is not expected to be an issue.

AOB

-- MariaALANDESPRADILLO - 20 Jun 2014

Topic revision: r22 - 2014-08-27 - AlessandroCavalli