WLCG Operations Coordination Minutes - September 4th, 2014

Agenda

Attendance

Operations News

Middleware News

  • the WLCG repository will become signed in the near future
    • further news will be announced in due course

  • Baselines:
    • No new EMI/UMD releases in the period
    • FTS2 removed from the table, as it has been decommissioned or is being decommissioned everywhere
  • MW Issues:
    • dCache issue with a Brazilian CA certificate: affects LHCb at some sites (PIC, GridKa, SARA); a workaround exists and a fix is under preparation (v2.6.34)
    • Missing Key Usage extension in delegated proxy: it affects the ATLAS and Rucio integration. The fix for the CREAM UI has been postponed to October and will then be integrated in HTCondor. Submission with RFC proxies is not affected.
    • No updates on the latest version of Argus, which is still not officially released.
  • T0 and T1 services
    • FTS2 decommissioning
      • Done at TRIUMF, NL_T1, RAL
      • Under way at ASGC, CNAF, NDGF-T1
    • Planned changes:
      • IN2P3: dCache upgrade to 2.6.31 on 23/09
      • NL_T1: dCache upgrade to 2.6.34, once available, to fix the issue with Brazilian certificates.
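The "Missing Key Usage extension" issue above can be illustrated with openssl: the snippet generates a throwaway certificate that carries the Key Usage extension the delegated proxies were lacking, then inspects it. All names and paths are illustrative, not taken from the minutes, and `-addext` requires OpenSSL 1.1.1 or newer.

```shell
# Generate a throwaway self-signed cert carrying the Key Usage extension
# (subject and file paths are illustrative only)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/ku.key -out /tmp/ku.crt -days 1 \
  -subj "/CN=example" \
  -addext "keyUsage=critical,digitalSignature,keyEncipherment"
# A proxy missing the extension would show no "X509v3 Key Usage" block here
openssl x509 -in /tmp/ku.crt -noout -text | grep -A1 "Key Usage"
```

Tools that validate the delegated proxy chain reject certificates whose Key Usage does not permit the requested operation, which is why the missing extension broke the ATLAS/Rucio integration.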

Oracle Deployment

Tier 0 News

  • AFS UI: in contact with the AFS team to find out who is using it. What we have so far: a few hertz of actual reads, but a high stat rate, indicating that many users have the PATH and LD_LIBRARY_PATH set up.
  • lxplus5: the current target date for retirement is 30 September 2014. These SLC5-based services have been replaced by SLC6-based services (lxplus6 and lxbatch6). Note that the deadline to report serious concerns about the termination is 14 September 2014; please contact the CERN Service Desk (http://cern.ch/service-portal or service-desk@cern.ch).
  • Argus: we suspect we have seen the same problem on the 28th (related to unresponsive CAs). We therefore still think that fixes need to be properly released rather than manually installed from GitHub, and in addition it would be better if the problem were properly debugged and fixed.
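The high stat rate noted for the AFS UI comes from login environments that still reference it. A user can probe their own shell with a small check; the AFS path below is an assumed example location, not taken from these minutes.

```shell
# Check whether the current shell environment still references the AFS UI.
# The path is an assumed example; substitute the actual AFS UI volume path.
AFS_UI="/afs/cern.ch/project/gd/LCG-share"
case ":$PATH:" in
  *"$AFS_UI"*) echo "AFS UI referenced in PATH" ;;
  *)           echo "AFS UI not referenced in PATH" ;;
esac
```

The same pattern applied to LD_LIBRARY_PATH would catch the library-path references the AFS team is seeing.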

Tier 1 Feedback

  • NDGF-T1 is testing FAX using native xrootd and nfs4-mount from dCache.

Tier 2 Feedback

Experiments Reports

ALICE

  • low activity recently, new productions are being prepared
  • CERN: job failure rates and efficiencies under investigation

ATLAS

  • in addition to the reports of the specific task forces:
    • DQ2 -> Rucio commissioning: FullChain test (T0-like workflow) ongoing; automatic quota checks are now being added. No direct implications for sites, but the Rucio tests plus normal DQ2 production activity are producing a slightly higher load on site storage.
    • ProdSys1 -> ProdSys2 commissioning ongoing: many workflows implemented and tested. The first "real production" workflows will start now. No implications for the sites.
    • Activities:
      • Analysis as usual
      • DC14 G4 simulation almost done
      • DC14 mc14_13TeV reconstruction not started yet, pending validation, will run on multicore

CMS

  • Processing overview:
    • Samples for CSA14 basically finished
    • Additional test workloads sent to Tier-1 sites (mostly reading via xrootd from CERN EOS)
  • SL5 lxplus5/lxbatch5
    • Computing coordinator in contact with Physics groups
      • Feedback will be given by mid-September
  • AFS-UI at CERN
    • Many CMS SL5 based services and clients still use gLite clients from AFS
    • Efforts ongoing to move the services to SL6 (which has to happen anyhow)
    • Usage of Grid clients from CVMFS under investigation
  • Reminders for sites
    • Upgrade CVMFS to 2.1.19
    • Update xrootd fallback configuration
      • Tickets will start to be opened - we have been asking for months
      • Solid setup needed for AAA scale testing towards end of September
    • Add "PhEDEx Node Name" to the site configuration
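The CVMFS upgrade reminder above amounts to a simple version comparison per worker node. The sketch below uses a hypothetical installed version; on an RPM-based node the real value could come from `rpm -q --qf '%{VERSION}' cvmfs` (an assumption about the site's packaging, not stated in the minutes).

```shell
# Compare an installed CVMFS version against the required baseline.
# "installed" is a hypothetical placeholder value for illustration.
required="2.1.19"
installed="2.1.15"
lowest=$(printf '%s\n%s\n' "$required" "$installed" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
  echo "cvmfs $installed meets the $required baseline"
else
  echo "cvmfs $installed is older than $required - please upgrade"
fi
```

`sort -V` performs the natural version ordering, so the check works for any two dotted version strings.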

LHCb

  • LHCb activity dominated by simulation productions which, in waves, reach the full capacity.
  • It has been cross-checked that all references from DIRAC to the AFS UI have been removed. Nevertheless, LHCb suggests renaming the AFS UI volume as a first step, in order to have a fallback in case things go wrong.
  • Testing of voms3 client within LHCb environment will start soon.
  • SHA-2 certificate testing started. Many sites are failing SAM tests and have been contacted by Maarten - many thanks to him!
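A quick way to see which signature-algorithm family a certificate uses is to inspect it with openssl. The snippet generates a throwaway SHA-256-signed certificate to show what a SHA-2 signature looks like; subject and paths are illustrative, not from the minutes.

```shell
# Create a throwaway cert signed with SHA-256 (a SHA-2 family digest);
# subject and file paths are illustrative only
openssl req -x509 -newkey rsa:2048 -nodes -sha256 \
  -keyout /tmp/sha2.key -out /tmp/sha2.crt -days 1 -subj "/CN=example"
# A legacy SHA-1-signed cert would report sha1WithRSAEncryption here instead
openssl x509 -in /tmp/sha2.crt -noout -text | grep "Signature Algorithm" | head -n1
```

Running the inspection half against a site's host certificate shows at a glance whether it is still SHA-1 signed.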

Ongoing Task Forces and Working Groups

Network and Transfer Metrics WG

  • Kick-off meeting will take place on Monday September 8 at 3 PM CEST
  • An early version of the WLCG perfSONAR configuration interface will be deployed to production next week.
  • The Pythia Network Diagnosis Infrastructure (PuNDIT) project will be funded by NSF and starts at the beginning of September (led by Shawn). The project will use perfSONAR-PS data to identify and localize network problems using the Pythia algorithms. PuNDIT will collaborate with OSG and WLCG over its two-year duration.
  • Sites with incorrect versions of perfSONAR will receive tickets at the beginning of next week (9 sites in total)
  • WLCG perfSONAR service status report (2014-09-04):
    • perfSONAR instances monitored: 214
    • perfSONAR-PS versions deployed:
      • 3.2.2: 6
      • 3.3.1: 3
      • 3.3.2: 173
      • Unknown: 28
    • GOCDB registered total: 170
    • OIM registered total: 53
    • Unreachable instances (not monitored): 8
    • Incorrectly configured (failing >4 metrics): 28

Tracking Tools Evolution TF

FTS3 Deployment TF

  • The FTS3 task force can be considered successfully finished: FTS3 is now the production FTS service for all the LHC experiments.
  • FTS3 overall: the plan is to have new releases every 3-4 months (unless urgent issues appear). We will organize dedicated overview meetings in advance to demo and discuss new features and functionality.

  • Details of the past weeks:
    • FTS Dashboard:
      • New aggregation views have been included:
        • to monitor the active & queued transfers per endpoint/host
        • to monitor transfers per VO activity (user files, production, T0 export, …)
      • Ongoing tasks:
        • Fix the computation of transient states in the FTS state monitoring view
        • Implement the FTS3 VO/activity-shares monitoring

gLExec Deployment TF

  • NTR

Machine/Job Features

  • A new contact for the HTCondor implementation has been suggested by OSG and is currently being introduced to the topic - welcome Marian Zvada!

Middleware Readiness WG

  • Last week we kindly asked for the T0 pre-production machines to be equipped with the package reporter, in order to test scalability and to grow the package database. Any news?
  • The latest CREAM CE and BDII updates have been installed at the LNL-T2 site (Legnaro). The configuration for job submission is ongoing (Andrea S is following this up).
  • Design of the MW software for the verification activity is ongoing; a first prototype is to be delivered before the next meeting (October 1st).
  • Book your diaries! The next meeting is on October 1st at CERN, with audioconf and Vidyo. Provisional agenda here.

Multicore Deployment

  • ATLAS: 11 T1s and 35 T2s are currently running multicore for ATLAS, with an average occupancy of 30k slots, peaking at 57k. Among the Tier-2s, production is still dominated by dedicated US sites, but extra EU shared sites have been added in the past month: 1 French, 3 UK and 4 Italian Tier-2s were added over the summer. Most of these sites use dynamic slots, or a mixture of dynamic slots and a small number of statically reserved ones. All sites are requested to enable dynamic batch-system support to run multicore and single-core jobs concurrently.
  • CMS: Running multicore pilots regularly at all CMS T1s. Multicore pilots testing at several US T2s as well.
  • Passing parameters to the batch systems: as reported on other occasions, applying certain scheduling techniques requires the experiments to pass parameters to the batch systems ahead of scheduling. Several obstacles still need to be cleared. One is that there is no standard blah script for sites deploying a CREAM CE; the scripts that exist are clearly not standardized, since they were written by individual sites trying to match what the experiments were sending, without any prior discussion. This has been discussed with ATLAS, since there are other use cases where this is needed, for example requesting a specific amount of memory. It was decided that the TF will take on the standardization of the blah scripts (and other CE scripts if needed) for the scheduling parameters. A discussion about distribution and deployment will also be needed.
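As one concrete illustration of the dynamic batch-system support requested above, an HTCondor site can configure a single partitionable slot per machine, from which multicore and single-core jobs carve out dynamic slots concurrently. This is a generic sketch, not a recommendation from the minutes; values should be adapted per site.

```
# Sketch of HTCondor dynamic (partitionable) slots: one partitionable slot
# owns the whole machine, and both 8-core and 1-core jobs carve dynamic
# slots out of it on demand.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, mem=100%
SLOT_TYPE_1_PARTITIONABLE = True
```

With this layout, the negotiator matches a multicore job by splitting off an 8-core dynamic slot while the remaining cores stay available for single-core work, which is the concurrent-scheduling behavior sites are asked to enable.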

SHA-2 Migration TF

  • introduction of the new VOMS servers
    • compliance of the WLCG infrastructure is being tested with the SAM preprod instances
    • EGI broadcast sent Aug 25
    • deadline Mon Sep 15
      • switch SAM production configurations for WLCG on that day
      • gradually switch experiments' job and data systems as of that date
        • we need to get all relevant workflows verified ASAP
      • keep the new servers blocked for outside CERN a while longer?
    • 88 tickets opened on Sep 1 for EGI sites that still failed due to the VOMS switch
    • OSG sites are in very good shape
      • only WT2 (SLAC) SRM still failing due to the VOMS switch
    • EGI will run a campaign for Ops via the NGIs and ROCs
      • first try to ensure sites are configured OK
      • then switch SAM-Nagios instances
    • the old VOMS servers will continue running until the end of Nov
      • their host certs expire by then

WMS Decommissioning TF

  • Condor-based SAM probes
    • Deployment to production planned for Wednesday October 1st, 2014

IPv6 Validation and Deployment TF

Nothing to report.

Squid Monitoring and HTTP Proxy Discovery TFs

  • Automated MRTG monitor needs a few tweaks before we can start the registration campaign, in particular:
    1. Use site names when reading from OIM instead of service names
    2. Distinguish between multiple squid services registered at a site
    3. Use different service type from OIM after OIM is changed to not distinguish between Frontier & CVMFS squids
  • A draft of a WLCGSquidRegistration instruction page has been made. It will definitely need more work.
  • Agreed in a meeting to incorporate Costin Grigoras' new MonALISA-based monitor into wlcg-squid-monitor as an additional interface, and to use his rpm-based host collector as an optional method to collect extra monitoring data. For now we still expect every site to also make its squid monitoring accessible via SNMP.

Action list

AOB

-- MariaALANDESPRADILLO - 20 Jun 2014

Topic revision: r25 - 2014-09-04 - MaartenLitmaath
 