WLCG Operations Coordination Minutes, July 6th 2017

Highlights

Agenda

Attendance

  • local: Andrea M (MW Officer + data management), Andrea S (IPv6), Gavin (T0), Julia (WLCG), Maarten (WLCG + ALICE)

  • remote: Alessandra D (Napoli), Alessandra F (Manchester + ATLAS), Alessandro (CNAF), Brian (RAL), Catherine (LPSC + IN2P3), David B (IN2P3-CC), David C (Glasgow), David M (FNAL), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Gareth (RAL), Giuseppe (CMS), Javier (IFIC), Jeremy (GridPP), Kyle (OSG), Marcelo (LHCb), Marcin (PSNC), Renaud (IN2P3-CC), Ron (NLT1), Sang-Un (KISTI), Thomas (DESY), Vikas (VECC), Xin (BNL)

  • apologies: Marian (networks), ATLAS

Operations News

  • WLCG workshop took place from the 19th to the 22nd of June, hosted by the University of Manchester. Thanks to our Manchester colleagues, in particular Alessandra, for the excellent organization. The operations session chaired by Pepe covered many important areas such as benchmarking, monitoring, information system evolution and storage space accounting. More details can be found here

  • Pre-GDB on containers will be held Tue July 11 afternoon
  • GDB will be held on Wed July 12

  • the next meeting is planned for Sep 14
    • please let us know if that date would present a major issue

Middleware News

  • Useful Links:
  • Baselines/News:
    • Globus EOL in 2018 (https://www.globus.org/blog/support-open-source-globus-toolkit-ends-january-2018).
      • So far it looks likely that CERN together with OSG will take over the code maintenance and support in the short term, hopefully with the continued participation of a person from NDGF. In the longer term we will look at how this code should be replaced, in particular GSI and GridFTP. Essentially this is a non-issue for now.
    • perfSONAR baseline moved to v4.0.0 (since the last meeting); removed dCache 2.13 from the baselines and added dCache 2.16.39
    • dCache 2.13.x reached EOL in June; among the T1s, only KIT and FNAL are still running this version.
    • Some new products are expected to be released in UMD4 within this month.
    • As broadcast by C. Aiftimiei, the EMI repositories were shut down on 15/06.
  • Issues:
  • T0 and T1 services
    • CERN
      • Castor upgrade to 2.1.16-18 for all VOs, diskserver migration to C7
      • 2 load balanced HAProxy servers deployed in front of Production FTS
    • IN2P3
      • Major dCache upgrade to v2.16.37
      • Upgrade of xrootd during the next stop in September
    • JINR
      • Minor dCache upgrade 2.16.31 -> 2.16.39 on both instances;
      • minor xrootd upgrade 4.5.0-2.osg33 -> 4.6.1-1.osg33 for CMS
    • KISTI
      • xrootd upgrade from v3 to v4.4.1 for tape
    • NL-T1:
      • SURFsara: major dCache upgrade to 2.16.36 on June 6-7
    • RAL:
      • Castor stagers updated to 2.1.16-13 and SRMs to 2.1.16-0.
      • All data now on T10KD drives/media.
      • Upgrade of FTS "prod" instance delayed due to non-LHC VOs' usage of the SOAP API. We hope to be able to upgrade during July
    • TRIUMF:
      • Major dCache upgrade to v2.16.39

Discussion

Tier 0 News

  • Storage: see above
  • Batch capacity increases ongoing

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • The activity levels typically have been very high
    • The average was 112k running jobs, with a new record of 143k on May 29
  • CERN
    • Some fallout from the DNS incident on June 18
  • No other major problems

ATLAS

  • Stable production at 300k cores, about 80k of which are used for derivations.
  • Derivation production is causing too many transfers; the workflow needs further optimization (e.g. 70 outputs per multicore job).
  • Ongoing ATLAS P1 to EOS to CASTOR data throughput test to fully validate (at approximately double the nominal rate) the data workflow from the ATLAS experiment to the tape infrastructure.
  • Ongoing efforts to understand sites that are not performing well (high wasted wallclock time with respect to the average of the other sites).

CMS

  • CMS Detector
    • Commissioning progressing
    • Most effort goes into the new pixel detector
  • Processing activities
    • Overall utilization rather moderate
    • Finished a RE-RECO of 2016 data
    • Main MC production campaign for 2017 still in preparation
    • Small (but urgent) RE-RECOs of recent 2017 data for commissioning
  • Sites
    • Deprecation of stage-out plugins
    • In contact with sites to test the IPv6 readiness of their storage
  • EOS
    • Suffered from limitations in GSI authentication capacity - fixed
    • Identified a source of occasional file corruptions: improper handling of write recoveries
      • Can be circumvented by setting an environment variable
      • Details: GGUS:127993
  • EL7 migration
    • Found some issues with Singularity in certain configurations
    • The recommendation is to postpone the migration, if possible
  • Rising interest in CMS to use MPI compute resources for certain generators
    • Sites that want to provide such resources should contact Stephan Lammel and Giuseppe Bagliesi

LHCb

  • High activity on the grid, keeping an average of 60K jobs

  • CERN
    • The proxy expiration problem on the HTCondor CEs is still being investigated (GGUS:129147)

Ongoing Task Forces and Working Groups

Accounting TF

  • Progress on the storage space accounting prototype has been reported at the WLCG Workshop
  • The latest Accounting TF meeting in May discussed the plan to add raw wallclock job duration to the accounting portal as a separate metric; currently the wallclock field can contain either raw or scaled wallclock time (the distinction is illustrated below). APEL colleagues presented EGI work regarding storage space accounting.
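For illustration only: a minimal sketch of the raw vs. scaled wallclock distinction, assuming the usual APEL-style convention that scaled (normalised) wallclock is the raw wall time multiplied by the benchmarked HS06 power of the slots used. The function and parameter names are hypothetical, not the portal's actual schema.

    # Hypothetical illustration of raw vs. scaled wallclock (APEL-style convention assumed).
    def scaled_wallclock(raw_wallclock_s: float, hs06_per_core: float, cores: int) -> float:
        """Scale raw wall time by the benchmarked power (HS06) of the slots used."""
        return raw_wallclock_s * hs06_per_core * cores

    if __name__ == "__main__":
        raw = 3600.0  # a 1-hour, 8-core job
        print("raw wallclock   :", raw, "s")
        print("scaled wallclock:", scaled_wallclock(raw, hs06_per_core=10.0, cores=8), "HS06.s")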

Information System Evolution TF

  • The IS evolution plans and progress in CRIC development have been presented at the WLCG Workshop in Manchester


IPv6 Validation and Deployment TF


  • Andrea S:
    • we will prepare a campaign for T2 sites to start looking into their IPv6 preparations (a simple readiness check is sketched at the end of this section)
    • it will be started at a small scale, to gain experience before all sites are contacted
    • we probably need a GGUS support unit and a mailing list
    • the text sent to the sites needs to be very clear
    • we aim to have dual-stack deployment of storage services at the vast majority of sites by the end of Run 2
  • Julia:
    • there should be a communication channel for sites to share experiences
    • a Twiki page would be helpful for recipes etc.

  • Julia: did the IPv6 session at the workshop go OK?
  • Andrea S:
    • there were ~30 people in the hands-on session
    • the exercises were easy and went well
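As a purely illustrative aid for the campaign mentioned above, here is a minimal sketch of the kind of check a site could run against a storage endpoint: it only verifies that the host publishes an AAAA record and accepts an IPv6 TCP connection. The hostname and port below are placeholders, not part of the TF material.

    import socket

    def ipv6_check(host: str, port: int) -> bool:
        """Return True if host has an AAAA record and accepts an IPv6 TCP connection."""
        try:
            infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
        except socket.gaierror:
            print(f"{host}: no AAAA record")
            return False
        for family, socktype, proto, _, sockaddr in infos:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(5)
                try:
                    s.connect(sockaddr)
                    print(f"{host}:{port} reachable over IPv6 at {sockaddr[0]}")
                    return True
                except OSError:
                    continue
        print(f"{host}:{port} has an AAAA record but is not reachable over IPv6")
        return False

    if __name__ == "__main__":
        ipv6_check("se.example.org", 1094)  # placeholder SE hostname and xrootd port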

Machine/Job Features TF

Current status

MJF hosts (all sites) total: 158

  • Hosts OK: 25
  • Hosts WARNING: 15
  • Hosts CRITICAL: 112

The warnings/errors are of just a few types (configuration mistakes), and it looks like not much effort is required to correct them; a minimal check along these lines is sketched after the lists below. Namely:

WARNING

  • Warning Key hs06 absent (or empty): 11
  • Warning Key max_swap_bytes absent (or empty): 4

CRITICAL

  • Error Environment variable MACHINEFEATURES not set: 98
  • Error Environment variable JOBFEATURES not set: 2
  • Error Key total_cpu absent (or empty): 10
  • Error Key cpu_limit_secs absent (or empty): 2
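A minimal sketch of the kind of check behind the numbers above, assuming the common MJF convention that $MACHINEFEATURES and $JOBFEATURES point to directories containing one file per key (MJF values can also be published via HTTP, which this sketch does not cover); only the keys listed above are checked.

    import os

    # machine-level and job-level keys taken from the report above
    CHECKS = {
        "MACHINEFEATURES": ["hs06", "total_cpu"],
        "JOBFEATURES": ["max_swap_bytes", "cpu_limit_secs"],
    }

    def check_mjf() -> None:
        for var, keys in CHECKS.items():
            directory = os.environ.get(var)
            if not directory:
                print(f"CRITICAL: environment variable {var} not set")
                continue
            for key in keys:
                try:
                    with open(os.path.join(directory, key)) as f:
                        value = f.read().strip()
                except OSError:
                    value = ""
                if not value:
                    print(f"PROBLEM: key {key} absent (or empty) under ${var}")
                else:
                    print(f"OK: {key} = {value}")

    if __name__ == "__main__":
        check_mjf()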

Propagation of MJF to other experiments requires some amount of work. In particular, Antonio (aperez@pic.es) wrote about CMS:

  • CMS SI have worked with the glideinWMS (our pilot system) developers to incorporate the information published as MJF into our pilots (where available). So potentially we could add new features (such as job masonry, but also signaling job/node shutdown times) when the rest of the dependencies are solved. One of those dependencies will of course be the deployment of MJF at the CMS sites not shared with LHCb.

Monitoring

MW Readiness WG


This is the status of JIRA ticket updates since the last Ops Coordination meeting of 2017-05-18:

  • MWREADY-146 - dCache 2.16.34 verification for ATLAS @ TRIUMF, also with IPv6 - completed (there was a problem when TRIUMF updated the production instance, unfortunately not spotted in the testing instance)
  • MWREADY-145 - The latest version of the WN metapackage for C7 has been released (v4.0.5, renamed to wn) and tested by Liverpool. The metapackage is being included in UMD4 (GGUS:128753)
  • MWREADY-147 - ARC-CE 5.3.1 under testing at Brunel.
  • MWREADY-148 - New CREAM-CE for C7: we agreed with M. Sgaravatto to do the testing for CMS at LNL.

Network and Transfer Metrics WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • CMS Frontier at CERN is now using http://grid-wpad/wpad.dat with IPv6 in production. ATLAS Frontier at CERN has all this time been randomly using squids at Geneva and Wigner, regardless of the location of the worker nodes, causing much traffic to go over the long-distance links. They are now making plans to start using http://grid-wpad/wpad.dat to select local squids; a minimal look at the WPAD mechanism is sketched below.
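As a small illustration of the mechanism (not an official recipe): the WPAD URL serves a standard proxy auto-config (PAC) file whose FindProxyForURL() function returns the squids to use. The sketch below simply fetches and prints that file; it only works from a node that can resolve grid-wpad, i.e. inside the CERN network.

    import urllib.request

    WPAD_URL = "http://grid-wpad/wpad.dat"

    def show_wpad(url: str = WPAD_URL) -> None:
        """Fetch and print the PAC file used for local squid discovery."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(resp.read().decode())

    if __name__ == "__main__":
        show_wpad()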

Traceability and Isolation WG

Special topics

MW deployment forums and feedback

presentation

  • Gavin: we take HTCondor unchanged
  • Maarten: but you enhance it e.g. with the BDII info provider;
    furthermore, the matter is not just about patches, but about deployment in general

  • Julia:
    • the fts3-steering list is a good example, though only involving VOs and devs
    • in general the fora would need to allow VOs, sites and devs to participate
    • feedback from sites should be collected and made easily available for others
      • deployment documentation, workarounds etc.

  • Maarten:
    • the MW Readiness WG is the right place to have such things organized
    • in the Sep meeting we will have a checkpoint on the progress

Theme: Providing reliable storage - IN2P3

presentation

  • Maarten: do you have some services permanently available on a UPS?
  • IN2P3-CC:
    • the whole building is on a UPS with a minimum lifetime of about 30 minutes
    • its main function is to allow switching to the other power line transparently
    • if needed, we can extend the lifetime by starting to switch off all the WNs etc.

  • Julia: how often do you see file losses from tape?
  • IN2P3-CC:
    • typically a few files per month
    • such incidents tend to get revealed during repack operations
  • Xin: couldn't most such files be recovered by the vendor?
  • IN2P3-CC:
    • we usually try other ways to recover the files first (other tapes or copy from another site)
    • even if the vendor manages to recover part of the data, the files typically are corrupted

  • Vikas: what are your RAID group disk sizes and rebuild times?
  • IN2P3-CC:
    • each disk is 6 to 8 TB, the next ones will be 10 TB
    • we have ~145 TB per server
    • the rebuild time is ~24h
    • we need to rebuild 1 or 2 times per year
  • Vikas: 24h is a rather big window for another disk to fail as well...
  • Maarten: various parameters need to be taken into account and optimized together;
    in the end there will always be a calculated risk (roughly estimated in the sketch below)...
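A rough back-of-the-envelope sketch of that calculated risk: the probability that at least one more disk in the RAID group fails during the rebuild window, assuming independent failures at a constant rate. The annual failure rate and group size below are assumptions chosen for illustration, not IN2P3-CC figures.

    from math import exp

    def p_second_failure(afr: float, n_remaining_disks: int, window_hours: float) -> float:
        """P(at least one more disk fails) under a constant (exponential) failure rate."""
        rate_per_hour = afr / (365 * 24)
        return 1 - exp(-rate_per_hour * n_remaining_disks * window_hours)

    if __name__ == "__main__":
        # e.g. 2% AFR, 11 surviving disks in a 12-disk group, 24 h rebuild window
        p = p_second_failure(afr=0.02, n_remaining_disks=11, window_hours=24.0)
        print(f"P(another disk failure during rebuild) ~ {p:.2%}")  # about 0.06%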

  • IN2P3-CC:
    • for the evolution of our tape system we see 2 options:
      • move to IBM Jaguar, which would imply replacing the whole library
      • move to LTO, which so far we have only used for backups in TSM
    • we would like to discuss such matters e.g. in HEPiX
    • and get an idea of the reliability experiences at other sites
  • Alessandro:
    • we have the same matter to deal with at CNAF
    • we have had meetings with several vendors (IBM, Quantum, Spectra Logic)
    • we heard some sites are staying with T10KD for the time being
    • LTO may not be good enough for heavy stage-in and -out operations
    • we support the revival of the tape forum to discuss these things
  • Julia:
    • we will first follow up with the owner of the existing list
    • we will ensure there will be a forum and announce it

Action list

Creation date / Description / Responsible / Status / Comments

  • 01 Sep 2016 / Collect plans from sites to move to EL7 / WLCG Operations / Ongoing
    • The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress.
    • Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting.
    • March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5, to be released in May.
    • May 18 update: UMD 4.5 has been delayed to June.
    • July 6 update: UMD 4.5 has been delayed to July.
  • 03 Nov 2016 / Review VO ID Card documentation and make sure it is suitable for multicore / WLCG Operations / Pending
    • Jan 26 update: needs to be done in collaboration with EGI.
  • 26 Jan 2017 / Create long-downtimes proposal v3 and present it to the MB / WLCG Operations / Pending
    • May 18 update: EGI collected feedback from sites and propose a compromise - 3 days' notice for any scheduled downtime.
  • 18 May 2017 / Follow up on the ARC forum for WLCG site admins / WLCG Operations / In progress
  • 18 May 2017 / Prepare discussion on the strategy for handling middleware patches / Andrea Manzi and WLCG operations / In progress
  • 06 Jul 2017 / Ensure a forum exists for discussing tape matters / WLCG Operations / New

Specific actions for experiments

Creation date / Description / Affected VO / Affected TF/WG / Comments / Deadline / Completion

Specific actions for sites

Creation date / Description / Affected VO / Affected TF/WG / Comments / Deadline / Completion

AOB
