WLCG Operations Coordination Minutes, June 2nd 2016

Highlights

Agenda

Attendance

  • local:
  • remote:
  • apologies: Alessandra Doria

Operations News

As the last Ops Coord meeting took place on April 28th, here are some of May's events:

New WLCG Accounting TF

Middleware News

  • Baselines/News:
    • SLC 6.8 / CentOS 6.8 is out (standard SL will soon follow):
      • contains a fix for an OpenLDAP crash affecting the top-level BDII and the ARC-CE resource BDII. It has already been tested at CERN and works fine; the production BDII at CERN will soon be updated.
      • edg-mkgridmap 4.0.2 no longer works with this update. A new version (4.0.3), already available for CentOS 7, fixes the issue and is backward compatible. It has been pushed to the WLCG repo for SL6 and is already available in EPEL testing.
    • UMD 4 is out, initially with the UMD 3 packages for SL6 plus the first CentOS 7 / Ubuntu packages. An update (4.1.0) has already been released; it contains a new version of StoRM (1.11.11), which fixes a vulnerability in the WebDAV interface (http://repository.egi.eu/2016/05/27/storm-1-11-11/). It is to become the WLCG baseline.
    • A new version of the Xrootd ATLAS N2N plugin is available in the WLCG repo. It contains performance improvements related to the latest version of Xrootd.
    • A first testing version of the UI for CentOS 7 has been published to the CVMFS grid.cern.ch (/cvmfs/grid.cern.ch/centos7-ui-test). The process to release it in UMD has started.
  • Issues:
    • A problem was found in the FTS SOAP interface (only used by CMS PhEDEx) where the Optimizer is not enabled; it was introduced in v3.3.0. The fix, released in FTS 3.4.5, has already been deployed at CERN; other sites will update starting next week. (A workaround to manually enable the Optimizer has already been applied at RAL.)
    • glite-ce-job-submit to dual-stack CREAM CEs fails from CERN (GGUS:120586). The problem affects both LHCb and ATLAS.
    • Several vulnerabilities have been discovered in recent months (not public yet). Sites have been contacted via EGI SVG broadcasts or individually.

  • T0 and T1 services
    • BNL
      • moved to Xrootd 4.3.0
    • CERN
      • Xrootd on Castor nodes upgraded to 4.3.0
      • FTS upgraded to v 3.4.5
      • TopBDII upgrade to SLC6.8 planned
      • possible Castor and EOS upgrade for ATLAS next week during the technical stop
    • CNAF
      • StoRM upgrade to 1.11.11
    • IN2P3
      • dCache upgrade to 2.13.26 (on pools)
      • global dCache upgrade to 2.13.30 during the June downtime
      • xrootd upgrade to 4.2.3-3 for the ALICE T2 (disk only) during the June downtime
    • JINR
      • dCache upgrade to 2.10.59; PostgreSQL update to 9.4.8
    • NDGF
      • dCache upgraded to a 2.16 pre-release
    • TRIUMF
      • Plan to upgrade to dCache 2.13.30 on June 8th

Tier 0 News

  • Overloads and saturations of the Meyrin-Wigner link have been investigated; some occurrences have been traced back to an experiment accessing several petabytes of very old data recorded in Meyrin before the Wigner capacity became available. These data may need to be rebalanced. The problem will be alleviated once the third 100 Gbps link becomes available; progress has been made on identifying the local loop in Budapest.

  • Poor performance of CMS transfers to Wigner has been traced to the behaviour of CERN's switches; flow control will be added to the switches on 06-Jun-2016.

  • A new configuration of the ALICE->CASTOR system was validated demonstrating stable running conditions. Work is going on to further increase the ability to empty the ALICE DAQ disks to catch up after potential transfer interruptions.

  • A new version of ARGUS, supposedly addressing the long-standing stability issues, was found to have introduced two new bugs, and was not deployed. We are awaiting further fixes by the developer.

  • The CREAM CEs have shown instabilities that we are trying to address temporarily by using larger hosts. At the same time, investigations are ongoing into a number of potential options to resolve the issue.

  • A scale test with CMS for writing into EOS is being prepared in order to validate configuration changes intended to increase the transfer rates as needed by the activities of the heavy-ion community.

  • The modified planning for the LHC technical stop in June has had an impact on the Tier-0 activity planning. It is now foreseen to deploy a number of security, functional and operating system changes in the DB services pending since the previous technical stop; subject to approval by ATLAS, a minor EOS upgrade (to 0.3.171) may be applied as well.

  • The strategy for CPU resources for CMS (both Tier-0 and Tier-2 capacity at CERN) will be discussed further with the CMS computing management.

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • SAM: the critical profile for A/R computations now includes selected MonALISA metrics
    • see http://wlcg-sam-alice.cern.ch
    • in particular the SE services are now taken into account
    • the profile will be used as of May 1st (sic)
      • but this first month we are flexible with corrections
  • CERN: CREAM job submission incident on Sunday May 22 (GGUS:121699)
    • None of the CEs worked
    • Most had their sandbox partitions full
    • The main usage was due to ALICE jobs, apologies!
    • An ad-hoc cleanup was done
    • New jobs will by default not make use of output sandboxes
      • they are only needed for debugging

ATLAS

  • The large-scale MC15c digi+reco (MCORE) production campaign is done: ~5B events produced in ~3 months. MC16 simulation is now being tested in the system.
  • Derivation production on the reprocessed data and MC15c samples is also done. The total volume of derivations exceeded the budget by ~50%, which is being discussed. A new round of derivation production for data+MC15c will start soon with updated software.
  • CERN resources are not fully used due to an ATLAS Pilot Factory/CREAM problem (Condor's CREAM_DELEGATION error seen on some CREAM CEs), followed up at https://its.cern.ch/jira/browse/ASPDA-196. The Condor team provided a patched binary for Condor 8.4, which will be deployed.
  • Direct I/O instead of copy-to-scratch mode is being tested with HammerCloud and derivation jobs, the latter being the main customer.

CMS

  • Data taking resumed on the afternoon of Thursday the 26th
  • The CMS StorageManager team is discussing with CERN IT how to optimize transfers from P5 to CERN EOS, in order to cope with the rate needed for the HI run at the end of the year

  • MC DigiReco production for ICHEP continuing
  • T1/T2 quotas increased to match 2016 disk pledges

  • Enabling multi-core pilots: 36800 T1, 81000 T2 slots enabled
  • Will start to check IPv6 dual-stack operation with several services
    • Myproxy, VOMS, CVMFS, FTS, PhEDEx, Frontier, Xrootd, cmsweb
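
A dual-stack check of this kind can be sketched in a few lines (an illustration only, not the CMS test procedure; any host name passed to it is just an example):

```python
import socket

def stack_support(addrinfos):
    """Classify resolver results: which address families does a host publish?"""
    families = {ai[0] for ai in addrinfos}
    return {"ipv4": socket.AF_INET in families,
            "ipv6": socket.AF_INET6 in families}

def check_host(host):
    """Resolve a host and report whether it offers IPv4 (A) and IPv6 (AAAA) addresses."""
    return stack_support(socket.getaddrinfo(host, None))

# e.g. (requires network): check_host("fts3.cern.ch")
```

A service counts as dual-stack when both entries come back True; a real validation would additionally confirm that the service actually listens on both addresses.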

  • We look forward to a first CRIC trial setup/prototype for CMS

  • On Saturday the 30th, the TWiki, including the shifter instructions, was not reachable

LHCb

  • Still validating workflows for 2016 data processing
    • Once validation is finished we do not expect strain on the sites, as the files are staged.
    • We are, however, migrating RAW files to tape (i.e. quite some load on Castor and low pressure at the Tier-1s, typically 150 MB/s) and shall migrate RDST files as they are produced.
  • Generally low activity levels on the grid

  • Infrastructure:
    • Commissioning more dual-stack VO-boxes and moving LHCbDirac services to them

  • T1 sites:
    • RAL: two disk servers went down on Tuesday; they were back on Wednesday afternoon

Ongoing Task Forces and Working Groups

HTTP Deployment TF

Information System Evolution


  • Evaluation, with the REBUS and AGIS developers, of the capacity view, to consider whether it can be dropped from REBUS
  • Ongoing discussions with EGI and relevant experts (MW Officer, Security) on evaluating the impact of stopping BDII
  • Testing deployment of CRIC prototype as well as playing with first CMS data on it. More details in CRIC Evaluation
  • The WLCG scope tags in GOCDB are in the process of being validated (33 out of 123 tickets still open); thanks to the sites for their effort. More details in the VO Tags validation twiki. There are ongoing discussions on whether it makes sense to repeat this validation on a regular basis, and whether to compare with the VOfeeds and maintain a one-to-one relationship between tags and resources in the VOfeed.
  • Next TF meeting scheduled for 16.06.2016

IPv6 Validation and Deployment TF


Next week's pre-GDB is devoted to IPv6, as is a two-hour slot in the GDB. The main topics to be discussed are:

  • Experiment requirements
  • Status of support for IPv6-only CPUs
  • Experience on dual-stack services
  • Monitoring and IPv6
  • Security and IPv6
  • Status of WLCG tiers and LHCOPN/LHCONE

Machine/Job Features TF

Working through the mandate's steps:

  • Check the completeness of the proposal for machine/job features (The specification was published as HSF-TN-2016-02)
  • Coordinate implementations used in WLCG and an interface for its usage to the VOs (MJF for PBS/Torque, HTCondor, and Grid Engine in production)
  • Provide means to monitor the correctness of the provided information (Probe for ETF/SAM now on etf-lhcb-preprod)
  • Plan and execute the deployment of those implementations at all WLCG resources

MJF deployments on batch systems are passing tests at Cambridge (Torque), GridKa (Grid Engine), Liverpool (HTCondor), Manchester (Torque), PIC (Torque), and RAL PPD (HTCondor). We are aware of other sites currently working on deployment (e.g. RAL Tier-1 and CERN). The UK intends to deploy at all sites.

All Vac/Vcycle VM-based sites have MJF available as part of the VM management system.

Status talk given at GDB in May, which has more about the specification and implementations.
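
For reference, the specification exposes two directories via the $MACHINEFEATURES and $JOBFEATURES environment variables, with one value per file, keyed by file name. A minimal consumer might look like this (a sketch; the key names follow HSF-TN-2016-02, but see the note for the authoritative list):

```python
import os

def read_features(directory):
    """Read one machine/job features directory: one value per file, keyed by file name."""
    features = {}
    for key in os.listdir(directory):
        path = os.path.join(directory, key)
        if os.path.isfile(path):
            with open(path) as fh:
                features[key] = fh.read().strip()
    return features

def remaining_wallclock(job_features, now_secs):
    """Wallclock budget left, from the jobstart_secs and wall_limit_secs keys (seconds)."""
    start = int(job_features["jobstart_secs"])
    limit = int(job_features["wall_limit_secs"])
    return limit - (now_secs - start)
```

A payload would typically call read_features(os.environ["MACHINEFEATURES"]) and read_features(os.environ["JOBFEATURES"]), then use values such as the remaining wallclock to decide whether to pick up more work.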

Middleware Readiness WG


Network and Transfer Metrics WG


  • WLCG Network Throughput SU:
    • ASGC connectivity - after numerous tests performed in collaboration with ASGC and ESnet (http://etf.cern.ch/perfsonar_asgc.txt), the root cause has been confirmed to be the local N7K router at ASGC. Once the perfSONARs were moved directly to the central router, the measured network performance improved by a factor of 10. Our recommendation is to re-wire all existing data transfer nodes to bypass the local router, and to tune the central router and the data transfer nodes to improve their performance for long-path transfers (200 ms+).
    • Two new tickets received related to packet loss observed at RAL and SARA
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • North American Throughput meeting held on 1st of June:
    • Shawn presented the OSG network area roadmap; the main focus will be on developing notification/alerting, supporting higher-level services (analytics) and preparing for SDN
    • Next meeting is on 22nd June - main topic will be perfSONAR 4.0
  • WLCG Throughput meeting will be held on 16th of June - main topic is re-organization of the meshes
  • perfSONAR 4.0 (formerly 3.6) RC expected end of June
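
On the long-path tuning mentioned above: a single TCP stream is bounded by roughly window/RTT, so a 200 ms path needs large buffers to perform. A back-of-the-envelope sketch (illustrative numbers, not perfSONAR results):

```python
def max_throughput_mbps(window_bytes, rtt_seconds):
    """Single-stream TCP upper bound: window / RTT, converted to Mbit/s."""
    return window_bytes * 8 / rtt_seconds / 1e6

# A legacy 64 KiB window over a 200 ms path caps a stream at ~2.6 Mbit/s ...
default_cap = max_throughput_mbps(64 * 1024, 0.200)

# ... while sustaining 1 Gbit/s at 200 ms requires a ~25 MB window.
window_for_1gbps = 1e9 / 8 * 0.200  # bytes
```

This is why untuned data transfer nodes that look fine on short paths collapse on 200 ms+ routes, and why the recommendation includes host tuning as well as re-wiring.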

RFC proxies

  • EGI and OSG have been asked whether RFC proxies can be made the default proxy type
    • new version of VOMS clients
    • new version of YAIM core for myproxy-init
    • both changes are trivial
    • EGI will get back soon, OSG did not reply yet
  • any remaining compatibility concerns for central services?
  • pilot factories?
    • the main concern for them was the use of old CREAM code in HTCondor
      • that also happens to use MD5 in proxy delegations
    • all factories should be OK now?
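
As a reminder of the distinction being discussed: legacy Globus proxies and RFC 3820 proxies can be told apart by the final CN appended to the subject DN (real tools inspect the ProxyCertInfo extension instead). A toy classifier, for illustration only; the DNs in the test are made up:

```python
def proxy_flavour(subject_dn):
    """Heuristic proxy classification by the last CN of the subject DN.

    Legacy Globus proxies append CN=proxy (or CN=limited proxy), while
    RFC 3820 proxies append a numeric CN. This is only an illustration;
    production code should check the ProxyCertInfo certificate extension.
    """
    last_cn = subject_dn.rsplit("/CN=", 1)[-1]
    if last_cn in ("proxy", "limited proxy"):
        return "legacy"
    if last_cn.isdigit():
        return "RFC 3820"
    return "not a proxy"
```

The MD5 concern above is separate: it is about the signature algorithm used in old CREAM delegation code, not about the proxy format itself.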

Squid Monitoring and HTTP Proxy Discovery TFs

IT WLCG Monitoring Consolidation

Action list

Creation date | Description | Responsible | Status | Comments
17.03.2016 | ALICE, ATLAS and LHCb can stop gLExec tests at their convenience. Recipes have been provided. | Maarten | DONE |

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion
29.04.2016 | Unify HTCondor CE type name in experiment VOfeeds | all | - | Proposal to use HTCONDOR-CE. Waiting for VOfeed responsibles to confirm this is updated | | Ongoing

Specific actions for sites

Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion

AOB

-- MariaALANDESPRADILLO - 2016-06-01

Topic revision: r18 - 2016-06-02 - AndreaManzi