WLCG Operations Coordination Minutes, June 2nd 2016

Highlights

  • edg-mkgridmap 4.0.2 no longer works after the SLC6.8/CentOS 6.8 update. A new version (4.0.3), already available for CentOS7, fixes the issue and is backward compatible. It has been pushed to the WLCG repo for SL6 and is already available in EPEL testing.
  • A problem was found affecting the FTS SOAP interface (only used by CMS PhEDEx) where the Optimizer is not enabled. The problem was introduced in v 3.3.0. The fix, released in FTS 3.4.5, has already been deployed at CERN; other instances will update starting next week. (A workaround to manually enable the Optimizer has already been applied at RAL.)
  • The HTTP Deployment TF has now concluded. A summary of the TF's activities has been added to the twiki - https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment.
  • There is a new WLCG Accounting TF that will focus on CPU/Wallclock time accounting, fixing the accounting reports and improving the accounting portal. More details in the twiki https://wlcg-ops.web.cern.ch/wlcg-accounting.
  • There is an important change in the ALICE SAM profile.
  • A status update of the IT WLCG Monitoring Consolidation project was presented during the meeting. Check slides for more details.

Agenda

Attendance

  • local: Maria Dimou, Maria Alandes, Andrew McNab, Maarten Litmaath, Giuseppe Bagliesi, Jerome Belleman, Alberto Aimar, Oliver Keeble, Raja Nandakumar, Andrea Manzi, Marian Babik, Andrea Sciaba, Alessandro Di Girolamo
  • remote: Jens Larsson, Nurcan Ozturk, Robert Ball, Thomas Beermann, Di Qing, Stephan Lammel, Thomas Hartmann, Christoph Wissing, Antonio Maria Perez Calero Yzquierdo, Mario Lassnig, Dave Mason, Aresh, Eric Lancon, Alessandra Forti, Eddie Karavakis, Luca Magnoni, Gareth Smith, Rob Quick, Victor Zhiltsov, Pepe Flix, Renaud Vernet
  • apologies: Alessandra Doria

Operations News

As the last Ops Coord meeting took place on April 28th, here are some of May's events:

New WLCG Accounting TF

Middleware News

  • Baselines/News:
    • SLC6.8 / CentOS 6.8 is out (standard SL will soon follow):
      • contains a fix for an OpenLDAP crash affecting the TopBDII and the ARC-CE resource BDII. Already tested at CERN and working fine. The production BDII at CERN will soon be updated.
      • edg-mkgridmap 4.0.2 no longer works after this update. A new version (4.0.3), already available for CentOS7, fixes the issue and is backward compatible. It has been pushed to the WLCG repo for SL6 and is already available in EPEL testing
    • UMD 4 is out, initially with the UMD 3 packages for SL6 plus the first CentOS7/Ubuntu packages. An update (4.1.0) has already been released; it contains a new version of StoRM (1.11.11) which fixes a vulnerability in the WebDAV interface (http://repository.egi.eu/2016/05/27/storm-1-11-11/). It is to become the WLCG baseline.
    • A new version of the Xrootd ATLAS N2N plugin is available in the WLCG repo. It contains performance improvements related to the latest version of Xrootd.
    • A first testing version of the UI for CentOS 7 has been published to the CVMFS grid.cern.ch repository (/cvmfs/grid.cern.ch/centos7-ui-test). The process to release it in UMD has started.
  • Issues:
    • A problem was found affecting the FTS SOAP interface (only used by CMS PhEDEx) where the Optimizer is not enabled. The problem was introduced in v 3.3.0. The fix released in FTS 3.4.5 has already been deployed at CERN; other instances will update starting next week. (A workaround to manually enable the Optimizer has already been applied at RAL.)
    • glite-ce-job-submit to dual-stack CREAM CEs fails from CERN (GGUS:120586). The problem affects both LHCb and ATLAS (see the reproduction sketch after this list).
    • Several vulnerabilities have been discovered in recent months (not yet public). Sites have been contacted via EGI SVG broadcasts or individually.
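
For illustration, the failing operation is a plain CREAM submission to a dual-stack endpoint; a minimal sketch (the CE host, queue and JDL below are placeholders, not the actual details from GGUS:120586):

    # Hypothetical minimal JDL (standard CREAM JDL syntax)
    printf '[ Executable = "/bin/hostname";
      StdOutput = "out.txt"; StdError = "err.txt";
      OutputSandbox = { "out.txt", "err.txt" }; ]
    ' > test.jdl
    # Submit with automatic proxy delegation (-a) to a dual-stack CE (placeholder endpoint)
    glite-ce-job-submit -a -r dualstack-ce.example.org:8443/cream-pbs-grid test.jdl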

  • T0 and T1 services
    • BNL
      • moved to Xrootd 4.3.0
    • CERN
      • Xrootd on Castor nodes upgraded to 4.3.0
      • FTS upgraded to v 3.4.5
      • TopBDII upgrade to SLC6.8 planned
      • possible Castor and EOS upgrades for ATLAS next week during the technical stop
    • CNAF
      • StoRM upgrade to 1.11.11
    • IN2P3
      • dCache upgrade to 2.13.26 (on pools)
      • global dCache upgrade to 2.13.30 for the June downtime
      • Xrootd upgrade to 4.2.3-3 for the ALICE T2 (disk only) for the June downtime
    • JINR
      • dCache upgrade to 2.10.59; PostgreSQL update to 9.4.8
    • NDGF
      • dCache upgraded to a 2.16 pre-release
    • TRIUMF
      • Plan to upgrade to dCache 2.13.30 on June 8th

Raja mentioned that the dual-stack problem with CREAM CEs is becoming a blocking issue for LHCb, who already raised it at the WLCG Operations meeting on Monday. It was agreed to raise the priority of the ticket in GGUS.

Tier 0 News

  • Overloads and saturation of the Meyrin-Wigner link have been investigated; some occurrences have been traced back to an experiment accessing several petabytes of very old data recorded in Meyrin before the Wigner capacity became available. These data may need to be rebalanced. The problem will be alleviated when the third 100 Gbps link becomes available; progress has been made on identifying the local loop in Budapest.

  • Poor performance of CMS transfers to Wigner has been traced to the behaviour of CERN's switches; flow control will be added to the switches on 06-Jun-2016.

  • A new configuration of the ALICE->CASTOR system was validated, demonstrating stable running conditions. Work is ongoing to further increase the ability to empty the ALICE DAQ disks, in order to catch up after potential transfer interruptions.

  • A new version of ARGUS, supposedly addressing the long-standing stability issues, was found to have introduced two new bugs, and was not deployed. We are awaiting further fixes by the developer.

  • The CREAM CEs have shown instabilities that we are temporarily trying to address by using larger hosts. At the same time, a number of potential options to resolve the issue are being investigated.

  • A scale test with CMS for writing into EOS is being prepared in order to validate configuration changes intended to increase the transfer rates as needed by the activities of the heavy-ion community.

  • The modified planning for the LHC technical stop in June has had an impact on the Tier-0 activity planning. It is now foreseen to deploy a number of security, functional and operating system changes in the DB services pending since the previous technical stop; subject to approval by ATLAS, a minor EOS upgrade (to 0.3.171) may be applied as well.

  • The strategy for CPU resources for CMS (both Tier-0 and Tier-2 capacity at CERN) will be discussed further with the CMS computing management.

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • SAM: the critical profile for A/R computations now includes selected MonALISA metrics
    • see http://wlcg-sam-alice.cern.ch
    • in particular the SE services are now taken into account
    • the profile will be used as of May 1st (sic)
      • but this first month we are flexible with corrections
  • CERN: CREAM job submission incident on Sunday May 22 (GGUS:121699)
    • None of the CEs worked
    • Most had their sandbox partitions full
    • The main usage was due to ALICE jobs, apologies!
    • An ad-hoc cleanup was done (see the sketch after this list)
    • New jobs will by default not make use of output sandboxes
      • they are only needed for debugging
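
As an illustration of the kind of ad-hoc cleanup mentioned above, a hedged sketch only; the sandbox location and the age threshold are assumptions, not the actual commands used on the CERN CEs:

    # Remove sandbox directories untouched for more than 30 days
    # (the path /var/cream_sandbox and the 30-day cutoff are assumptions)
    find /var/cream_sandbox -mindepth 2 -maxdepth 2 -type d -mtime +30 -exec rm -rf {} +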

ATLAS

  • Large scale MC15c digi+reco (MCORE) production campaign is done, ~5B events produced in ~3 months. MC16 simulation is being tested in the system now.
  • Derivation production has also finished on the reprocessed data and the MC15c samples. The total volume of derivations exceeded the budget by ~50%, which is being discussed. A new round of derivation production for data+MC15c will start soon with updated software.
  • CERN resources are not fully used due to an ATLAS pilot factory/CREAM problem (Condor's CREAM_DELEGATION error seen on some CREAM CEs), followed up at https://its.cern.ch/jira/browse/ASPDA-196. The Condor team provided a patched binary for Condor 8.4, which will be deployed.
  • Direct I/O instead of copy-to-scratch mode is being tested with HammerCloud and derivation jobs, the latter being the main customer.

CMS

  • Data taking resumed on the afternoon of Thursday the 26th
  • The CMS StorageManager team is discussing with CERN IT how to optimize transfers from P5 to CERN EOS, in order to cope with the rate needed for the HI run at the end of the year

  • MC DigiReco production for ICHEP continuing
  • T1/T2 quotas increased to match 2016 disk pledges

  • Enabling multi-core pilots: 36800 T1, 81000 T2 slots enabled
  • Will start to check IPv6 dual-stack support for several services:
    • Myproxy, VOMS, CVMFS, FTS, PhEDEx, Frontier, Xrootd, cmsweb

  • We look forward to a first CRIC trial setup/prototype for CMS

  • On Saturday the 30th, the TWiki, including the shifter instructions, was not reachable

LHCb

  • Still validating workflows for 2016 data processing
    • Once validation is finished we do not expect strain on sites, as files are staged.
    • We are however migrating RAW files to tape (i.e. quite some load on Castor and low pressure at the Tier1s, typically 150 MB/s) and we shall migrate RDST files when they are produced.
  • Generally low activity levels on the grid

  • Infrastructure:
    • Commissioning more dual-stack VO-boxes and moving LHCbDirac services to them

  • T1 sites:
    • RAL: two disk servers went down on Tuesday; they were back on Wednesday afternoon

Ongoing Task Forces and Working Groups

HTTP Deployment TF

Information System Evolution


  • Evaluation of the capacity view with the REBUS and AGIS developers, to consider whether it can be dropped from REBUS
  • Ongoing discussions with EGI and the relevant experts (MW Officer, Security) to evaluate the impact of stopping the BDII
  • Testing the deployment of the CRIC prototype, as well as playing with the first CMS data on it. More details in CRIC Evaluation
  • WLCG scope tags in GOCDB are in the process of being validated (33 out of 123 tickets still open). Thanks to the sites for their effort. More details in the VO Tags validation twiki. There are ongoing discussions on whether it makes sense to repeat this validation on a regular basis, and whether it makes sense to compare the tags with the VO feeds and maintain a one-to-one relationship between tags and resources in the VO feed.
  • Next TF meeting scheduled for 16.06.2016

IPv6 Validation and Deployment TF


Next week's pre-GDB is devoted to IPv6, as is a two-hour slot in the GDB. The main topics to be discussed are:

  • Experiment requirements
  • Status of support for IPv6-only CPUs
  • Experience on dual-stack services
  • Monitoring and IPv6
  • Security and IPv6
  • Status of WLCG tiers and LHCOPN/LHCONE

Alberto asks how the IPv6 monitoring is actually performed. Andrea explains that this is done via site monitoring with ETF and network monitoring via perfSONAR.
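
Independently of ETF and perfSONAR, a dual-stack endpoint can also be spot-checked by hand. A minimal sketch (the hostname is a placeholder):

    # Check that the host publishes both A (IPv4) and AAAA (IPv6) DNS records
    host -t A    service.example.org
    host -t AAAA service.example.org
    # Check reachability over each protocol explicitly
    curl -4 -sS -o /dev/null https://service.example.org/ && echo "IPv4 OK"
    curl -6 -sS -o /dev/null https://service.example.org/ && echo "IPv6 OK"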

Machine/Job Features TF

Working through the steps of the mandate:

  • Check the completeness of the proposal for machine/job features (The specification was published as HSF-TN-2016-02)
  • Coordinate the implementations used in WLCG and an interface for the VOs to use them (MJF for PBS/Torque, HTCondor and Grid Engine in production)
  • Provide means to monitor the correctness of the provided information (Probe for ETF/SAM now on etf-lhcb-preprod)
  • Plan and execute the deployment of those implementations at all WLCG resources

MJF deployments on batch systems are passing tests at Cambridge (Torque), GridKa (Grid Engine), Liverpool (HTCondor), Manchester (Torque), PIC (Torque) and RAL PPD (HTCondor). We are aware of other sites currently working on the deployment (e.g. the RAL Tier-1 and CERN). The UK intends to deploy at all sites.

All Vac/Vcycle VM-based sites have MJF available as part of the VM management system.

A status talk was given at the GDB in May, with more about the specification and the implementations.
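
For illustration, a payload discovers the features through the directories pointed to by $MACHINEFEATURES and $JOBFEATURES, as defined in HSF-TN-2016-02. A minimal sketch (the key names are taken from the note as we understand it and should be verified against it):

    # Read a few machine/job features if the site provides them
    if [ -n "$MACHINEFEATURES" ] && [ -n "$JOBFEATURES" ]; then
        hs06=$(cat "$MACHINEFEATURES/hs06" 2>/dev/null)         # HS06 rating of the worker node
        cores=$(cat "$JOBFEATURES/allocated_cpu" 2>/dev/null)   # CPUs allocated to this job
        wall=$(cat "$JOBFEATURES/wall_limit_secs" 2>/dev/null)  # wallclock limit in seconds
        echo "node HS06=$hs06, job cores=$cores, wall limit=${wall}s"
    fi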

Middleware Readiness WG


  • The last WG meeting took place on May 18th; see the Minutes here.
  • Based on the T0 and T1 feedback, the meeting concluded that the testbeds at the volunteer sites will be the only ones required to run the pakiti client. This was also presented at the May 24th MB. The need for a tool that securely reveals what is installed at a site will be followed up by the Information System Evolution TF. Detailed replies from the sites are available here.
  • Information on the experiments' CentOS7 intentions is needed. Please prepare it for the next meeting, proposed for July 6th at 4pm CEST.

Maria asks LHCb to provide feedback on their CC7 deployment plans, as they haven't given any so far in the working group. Raja explains that LHCb is still working on several things on the Dirac side to enable CC7. Andrew confirms that LHCb will participate actively in future meetings.

Network and Transfer Metrics WG


  • WLCG Network Throughput SU:
    • ASGC connectivity: after numerous tests performed in collaboration with ASGC and ESnet (http://etf.cern.ch/perfsonar_asgc.txt), the root cause has been confirmed to be the local N7K router at ASGC. Once the perfSONARs were moved directly to the central router, the measured network performance improved by a factor of 10. Our recommendation is to re-wire all the existing data transfer nodes to bypass the local router, and to tune the central router and the data transfer nodes to improve their performance for long-path transfers (200 ms+; see the tuning sketch after this list).
    • Two new tickets received related to packet loss observed at RAL and SARA
  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
  • North American Throughput meeting held on 1st of June:
    • Shawn presented the OSG network area roadmap; the main focus will be on developing notification/alerting, supporting higher-level services (analytics) and preparing for SDN
    • Next meeting is on 22nd June - main topic will be perfSONAR 4.0
  • WLCG Throughput meeting will be held on 16th of June - main topic is re-organization of the meshes
  • perfSONAR 4.0 (formerly 3.6) RC expected end of June
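
On the host-side tuning for long-path transfers recommended above, common settings in the spirit of the ESnet fasterdata guidance look as follows; a hedged sketch, with illustrative values rather than the ones agreed with ASGC:

    # Illustrative TCP buffer tuning for ~200 ms round-trip paths (example values only)
    sysctl -w net.core.rmem_max=67108864
    sysctl -w net.core.wmem_max=67108864
    sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"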

RFC proxies

  • EGI and OSG have been asked if RFC proxies can be made the default proxy type
    • new version of VOMS clients
    • new version of YAIM core for myproxy-init
    • both changes are trivial
    • EGI will get back to us soon; OSG has not replied yet
  • any remaining compatibility concerns for central services?
  • pilot factories?
    • the main concern for them was the use of old CREAM code in HTCondor
      • that also happens to use MD5 in proxy delegations
    • all factories should be OK now?

Maarten asks whether everything is now OK for the ATLAS and CMS pilot factories. ATLAS confirms it is. Giuseppe will follow up for CMS. Rob mentions that OSG will also confirm very soon whether RFC proxies can become the default. Maarten explains that the current version of voms-proxy-init creates a legacy proxy by default; the idea is that with the new version the default will be an RFC proxy. arc-proxy-init already creates RFC proxies by default.
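
With the current clients, an RFC proxy has to be requested explicitly. A minimal sketch (the VO name is a placeholder):

    # Request an RFC 3820 proxy explicitly; the proposal is to make this the default
    voms-proxy-init --voms myvo -rfc
    # Inspect which type of proxy was created (legacy vs RFC compliant)
    voms-proxy-info -type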

Squid Monitoring and HTTP Proxy Discovery TFs

Nothing to report.

IT WLCG Monitoring Consolidation

Alberto presents the status of the IT WLCG monitoring consolidation project and the overall strategy being followed. The idea is to move to more standard technologies, away from CERN-specific solutions. This will allow the project to benefit from community support, and also make CERN IT jobs more attractive through the use of popular and widespread technologies. It is also mentioned that different problems may require different solutions, which is why the spectrum of proposed technologies is very wide.

After a question on whether the new dashboard shown during the demo will support experiment-specific site names, Alberto explains that this is all under development and that the team will try to address all the features needed by the experiments. The idea is also to count on the experiments' expertise with the technologies used, to develop plugins and extra features specific to each experiment. The advantage of using standard technologies is that internal experiment expertise may be applied directly, without the need for expert assistance or development by the monitoring team.

There is also a question on whether the notification system could be made customisable. Alberto explains that in principle this could be customisable.

It was discussed how to proceed in the future. Alberto explains that at the moment they are in touch with experts within the experiments to test the different technologies and to select relevant use cases to start with. Everyone agrees that there could be updates in the Ops Coord meeting whenever interesting new functionality becomes available.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 17.03.2016 | ALICE, ATLAS and LHCb can stop gLExec tests at their convenience. Recipes have been provided. | Maarten | DONE | |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify HTCondor CE type name in experiments VOfeeds | all | - | Proposal to use HTCONDOR-CE. Waiting for VOfeed responsibles to confirm this is updated | | Ongoing |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

AOB

-- MariaALANDESPRADILLO - 2016-06-01
