WLCG Operations Coordination Minutes, May 7th 2015

Agenda

Attendance

  • local: Maria Alandes (chair), Andrea Sciaba' (minutes), Maria Dimou, Marian Babik, Stefan Roiser, David Cameron, Maarten Litmaath, Andrea Manzi
  • remote: Antonio Maria Perez Calero Yzquierdo, Di Qing, Hung-Te Lee, Maite Barroso, Thomas Hartmann, Ulf Tigerstedt, Vincenzo Spinoso, Catherine Biscarat, Rob Quick, Gareth Smith, Alessandra Doria, David Mason, Alberto Aimar, Michael Ernst, Alessandra Forti, Josep Flix, Frederique Chollet

Operations News

  • For those who couldn't attend the WLCG Workshop in Okinawa, please note that there will be a presentation about WLCG workshop conclusions by Alessandra Forti at the GDB next week.
  • Please remember to enter future meetings in our Indico category (WLCG Operations Coordination) to avoid meeting clashes. Clashes seem to have happened recently; with all meetings in Indico we can see when other meetings are scheduled and plan accordingly.

Stefan asked that the future WLCG ops coord meetings also be entered in Indico.

Middleware News

  • Baselines:
    • UMD 3.12.0 release this week:
      • ARGUS-PAP v1.6.4, fixing a blocking issue related to the latest Java upgrade
      • dCache server v2.10.24: bug fixes. Also verified at TRIUMF by the MW Readiness WG.
    • dCache 2.6.x removed from baselines as support ends in June (31 instances still running it, no T1s; FNAL have a patched 2.2).
      • Added new golden release branch with v2.12.5, verified by NDGF
    • New dCache versions 2.10.28/2.12.8 have just been released, fixing an important issue. Some sites are planning to upgrade or have already upgraded. We would like to test it in MW Readiness first (i.e. at TRIUMF).

Maarten noted that Tier-1 sites using dCache should try to avoid scheduling the upgrade at the same time, as it will take several hours of downtime. This will be further communicated at next Monday's 3 o'clock operations meeting.

  • MW Issues:
    • A major upgrade of torque arrived in EPEL (from torque-2.5.7 to torque-4.2.10). The new torque version is not compatible with the standard EMI torque installation. Sites are therefore advised not to update for the time being, otherwise the installation will break (some sites have already reported this problem, e.g. GGUS:113279). For sites that have already upgraded, the patched 2.5.13 version of torque has been pushed to the EMI third-party repo to allow a downgrade. Sites that got 4.2.x to work in the meantime can stay with it.

  • T0 and T1 services
    • FTS upgraded to v 3.2.33 at CERN and RAL
    • NDGF upgraded dCache to 2.12.8
    • CASTORALICE upgraded to 2.1.15-6 at CERN

Maarten added that Torque 2.5.13 is also in the EMI third-party repo and it has been validated by the security group. It should be indicated as baseline version.

Tier 0 News

  • The batch HTCondor pilot has been open for grid submission since Monday this week, with 96 CPUs and 2 ARC CEs; ATLAS and CMS are starting to use it. If other experiments are also interested, get in contact with us.
  • CERN had lower-than-usual WLCG availability figures in March for ATLAS and CMS, caused by UNKNOWN test results in all CEs, most probably caused by batch overload. We had a discussion with experiment representatives and WLCG monitoring to understand the causes and find solutions. The conclusion is that the possible causes are batch overload (nothing can be done about it) or ARGUS failures/overload; the latter needs more investigation. Additionally, the mapping of the pilot role at CERN was found to be suboptimal and will be improved: this pilot role competes with all CMS analysis Grid jobs at CERN and is the one that most often fails because the jobs get stuck.
  • We have tentatively proposed to LHCb to decommission the LFC service at the end of June.

Maarten would like to discuss the details of how to "fix" the mapping for the 'pilot' role, as it may be less simple than expected.

In general, the issue of being unable to land SAM tests on the WN frequently enough does not yet have a satisfactory solution.

Stefan announced that LHCb will stop writing to LFC from this or next week, but would like to have LFC up for a bit more time as a "safety net".

Tier 1 Feedback

Tier 2 Feedback

OS support in UMD releases

Vincenzo presented the plans in EGI for CentOS7 support. Already 13 products are ready for EPEL7, but in general CentOS7 is not a viable option for sites. The release of UMD4 (supporting EPEL7 and Ubuntu) is foreseen for September 2015 and the decommissioning of SL5 for March 2016.

It is likely that some products relevant for WLCG will not be ready for EPEL7 before 2016. Concerning worker nodes, the requirement for WLCG is to provide SL6 until the end of Run2, but as Stefan pointed out, there are already offers for resources on CentOS7 and this is an incentive for experiments to validate their software on it.

Experiments Reports

ALICE

  • normal to very high activity
    • taking advantage of opportunistic resources
  • CASTOR at CERN: file access instabilities for re-reco jobs (GGUS:113106)
    • ad-hoc cures were applied by CASTOR operations team to allow progress
    • code fix being implemented
  • intermittent proxy renewal failures for NIKHEF and SARA (GGUS:113240)

ATLAS

  • Overall running full (150k slots and more)
  • For the past 2 weeks also running mc15 digi+reco, MCORE only
    • started with 500 events per job; efficiency turned out to be quite poor, so the job length was doubled.
    • In general we are considering increasing job lengths for all MCORE jobs.
  • We need all sites to be able to provide ATLAS MCORE resources. If you need help, contact the WLCG task force.
  • Reprocessing of 2012 muon data and 2015 cosmic data finished successfully

  • A (very bad) Rucio/FTS issue was discovered which causes missing files after Rucio resubmits jobs when the FTS server takes too long to answer.
    • FTS service managers were asked to update FTS this week to mitigate the problem: done, thanks (though the first announcement was made on the 17th by the FTS3 developers)
  • Tier-0 data and computing workflow fully commissioned.
  • Computing Run Coordinator shift starts this week
  • Tier-0/1 critical services and GGUS alarm workflow (just cross-checking): has it been verified that, in case of trouble, all relevant people responsible for the services in the list are getting the notifications from GGUS alarms?

CMS

  • Apologies for absence(s) this week. Hyper-busy weeks with CMS meetings
    • CMS Collaboration week running now. Next week: 3-days cross-projects workshop with focus on Run2 readiness
  • CMS production activities continue
    • DIGI-only workflows observed to be very network/storage demanding.
    • Several sites reported network saturation
    • Evaluating the use of selected "strong" Tier-2 sites to add computing capacity for DIGI-RECO
    • Also, cloud team continuing work on the HLT for DIGI-RECO processing
  • Xrootd fallback seems broken for CMSSW builds based on ROOT6 (CMSSW_7_4_x) when the first open is attempted with DCAP
    • Problem identified and fixed in new releases by framework/ROOT experts
  • Plan to drop support of CRC32 checksum in CMS data transfer systems
    • Will provide only Adler32 for newly produced files
  • Had a meeting among CERN-IT, ATLAS and CMS to understand rather bad site readiness values for CERN
    • Good discussion
    • Common understanding and agreement on next steps
    • Details to be found in Maite's report
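
As an illustration of the checksum item above (CRC32 being dropped, Adler32 only for newly produced files), the following is a minimal sketch of how both checksums can be computed in one pass over a file using Python's standard zlib module. The function name and chunk size are chosen here for illustration and are not taken from any CMS data transfer tool.

```python
import io
import zlib


def stream_checksums(stream, chunk_size=1024 * 1024):
    """Return (adler32, crc32) of a binary stream as 8-digit hex strings.

    Hypothetical helper for illustration only; it is not part of any
    CMS or WLCG tool.
    """
    adler = 1  # Adler-32 starts from 1
    crc = 0    # CRC-32 starts from 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Both zlib functions accept the running value, so the data
        # is read in chunks instead of being loaded into memory at once.
        adler = zlib.adler32(chunk, adler)
        crc = zlib.crc32(chunk, crc)
    # Mask to 32 bits so the result is unsigned on every platform.
    return format(adler & 0xFFFFFFFF, "08x"), format(crc & 0xFFFFFFFF, "08x")


if __name__ == "__main__":
    adler_hex, crc_hex = stream_checksums(io.BytesIO(b"Wikipedia"))
    print("adler32:", adler_hex)
    print("crc32:  ", crc_hex)
```

For a real file, the same function can be called on an object opened with open(path, "rb").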

LHCb

  • Operational issues
    • SARA/NIKHEF data access problems ongoing (GGUS:113324)
      • probably related to fetch_crl
    • CASTOR/CERN SRM access problems (GGUS:113316)
    • WAN transfer issues in and out of NIPNE site (GGUS:112044)

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign covers all sites that currently have gLExec (61 out of 94)
    • the main issue at RAL, RALPP and TW-FTT was due to a bug in the pilot code
      • it happened to show up only at sites using the ARC CE + Condor

SHA-2

  • the old VOMS server aliases (lcg-)voms.cern.ch were removed on Tue Apr 28
  • this concludes the SHA-2 task force activities

RFC proxies

  • status of RFC proxy readiness to be followed up per experiment
    • ALICE done (being used at almost all sites where this matters)
    • CMS users have been using RFC proxies for months
  • SAM-Nagios proxy renewal code fix to support RFC proxies:
    • maybe no longer needed after SAM upgrade to UMD-3
      • latest VOMS client enforces the correct proxy type automatically
  • infrastructure readiness can then be checked with the sam-preprod hosts
    • no failures expected due to proxy type

Machine/Job Features

  • NTR

Middleware Readiness WG

  • The 10th MW Readiness meeting took place yesterday, May 6th (see the agenda).
  • After one year of full Readiness Verification activity, the WG is making a check-point of goals and priorities.
  • ATLAS and CMS were invited to review their workflow twikis for possible changes in the MW products to verify.
  • LHCb and ALICE are invited to declare if and for which products they plan to contribute to the MW Readiness WG.
  • Possible stress tests were discussed for products already verified from within experiment workflows. The decision was to leave this to the MW Officer, the experiment involved and the site on a case-by-case basis, as per the original definition in the WG documentation, point 2.5.
  • EOS test for CMS at CERN just started.
  • The ARGUS testbed at CERN is set up and ready to start.
  • NDGF, PIC, CNAF and TRIUMF are reminded to install the pakiti client (see the instructions).
  • The presentation on the software being developed for our activities aims to show the product versions under test and their matching to the relevant rpms and, in addition, an automated way to display baseline versions instead of the current, manually updated table.
  • The next Vidyo MW Readiness WG meeting will take place on Wednesday June 17th at 4pm CEST.

Multicore Deployment

IPv6 Validation and Deployment TF

  • LHCb: DIRAC was made IPv6-compatible back in November, but testing only started in April: a DIRAC installation on a dual-stack machine is running at CERN. It was successfully tested that it can be contacted from IPv6 and IPv4 nodes and can run jobs submitted from LXPLUS. However, 50% of client connections fail, which was hidden by automatic retries; the cause was found to be a CERN Python library (wrong IPv6 address returned).
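
A simple way to see which addresses a client obtains for a dual-stack host is to list them per address family. The sketch below uses only Python's standard socket module; the function name is hypothetical and the code has no relation to the DIRAC code or to the CERN library mentioned above.

```python
import socket


def addresses_by_family(host, port):
    """Return the resolved addresses of host, grouped into IPv4 and IPv6.

    Hypothetical helper for illustration only; it is not part of DIRAC.
    """
    result = {"IPv4": [], "IPv6": []}
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, _socktype, _proto, _canonname, sockaddr in infos:
        if family == socket.AF_INET:
            label = "IPv4"
        elif family == socket.AF_INET6:
            label = "IPv6"
        else:
            continue  # ignore any other address family
        address = sockaddr[0]
        if address not in result[label]:
            result[label].append(address)
    return result


if __name__ == "__main__":
    # "localhost" is used only as an example host name.
    print(addresses_by_family("localhost", 443))
```

On a correctly configured dual-stack service both lists should be non-empty and match the addresses published for the host; an unexpected entry in the IPv6 list would be consistent with the kind of failure described above.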

Squid Monitoring and HTTP Proxy Discovery TFs

  • No news

Network and Transfer Metrics WG

  • perfSONAR status
    • Security: NDT 3.7.0.1 was released, fixing a potential security issue in NDT. This shouldn't affect WLCG sites that followed our instructions, since they should have NDT/NPAD disabled. We encourage ALL sites to double-check this and also to ensure they have auto-updates enabled. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS (latest versions of all sub-components: Toolkit-3.4.2 (3.4.2-12.pSPS), BWCTL-1.5.4-1.el6, OWAMP-3.4-10.el6, NDT-3.7.0.1-2.el6, NPAD-1.5.6-3.el6, esmond-1.0-13.el6, Regular Testing Daemon-3.4.2-4.pSPS, iperf3-3.0.11-1.el6).
    • All meshes migrated from iperf to iperf3 and from traceroute to tracepath. This should improve our bandwidth measurements and enable MTU path discovery.
    • Very good progress in ramping up latency tests, currently with 34 sonars, we're able to consistently get results for all tested links.
  • Network performance incidents process put in place as was agreed at the last meeting (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents)
  • OSG/Datastore validation progressing well, resolved all performance issues and targeting July for production (progress already visible at http://psmad.grid.iu.edu/maddash-webui/).
  • Publishing results to the message bus is progressing: development of the esmond2mq prototype has been finalized and we plan to enter the pilot phase. An initial version of the proximity service (mapping sonars to storages) is in testing.
  • Last meeting held yesterday (https://indico.cern.ch/event/382623/), focused on FTS performance
    • Hassen Riahi (FTS dashboard) reported on FTS performance for WLCG during the first phase of production (3 months)
    • Initial report on the FTS performance study presented by Saul Youssef (Boston University), common study for ATLAS, CMS and LHCb. Early results already provide valuable insights and also show how we could benefit from integrating FTS and perfSONAR. Agreed to follow up on a regular basis at the next meetings.
  • Next meeting 3rd of June (https://indico.cern.ch/event/382624/). Plan is to focus it on latency ramp up and proximity service.

It was again clarified that the procedure currently in place to report network issues is to write to the network and transfer metrics WG mailing list (wlcg-ops-coord-wg-metrics@cern.ch). A GGUS support unit will eventually be created.

HTTP Deployment TF

Action list

  • The network incident (degradation) between TRIUMF and RAL reported by ATLAS will be a case to test the procedure put in place by the network metrics WG. (CLOSED)
  • CMS instructions to shifters to be changed so that tickets are not opened if just one CE is red. (ONGOING)
  • Maarten to follow up with the experiments to ensure that the dates for removing the VOMS servers' aliases, as reported in the SHA-2 TF section above, are kept. (CLOSED)

AOB

-- MariaALANDESPRADILLO - 2015-05-06

Topic revision: r24 - 2018-02-28 - MaartenLitmaath
 