WLCG Operations Coordination Minutes, May 7th 2015

Agenda

Attendance

  • local:
  • remote:

Operations News

  • For those who couldn't attend the WLCG Workshop in Okinawa, please note that there will be a presentation about WLCG workshop conclusions by Alessandra Forti at the GDB next week.
  • Please remember to enter future meetings in our Indico category WLCG Operations Coordination to avoid meeting clashes. Some clashes have occurred recently; if all meetings are entered in Indico, we can see when other meetings are scheduled and plan accordingly.

Middleware News

  • Baselines:
    • UMD 3.12.0 release this week:
      • ARGUS-PAP v1.6.4, fixing a blocking issue related to the latest Java upgrade
      • dCache server v2.10.24, with bug fixes; verified at TRIUMF by the MW Readiness WG as well
    • dCache 2.6.x removed from the baselines as its support ends in June (31 instances still run it, none at T1s; FNAL run a patched 2.2)
      • Added a new golden release branch with v2.12.5, verified by NDGF
    • New dCache versions 2.10.28 / 2.12.8 were just released, fixing an important issue; some sites are planning to upgrade or have already done so. We would like to test them in MW Readiness first (i.e. at TRIUMF)

  • MW Issues:
    • A major torque upgrade arrived in EPEL (from torque-2.5.7 to torque-4.2.10). The new torque version is not compatible with the standard EMI torque installation, so sites are advised not to update for the time being, otherwise the installation will break (some sites have already reported this problem, e.g. GGUS:113279). For sites that have already upgraded, the previous torque version has been pushed to the EMI third-party repo in order to allow a downgrade.
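
  As an illustration, a minimal Python sketch (an assumption, not an official procedure) that could spot the incompatible torque 4.x on an RPM-based node before it breaks the EMI installation:

      import subprocess

      # Query the installed torque version via rpm; returns None if not installed.
      def torque_version():
          try:
              out = subprocess.check_output(
                  ["rpm", "-q", "--qf", "%{VERSION}", "torque"])
          except subprocess.CalledProcessError:
              return None
          return out.decode().strip()

      version = torque_version()
      if version and int(version.split(".")[0]) >= 4:
          print("WARNING: torque %s installed - incompatible with the EMI setup;"
                " downgrade via the EMI third-party repo" % version)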

  • T0 and T1 services
    • FTS upgraded to v3.2.33 at CERN and RAL
    • NDGF upgraded dCache to 2.12.8
    • CASTORALICE upgraded to 2.1.15-6 at CERN

Tier 0 News

  • The HTCondor batch pilot has been open for grid submission since Monday this week, with 96 CPUs and 2 ARC CEs; ATLAS and CMS are starting to use it. If other experiments are also interested, please get in contact with us.
  • CERN had lower-than-usual WLCG availability figures in March for ATLAS and CMS, caused by UNKNOWN test results on all CEs, most probably due to batch overload. We had a discussion with experiment representatives and the WLCG monitoring team to understand the causes and find solutions. The conclusion is that the possible causes are batch overload (about which nothing can be done) or Argus failures/overload; the Argus one needs more investigation. Additionally, the mapping of the pilot role at CERN was found to be suboptimal and will be improved: this pilot role competes with all CMS analysis Grid jobs at CERN and is the one that most often fails because the jobs get stuck.
  • We have tentatively proposed to LHCb to decommission the LFC service at the end of June.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • normal to very high activity
    • taking advantage of opportunistic resources
  • CASTOR at CERN: file access instabilities for re-reco jobs (GGUS:113106)
    • ad-hoc cures were applied by CASTOR operations team to allow progress
    • code fix being implemented
  • intermittent proxy renewal failures for NIKHEF and SARA (GGUS:113240)

ATLAS

  • Overall running at full capacity (150k slots and more)
  • For the past 2 weeks also running mc15 digi+reco -- only MCORE
    • started with 500 events per job; we noticed the efficiency was quite poor, so we doubled the job length.
    • In general we are considering increasing the job lengths for all MCORE tasks.
  • We need all sites to be able to provide ATLAS MCORE resources. If you need help, contact the WLCG Multicore Deployment task force.
  • Reprocessing of 2012 muon data and 2015 cosmic data finished successfully

  • A (very bad) Rucio/FTS issue was discovered that causes missing files when Rucio resubmits jobs because the FTS server takes too long to answer (a schematic sketch of the race follows this list).
    • FTS service managers were asked to update FTS this week to mitigate the problem - done, thanks (though the first announcement from the FTS3 devs was already made on the 17th)
  • Tier-0 data and computing workflow fully commissioned.
  • Computing Run Coordinator shift starts this week
  • Tier-0/1 critical services and GGUS alarm workflow (just cross-checking): has it been verified that, in case of trouble, all relevant people responsible for the services in the list get the notifications from GGUS alarms?
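
  To illustrate the kind of race described above, here is a schematic Python sketch (all names are hypothetical, not the actual Rucio/FTS API): if the server may accept a job even though the client's request timed out, a blind retry creates a duplicate job, whereas an idempotency key reused across retries lets the server deduplicate:

      import uuid

      # Toy stand-in for a transfer service; not the real FTS interface.
      class FakeTransferServer:
          def __init__(self):
              self.jobs = {}

          def submit(self, transfer, key):
              # The server recognizes a retried request by its key instead
              # of creating a second job for the same transfer.
              if key not in self.jobs:
                  self.jobs[key] = transfer
              return key

      server = FakeTransferServer()
      key = str(uuid.uuid4())  # reused across retries on purpose
      # Model a timed-out first attempt followed by a blind resubmission:
      server.submit({"src": "site_a", "dst": "site_b"}, key)
      server.submit({"src": "site_a", "dst": "site_b"}, key)  # retry, same key
      assert len(server.jobs) == 1  # the key prevents a duplicate job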

CMS

  • Apologies for absence(s) this week: hyper-busy weeks with CMS meetings
    • CMS Collaboration week is running now. Next week: a 3-day cross-project workshop with focus on Run 2 readiness
  • CMS production activities continue
    • DIGI-only workflows observed to be very network/storage demanding.
    • Several sites reported network saturation
    • Evaluating the use of selected "strong" Tier-2 sites to add computing capacity for DIGI-RECO
    • Also, the cloud team continues work on the HLT for DIGI-RECO processing
  • Xrootd fallback seems broken for CMSSW builds based on ROOT6 (CMSSW_7_4_x) when the first open is attempted with DCAP
    • Problem identified and fixed in new releases by framework/ROOT experts
  • Plan to drop support of CRC32 checksum in CMS data transfer systems
    • Will provide only Adler32 for newly produced files (a checksum sketch follows this list)
  • Had a meeting among CERN-IT, ATLAS and CMS to understand the rather bad site readiness values for CERN
    • Good discussion
    • Common understanding and agreement on next steps
    • Details to be found in Maite's report
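
  Regarding the checksum item above: a minimal Python sketch of how an Adler32 checksum can be computed with the standard library (the chunked loop is just one reasonable way to stream large files):

      import zlib

      def adler32_of_file(path, chunk_size=1024 * 1024):
          # Adler32 starts from 1 and is updated chunk by chunk, so large
          # files can be checksummed in constant memory.
          value = 1
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(chunk_size), b""):
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xffffffff)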

LHCb

  • Operational issues
    • SARA/NIKHEF data access problems ongoing (GGUS:113324)
      • probably related to fetch_crl
    • CASTOR/CERN SRM access problems (GGUS:113316)
    • WAN transfer issues in and out of NIPNE site (GGUS:112044)

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign covers all sites that currently have gLExec (61 out of 94)
    • the main issue at RAL, RALPP and TW-FTT was due to a bug in the pilot code
      • it happened to show up only at sites using the ARC CE + Condor

SHA-2

  • the old VOMS server aliases (lcg-)voms.cern.ch were removed on Tue Apr 28
  • this concludes the SHA-2 task force activities

RFC proxies

  • status of RFC proxy readiness to be followed up per experiment
    • ALICE done (being used at almost all sites where this matters)
    • CMS users have been using RFC proxies for months
  • SAM-Nagios proxy renewal code fix to support RFC proxies:
    • maybe no longer needed after SAM upgrade to UMD-3
      • latest VOMS client enforces the correct proxy type automatically
  • infrastructure readiness can then be checked with the sam-preprod hosts
    • no failures expected due to proxy type
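
  As a quick check related to the items above, a small Python sketch that inspects the type of the current proxy by calling voms-proxy-info (assuming it is on the PATH; the exact output wording varies between voms-clients versions):

      import subprocess

      def proxy_is_rfc():
          # "voms-proxy-info -type" prints the proxy type, typically a
          # string mentioning "RFC" for RFC 3820 proxies.
          out = subprocess.check_output(["voms-proxy-info", "-type"])
          return b"RFC" in out

      print("RFC proxy" if proxy_is_rfc() else "legacy proxy")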

Machine/Job Features

  • NTR

Middleware Readiness WG

Multicore Deployment

IPv6 Validation and Deployment TF

  • LHCb: DIRAC was made IPv6-compatible back in November, but testing only started in April: a DIRAC installation on a dual-stack machine is running at CERN. It was successfully tested that it can be contacted from IPv6 and IPv4 nodes and can run jobs submitted from LXPLUS. However, 50% of client connections were failing, which had been hidden by automatic retries; this was found to be caused by a CERN Python library returning a wrong IPv6 address.
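
  As background for the failure mode above (trusting a single resolved address), a minimal Python sketch of dual-stack-safe connecting that iterates over all addresses returned by getaddrinfo instead (host and port are placeholders):

      import socket

      def connect_dual_stack(host, port, timeout=5.0):
          # getaddrinfo returns both IPv6 and IPv4 addresses on a dual-stack
          # host; trying them in order avoids failing on one bad address.
          last_error = None
          for family, socktype, proto, _, addr in socket.getaddrinfo(
                  host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
              sock = socket.socket(family, socktype, proto)
              sock.settimeout(timeout)
              try:
                  sock.connect(addr)
                  return sock  # first address that works, IPv6 or IPv4
              except OSError as error:
                  last_error = error
                  sock.close()
          raise last_error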

Squid Monitoring and HTTP Proxy Discovery TFs

  • No news

Network and Transfer Metrics WG

  • perfSONAR status
    • Security: NDT 3.7.0.1 was released, fixing a potential security issue in NDT. This shouldn't affect WLCG sites that followed our instructions, since they should have NDT/NPAD disabled. We encourage ALL sites to double-check this (a check sketch follows this list) and also to ensure they have auto-updates enabled. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS; the latest versions of all sub-components are Toolkit-3.4.2 (3.4.2-12.pSPS), BWCTL-1.5.4-1.el6, OWAMP-3.4-10.el6, NDT-3.7.0.1-2.el6, NPAD-1.5.6-3.el6, esmond-1.0-13.el6, Regular Testing Daemon-3.4.2-4.pSPS and iperf3-3.0.11-1.el6.
    • All meshes migrated from iperf to iperf3 and from traceroute to tracepath. This should improve our bandwidth measurements and enable MTU path discovery.
    • Very good progress in ramping up the latency tests: currently, with 34 sonars, we are able to consistently get results for all tested links.
  • Network performance incidents process put in place as was agreed at the last meeting (https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents)
  • OSG/Datastore validation progressing well, resolved all performance issues and targeting July for production (progress already visible at http://psmad.grid.iu.edu/maddash-webui/).
  • Publishing results to the message bus is progressing: development of the esmond2mq prototype has been finalized and we plan to enter the pilot phase. An initial version of the proximity service (mapping sonars to storages) is in testing.
  • Last meeting held yesterday (https://indico.cern.ch/event/382623/) - focused on FTS performance
    • Hassen Riahi (FTS dashboard) reported on FTS performance for WLCG during the first phase of production (3 months)
    • An initial report on the FTS performance study was presented by Saul Youssef (Boston University); this is a common study for ATLAS, CMS and LHCb. Early results already provide valuable insights and also show how we could benefit from integrating FTS and perfSONAR. It was agreed to follow up on a regular basis at the next meetings.
  • Next meeting on the 3rd of June (https://indico.cern.ch/event/382624/). The plan is to focus it on the latency ramp-up and the proximity service.
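
  Regarding the NDT/NPAD check mentioned under the perfSONAR status above, a small Python sketch of how a site could verify on an EL6 toolkit node that the ndt and npad services are disabled (the chkconfig output format is an assumption about typical EL6 systems):

      import subprocess

      def service_enabled(name):
          # "chkconfig --list <name>" prints the runlevel states on EL6,
          # e.g. "ndt  0:off 1:off 2:off 3:on 4:on 5:on 6:off".
          try:
              out = subprocess.check_output(["chkconfig", "--list", name])
          except (OSError, subprocess.CalledProcessError):
              return False  # service not installed or chkconfig missing
          return b":on" in out

      for service in ("ndt", "npad"):
          state = "ENABLED - please disable" if service_enabled(service) else "disabled"
          print("%s: %s" % (service, state))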

HTTP Deployment TF

Action list

  • The network incident (degradation) between TRIUMF and RAL reported by ATLAS will be a case to test the procedure put in place by the network metrics WG.
  • CMS instructions to shifters to be changed so that tickets are not opened if just one CE is red.
  • Maarten to follow up with the experiments that the dates for removing the VOMS server aliases, as reported in the SHA-2 TF section above, are kept.

AOB

-- MariaALANDESPRADILLO - 2015-05-06
