For those who couldn't attend the WLCG Workshop in Okinawa, please note that there will be a presentation about WLCG workshop conclusions by Alessandra Forti at the GDB next week.
Please remember to enter future meetings in our Indico category (WLCG Operations Coordination) to avoid meeting clashes. Clashes seem to have happened recently; with the meetings in Indico we can see when other meetings are scheduled and plan accordingly.
ARGUS-PAP v1.6.4, fixing a blocking issue related to the latest Java upgrade
dCache server v2.10.24: bug fixes. Also verified at TRIUMF by the MW Readiness WG.
dCache 2.6.x removed from the baselines as its support ends in June (31 instances still running it, none at T1s; FNAL have a patched 2.2).
Added a new golden release branch with v2.12.5, verified by NDGF
New dCache versions 2.10.28 / 2.12.8 were just released, fixing an important issue; some sites are planning to upgrade or have already upgraded. We would like to test them in MW Readiness first (e.g. at TRIUMF).
MW Issues:
A major upgrade of torque arrived in EPEL (from torque-2.5.7 to torque-4.2.10). The new torque version is not compatible with the standard EMI torque installation, so sites are advised not to update for the time being, otherwise the installation will break (some sites have already reported this problem, e.g. GGUS:113279). For sites that have already upgraded, the previous torque version has been pushed to the EMI third-party repository so that they can downgrade.
The HTCondor batch pilot has been open for grid submission since Monday this week, with 96 CPUs and 2 ARC CEs; ATLAS and CMS are starting to use it. If other experiments are also interested, get in contact with us.
CERN had lower-than-usual WLCG availability figures in March for ATLAS and CMS, caused by UNKNOWN test results in all CEs, most probably due to batch overload. We had a discussion with experiment representatives and WLCG monitoring to understand the causes and find solutions. The conclusion is that the possible causes are batch overload (about which nothing can be done) or ARGUS failures/overload; the ARGUS hypothesis needs more investigation. Additionally, the mapping of the pilot role at CERN was found to be suboptimal and will be improved; this pilot role competes with all CMS analysis Grid jobs at CERN and is the one that most often fails because the jobs get stuck.
We have tentatively proposed to LHCb to decommission the LFC service at the end of June.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
normal to very high activity
taking advantage of opportunistic resources
CASTOR at CERN: file access instabilities for re-reco jobs (GGUS:113106)
ad-hoc cures were applied by CASTOR operations team to allow progress
code fix being implemented
intermittent proxy renewal failures for NIKHEF and SARA (GGUS:113240)
For the past 2 weeks also running mc15 digi+reco -- MCORE only
Started with 500 events per job; we noticed the efficiency was quite poor, so we doubled the job length.
In general we are considering increasing the job lengths for all MCORE workflows.
We need all sites to be able to provide ATLAS MCORE resources. If you need help, contact the WLCG task force.
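The efficiency gain from doubling the job length can be illustrated with a simple amortization model; the overhead and per-event times below are assumed, illustrative numbers, not measured ATLAS values:

```python
def cpu_efficiency(events, sec_per_event, overhead_sec):
    """CPU efficiency of a job whose fixed overhead (setup, stage-in/out,
    merging) is amortized over the per-event payload work."""
    payload = events * sec_per_event
    return payload / (payload + overhead_sec)

# Illustrative (assumed) numbers: 1 h fixed overhead, 30 s/event payload.
e_short = cpu_efficiency(500, 30, 3600)    # ~0.81
e_long = cpu_efficiency(1000, 30, 3600)    # ~0.89
```

Under these assumed numbers, doubling the events per job raises the efficiency from roughly 0.81 to 0.89, which is the kind of gain motivating longer MCORE jobs.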
Reprocessing of 2012 muon data and 2015 cosmic data finished successfully
A (very bad) Rucio/FTS issue was discovered: files go missing when Rucio resubmits jobs because the FTS server takes too long to answer.
FTS service managers were asked to update FTS this week to mitigate the problem; done, thanks (although the first announcement from the FTS3 developers was already made on the 17th).
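The failure mode above is a classic retry race: the client resubmits after a timed-out reply even though the server had in fact accepted the first request. A toy sketch of the usual mitigation, deduplicating submissions by a client-generated idempotency key; the TransferServer class and its submit method are hypothetical illustrations, not the real FTS or Rucio API:

```python
import uuid

class TransferServer:
    """Toy stand-in for a transfer service (not the real FTS API): submissions
    are deduplicated by a client-generated idempotency key, so a retry after a
    timed-out reply cannot register the same transfer twice."""

    def __init__(self):
        self.jobs = {}  # idempotency key -> job id

    def submit(self, key, transfer):
        if key in self.jobs:  # retry of a request that already landed
            return self.jobs[key]
        job_id = str(uuid.uuid4())
        self.jobs[key] = job_id
        return job_id

# Client side: reuse the same key when resubmitting after a timeout, so the
# server returns the original job instead of creating a duplicate.
server = TransferServer()
key = str(uuid.uuid4())
first = server.submit(key, {"src": "siteA", "dst": "siteB"})
retry = server.submit(key, {"src": "siteA", "dst": "siteB"})  # after timeout
assert first == retry
```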
Tier-0 data and computing workflow fully commissioned.
Computing Run Coordinator shift starts this week
Tier-0/1 critical services and GGUS alarm workflow (just cross-checking): has it been verified that in case of trouble all relevant people responsible for the services in the list get the notifications from GGUS alarms?
CMS
Apologies for absence(s) this week. Hyper-busy weeks with CMS meetings
CMS Collaboration week running now. Next week: 3-day cross-project workshop with focus on Run 2 readiness
CMS production activities continue
DIGI-only workflows observed to be very network/storage demanding.
Several sites reported network saturation
Evaluating the use of selected "strong" Tier-2 sites to add computing capacity for DIGI-RECO
Also, cloud team continuing work on the HLT for DIGI-RECO processing
Xrootd fallback seems broken for CMSSW builds based on ROOT6 (CMSSW_7_4_x) when the first open is attempted with DCAP
Problem identified and fixed in new releases by framework/ROOT experts
Plan to drop support of CRC32 checksum in CMS data transfer systems
Will provide only Adler32 for newly produced files
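For reference, Adler-32 is available in the Python standard library via zlib; a minimal sketch of computing the hex checksum of a file the way transfer tools typically report it (the function name and chunk size are our own choices):

```python
import zlib

def adler32_hex(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, streamed in 1 MiB chunks."""
    value = 1  # Adler-32 starts at 1, unlike CRC32 which starts at 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xffffffff:08x}"
```

Feeding the running value back into zlib.adler32 makes the checksum incremental, so arbitrarily large files can be handled without loading them into memory.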
Had a meeting between CERN IT, ATLAS and CMS to understand the rather bad site readiness values for CERN
Good discussion
Common understanding and agreement on next steps
Details to be found in Maite's report
LHCb
Operational issues
SARA/NIKHEF data access problems ongoing (GGUS:113324)
status of RFC proxy readiness to be followed up per experiment
ALICE done (being used at almost all sites where this matters)
CMS users have been using RFC proxies for months
SAM-Nagios proxy renewal code fix to support RFC proxies:
maybe no longer needed after SAM upgrade to UMD-3
latest VOMS client enforces the correct proxy type automatically
infrastructure readiness can then be checked with the sam-preprod hosts
no failures expected due to proxy type
Machine/Job Features
NTR
Middleware Readiness WG
Multicore Deployment
IPv6 Validation and Deployment TF
LHCb: DIRAC was made IPv6-compatible back in November, but testing only started in April: a DIRAC installation on a dual-stack machine is running at CERN. It was successfully verified that it can be contacted from both IPv6 and IPv4 nodes and can run jobs submitted from LXPLUS. However, 50% of client connections were failing, hidden by automatic retries; this was found to be caused by a CERN Python library returning a wrong IPv6 address.
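The kind of resolver behaviour involved can be checked directly with the standard socket module; a minimal sketch (not DIRAC code) that lists every address a dual-stack host resolves to, where a buggy library would surface as a malformed IPv6 entry:

```python
import socket

def resolve_all(host, port=443):
    """Return (family, address) pairs for a host, covering IPv4 and IPv6.

    With AF_UNSPEC, getaddrinfo returns both A and AAAA results on a
    dual-stack node, so a resolver bug that yields a wrong IPv6 address
    would show up as a bad entry in this list.
    """
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    return [(family.name, sockaddr[0]) for family, _, _, _, sockaddr in infos]
```

Comparing this output between an IPv4-only and a dual-stack client is a quick way to tell whether failures come from name resolution or from the service itself.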
Squid Monitoring and HTTP Proxy Discovery TFs
No news
Network and Transfer Metrics WG
perfSONAR status
Security: NDT 3.7.0.1 was released, fixing a potential security issue. This shouldn't affect WLCG sites that followed our instructions, since they should have NDT/NPAD disabled. We encourage ALL sites to double-check this and to ensure they have auto-updates enabled. The latest perfSONAR Toolkit version that all sites should be running is 3.4.2-12.pSPS (latest versions of all sub-components: Toolkit 3.4.2 (3.4.2-12.pSPS), BWCTL 1.5.4-1.el6, OWAMP 3.4-10.el6, NDT 3.7.0.1-2.el6, NPAD 1.5.6-3.el6, esmond 1.0-13.el6, Regular Testing Daemon 3.4.2-4.pSPS, iperf3 3.0.11-1.el6).
All meshes migrated from iperf to iperf3 and from traceroute to tracepath. This should improve our bandwidth measurements and enable MTU path discovery.
Very good progress in ramping up latency tests; with the current 34 sonars we are able to consistently get results for all tested links.
OSG/Datastore validation progressing well, resolved all performance issues and targeting July for production (progress already visible at http://psmad.grid.iu.edu/maddash-webui/).
Publishing of results to the message bus is progressing: development of the esmond2mq prototype has been finalized and we plan to enter the pilot phase. An initial version of the proximity service (mapping sonars to storages) is in testing.
Hassen Riahi (FTS dashboard) reported on FTS performance for WLCG during the first 3 months of production
Initial report on the FTS performance study presented by Saul Youssef (Boston University), common study for ATLAS, CMS and LHCb. Early results already provide valuable insights and also show how we could benefit from integrating FTS and perfSONAR. Agreed to follow up on a regular basis at the next meetings.
The network incident (degradation) between TRIUMF and RAL reported by ATLAS will be a test case for the procedure put in place by the Network and Transfer Metrics WG.
CMS instructions to shifters to be changed so that tickets are not opened if just one CE is red.
Maarten to follow up with the experiments to ensure the dates for removing the VOMS servers' aliases, as reported in the SHA-2 TF section above, are kept.