WLCG Operations Coordination Minutes, June 2nd 2016
Highlights
Agenda
Attendance
- local:
- remote:
- apologies: Alessandra Doria
Operations News
As the last Ops Coord meeting took place on April 28th, here are some of May's events:
New WLCG Accounting TF
Middleware News
- Baselines/News:
- SLC 6.8 / CentOS 6.8 is out (standard SL will soon follow)
- contains a fix for an openldap crash affecting TopBDII and ARC-CE resource BDII. Already tested at CERN and working fine. The production BDII at CERN will soon be updated.
- edg-mkgridmap 4.0.2 no longer works with this update. A new version (4.0.3), already available for CentOS 7, fixes the issue and is backward compatible. It has been pushed to the WLCG repo for SL6 and is already available in EPEL testing
- UMD 4 is out, initially with the UMD 3 packages for SL6 plus the first CentOS 7/Ubuntu packages. A new update (4.1.0) has already been released; it contains a new version of StoRM (1.11.11), which fixes a vulnerability in the WebDAV interface (http://repository.egi.eu/2016/05/27/storm-1-11-11/). To become the WLCG baseline.
- New version of Xrootd Atlas N2N plugin available on WLCG repo. It contains performance improvements related to the latest version of Xrootd.
- A first testing version of the UI for CentOS 7 has been published to CVMFS under grid.cern.ch (/cvmfs/grid.cern.ch/centos7-ui-test). The process to release it in UMD has started.
- Issues:
- Found a problem affecting the FTS SOAP interface (only used by CMS PhEDEx) where the Optimizer is not enabled. The problem was introduced in v3.3.0. The fix, released in FTS 3.4.5, has already been deployed at CERN; other sites will update starting next week. (A workaround to manually enable the Optimizer has already been applied at RAL.)
- glite-ce-job-submit to dual-stack CREAM CEs fails from CERN (GGUS:120586). The problem affects both LHCb and ATLAS.
- Several vulnerabilities have been discovered in recent months (not public yet). Sites have been contacted via EGI SVG broadcasts or individually.
- T0 and T1 services
- BNL
- CERN
- Xrootd on Castor nodes upgraded to 4.3.0
- FTS upgraded to v 3.4.5
- TopBDII upgrade to SLC6.8 planned
- possible Castor and EOS upgrades for ATLAS next week during the technical stop
- CNAF
- IN2P3
- dCache upgrade to 2.13.26 (on pools)
- global dCache upgrade to 2.13.30 for June downtime
- xrootd Alice T2 (disk only) upgrade to 4.2.3-3 for June downtime
- JINR
- dCache upgrade to 2.10.59; postgresql update to 9.4.8
- NDGF
- dCache upgraded to 2.16 pre release
- TRIUMF
- Plan to upgrade to dCache 2.13.30 on June 8th
Tier 0 News
- Overloads and saturations of the Meyrin-Wigner link have been investigated; some occurrences have been traced back to an experiment accessing several petabytes of very old data recorded in Meyrin before the Wigner capacity became available. These data may need to be rebalanced. The problem will be alleviated when the third 100 Gbps link becomes available; progress has been made on identifying the local loop in Budapest.
- Poor performance of CMS transfers to Wigner has been traced to the behaviour of CERN's switches; flow control will be added to the switches on 06-Jun-2016.
- A new configuration of the ALICE->CASTOR system was validated demonstrating stable running conditions. Work is going on to further increase the ability to empty the ALICE DAQ disks to catch up after potential transfer interruptions.
- A new version of ARGUS, supposedly addressing the long-standing stability issues, was found to have introduced two new bugs, and was not deployed. We are awaiting further fixes by the developer.
- The CREAM CEs have shown instabilities that we are trying to address temporarily by using larger hosts. At the same time, a number of potential options to resolve the issue are being investigated.
- A scale test with CMS for writing into EOS is being prepared in order to validate configuration changes intended to increase the transfer rates as needed by the activities of the heavy-ion community.
- The modified planning for the LHC technical stop in June has had an impact on the Tier-0 activity planning. It is now foreseen to deploy a number of security, functional and operating system changes in the DB services pending since the previous technical stop; subject to approval by ATLAS, a minor EOS upgrade (to 0.3.171) may be applied as well.
- The strategy for CPU resources for CMS (both Tier-0 and Tier-2 capacity at CERN) will be discussed further with the CMS computing management.
DB News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High activity on average
- SAM: the critical profile for A/R computations now includes selected MonALISA metrics
- see http://wlcg-sam-alice.cern.ch
- in particular the SE services are now taken into account
- the profile will be used as of May 1st (sic)
- but this first month we are flexible with corrections
- CERN: CREAM job submission incident on Sunday May 22 (GGUS:121699)
- None of the CEs worked
- Most had their sandbox partitions full
- The main usage was due to ALICE jobs, apologies!
- An ad-hoc cleanup was done
- New jobs will by default not make use of output sandboxes
- they are only needed for debugging
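The A/R (availability/reliability) figures that the new critical profile feeds into follow the usual WLCG convention: unknown time is excluded from the denominator, and scheduled downtime additionally counts in the site's favour for reliability. A minimal sketch of the arithmetic (an illustration of the standard definitions, not code from the SAM framework):

```python
# Availability/reliability from per-state durations (hours):
#   availability = time_OK / (time_total - time_unknown)
#   reliability  = time_OK / (time_total - time_unknown - time_scheduled_down)
def availability(ok, unknown, total):
    known = total - unknown
    return ok / known if known else 0.0

def reliability(ok, unknown, sched_down, total):
    known = total - unknown - sched_down
    return ok / known if known else 0.0

# Example month of 720 h: 650 h OK, 30 h unscheduled down,
# 20 h scheduled downtime, 20 h unknown.
a = availability(650, 20, 720)      # 650 / 700
r = reliability(650, 20, 20, 720)   # 650 / 680
```

Reliability is always at least as high as availability for the same period, since scheduled downtime is removed from its denominator.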
ATLAS
- Large scale MC15c digi+reco (MCORE) production campaign is done, ~5B events produced in ~3 months. MC16 simulation is being tested in the system now.
- Derivation production is also done on the reprocessed data and MC15c samples. Total volume of derivations exceeded the budget by ~50%, being discussed. New round of derivation production for data+MC15c will start soon with updated software.
- CERN resources not fully used due to ATLAS Pilot Factory/Cream problem (Condor's CREAM_DELEGATION error seen on some CREAM CEs), followed up at https://its.cern.ch/jira/browse/ASPDA-196, Condor team provided a patched binary for condor 8.4, will be deployed.
- Direct I/O instead of copy-to-scratch mode is being tested with HammerCloud and derivation jobs, the latter being the main customer.
CMS
- data taking resumed on Thursday 26th in the afternoon
- CMS StorageManager team discussing with CERN-IT to optimize transfers from P5 to CERN EOS in order to cope with the rate needed for HI run at the end of the year
- MC DigiReco production for ICHEP continuing
- T1/T2 quotas increased to match 2016 disk pledges
- Enabling multi-core pilots: 36800 T1, 81000 T2 slots enabled
- Will start to check IPV6 dual IP stack with several services
- Myproxy, VOMS, CVMFS, FTS, PhEDEx, Frontier, Xrootd, cmsweb
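A dual-stack check on services like those listed above typically starts by verifying that the service hostname resolves to both A and AAAA records. A minimal sketch using only the Python standard library (the helper names are illustrative, not from a CMS tool):

```python
import socket

def address_families(host):
    """Return the set of families ('IPv4'/'IPv6') that `host` resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return set()  # name does not resolve at all
    return {names[info[0]] for info in infos if info[0] in names}

def is_dual_stack(host):
    """True if the host publishes both IPv4 and IPv6 addresses."""
    return address_families(host) >= {"IPv4", "IPv6"}

# A literal IPv4 address resolves to IPv4 only:
# address_families("127.0.0.1") -> {"IPv4"}
```

Resolution alone does not prove the service listens on both stacks, so a full check would also attempt a connection over each family.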
- We look forward to a first CRIC trial setup/prototype for CMS
- On Saturday 30th, the TWiki, including the shifter instructions, was not reachable
LHCb
- Still validating workflows for 2016 data processing
- Once validation is finished, we do not expect strain on sites, as files are staged.
- We are however migrating RAW files to tape (i.e. quite some load on Castor and low pressure at Tier1s, typically 150 MB/s) and we shall migrate RDST files when produced.
- Generally low activity levels on the grid
- Infrastructure :
- Commissioning more dual-stack VO-boxes and moving LHCbDirac services to them
- T1 sites:
- RAL: two disk servers went down on Tuesday; they were back on Wednesday afternoon
Ongoing Task Forces and Working Groups
HTTP Deployment TF
- The last meeting on 23rd Mar (https://indico.cern.ch/event/509854) decided to close the Task Force.
- The TF monitoring has passed into experiment responsibility with the move to production of ETF.
- A summary of the TF's activities has been added to the twiki - https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment
- The documentation created by the TF, in particular relating to HTTP support by storage systems and access monitoring, is also available on the twiki.
Information System Evolution
- Evaluation of the capacity view with the REBUS and AGIS developers, to consider whether it can be dropped from REBUS
- Ongoing discussions with EGI and relevant experts (MW Officer, Security) on evaluating the impact of stopping BDII
- Testing deployment of CRIC prototype as well as playing with first CMS data on it. More details in CRIC Evaluation
- WLCG scope tags in GOCDB are in the process of being validated (33 out of 123 tickets still open). Thanks to the sites for their effort. More details in the VO Tags validation twiki. There are ongoing discussions on whether it makes sense to repeat this validation on a regular basis, and whether to compare with the VO feeds and maintain a one-to-one relationship between tags and resources in the VO feed.
- Next TF meeting scheduled for 16.06.2016
IPv6 Validation and Deployment TF
Next week's pre-GDB is devoted to IPv6, as is a two-hour slot in the GDB. The main topics to be discussed are:
- Experiment requirements
- Status of support for IPv6-only CPUs
- Experience on dual-stack services
- Monitoring and IPv6
- Security and IPv6
- Status of WLCG tiers and LHCOPN/LHCONE
Machine/Job Features TF
Working through the mandate's steps:
- Check the completeness of the proposal for machine/job features (The specification was published as HSF-TN-2016-02)
- Coordinate implementations used in WLCG and an interface for its usage to the VOs (MJF for PBS/Torque, HTCondor, and Grid Engine in production)
- Provide means to monitor the correctness of the provided information (Probe for ETF/SAM now on etf-lhcb-preprod)
- Plan and execute the deployment of those implementations at all WLCG resources
MJF deployments on batch systems are passing tests at Cambridge (Torque), GridKa (Grid Engine), Liverpool (HTCondor), Manchester (Torque), PIC (Torque), and RAL PPD (HTCondor). We are aware of other sites currently working on deployment (e.g. RAL Tier-1 and CERN). The UK intends to deploy at all sites.
All Vac/Vcycle VM-based sites have MJF available as part of the VM management system.
A status talk was given at the GDB in May, with more about the specification and implementations.
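Per the HSF-TN-2016-02 specification mentioned above, a job reads machine/job features from the locations pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables, with one file per key. A minimal reader sketch for the directory-based case (key names such as wall_limit_secs come from the specification; error handling is simplified for illustration):

```python
import os

def read_feature(env_var, key):
    """Read one machine/job feature, e.g.
    read_feature('JOBFEATURES', 'wall_limit_secs').
    Returns the file contents as a string, or None if unavailable."""
    base = os.environ.get(env_var)
    if base is None:
        return None  # site does not publish machine/job features
    try:
        with open(os.path.join(base, key)) as f:
            return f.read().strip()
    except OSError:
        return None  # this particular key is not provided

# Example: remaining wall-clock limit for this job, if published.
wall_limit = read_feature("JOBFEATURES", "wall_limit_secs")
```

The specification also allows HTTP(S) URLs as the base location, which a production client would need to handle as well.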
Middleware Readiness WG
Network and Transfer Metrics WG
- WLCG Network Throughput SU:
- ASGC connectivity: after numerous tests performed in collaboration with ASGC and ESnet (http://etf.cern.ch/perfsonar_asgc.txt), the root cause has been confirmed to be the local N7K router at ASGC. Once the perfSONARs were moved directly to the central router, the measured network performance improved by a factor of 10. Our recommendation is to re-wire all the existing data transfer nodes to bypass the local router as well, and to tune the central router and data transfer nodes to improve their performance for long-path transfers (200 ms+).
- Two new tickets received related to packet loss observed at RAL and SARA
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- North American Throughput meeting held on 1st of June:
- Shawn presented the OSG network area roadmap; the main focus will be on developing notification/alerting, supporting higher-level services (analytics) and preparing for SDN
- Next meeting is on 22nd June - main topic will be perfSONAR 4.0
- WLCG Throughput meeting will be held on 16th of June - main topic is re-organization of the meshes
- perfSONAR 4.0 (formerly 3.6) RC expected end of June
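Tuning data transfer nodes for 200 ms+ paths, as recommended for ASGC above, mostly comes down to sizing TCP buffers to the bandwidth-delay product of the path. A worked example (the link speed and RTT are illustrative, not measured ASGC values):

```python
# Bandwidth-delay product: the TCP window needed to keep a long path full.
def bdp_bytes(bandwidth_bps, rtt_s):
    return int(bandwidth_bps * rtt_s / 8)

# A 10 Gbit/s path at 200 ms RTT needs a ~250 MB window, far above typical
# default TCP buffer limits -- hence the buffer tuning on data transfer nodes.
window = bdp_bytes(10e9, 0.200)  # 250000000 bytes
```

If the kernel's maximum TCP buffer is smaller than this product, a single stream cannot fill the path regardless of the available bandwidth.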
RFC proxies
- EGI and OSG have been asked if RFC can be made the default proxy type
- new version of VOMS clients
- new version of YAIM core for myproxy-init
- both changes are trivial
- EGI will get back soon; OSG has not replied yet
- any remaining compatibility concerns for central services?
- pilot factories?
- the main concern for them was the use of old CREAM code in HTCondor
- that also happens to use MD5 in proxy delegations
- all factories should be OK now?
Squid Monitoring and HTTP Proxy Discovery TFs
IT WLCG Monitoring Consolidation
Action list
Specific actions for experiments
Specific actions for sites
AOB
--
MariaALANDESPRADILLO - 2016-06-01