WLCG Operations Coordination Minutes, June 2nd 2016
Highlights
Agenda
Attendance
- local:
- remote:
- apologies: Alessandra Doria
Operations News
As the last Ops Coord meeting took place on April 28th, here are some of May's events:
New WLCG Accounting TF
Middleware News
- Baselines/News:
- SLC 6.8 / CentOS 6.8 is out (standard SL will soon follow)
- contains a fix for an openldap crash affecting TopBDII and ARC-CE resource BDII. Already tested at CERN and working fine. The production BDII at CERN will soon be updated.
- edg-mkgridmap 4.0.2 no longer works with this update. A new version (4.0.3), already available for CentOS 7, fixes the issue and is backward compatible. It has been pushed to the WLCG repo for SL6 and is already available in EPEL testing
- UMD 4 is out, initially with the UMD 3 packages for SL6 plus the first CentOS 7/Ubuntu packages. A new update (4.1.0) has already been released; it contains a new version of StoRM (1.11.11), which fixes a vulnerability in the WebDAV interface (http://repository.egi.eu/2016/05/27/storm-1-11-11/). To become the WLCG baseline.
- New version of Xrootd Atlas N2N plugin available on WLCG repo. It contains performance improvements related to the latest version of Xrootd.
- A first testing version of the UI for CentOS 7 has been published to CVMFS under grid.cern.ch (/cvmfs/grid.cern.ch/centos7-ui-test). The process to release it in UMD has started.
- Issues:
- Found a problem affecting the FTS SOAP interface (only used by CMS PhEDEx) where the Optimizer is not enabled. The problem was introduced in v3.3.0. The fix, released in FTS 3.4.5, has already been deployed at CERN; other sites will update starting next week. (A workaround to manually enable the Optimizer has already been applied at RAL.)
- glite-ce-job-submit to dual-stack CREAM CEs fails from CERN (GGUS:120586). The problem affects both LHCb and ATLAS.
- Several vulnerabilities have been discovered in recent months (not public yet). Sites have been contacted via EGI SVG broadcasts or individually.
- T0 and T1 services
- BNL
- CERN
- Xrootd on Castor nodes upgraded to 4.3.0
- FTS upgraded to v 3.4.5
- TopBDII upgrade to SLC6.8 planned
- possible Castor and EOS upgrades for ATLAS next week during the technical stop
- CNAF
- IN2P3
- dCache upgrade to 2.13.26 (on pools)
- global dCache upgrade to 2.13.30 for June downtime
- xrootd Alice T2 (disk only) upgrade to 4.2.3-3 for June downtime
- JINR
- dCache upgrade to 2.10.59; postgresql update to 9.4.8
- NDGF
- dCache upgraded to 2.16 pre release
- TRIUMF
- Plan to upgrade to dCache 2.13.30 on June 8th
Tier 0 News
- Overloads and saturations of the Meyrin-Wigner link have been investigated; some occurrences have been traced back to an experiment accessing several petabytes of very old data recorded in Meyrin before the Wigner capacity became available. These data may need to be rebalanced. The problem will be alleviated when the third 100 Gbps link becomes available; progress has been made on identifying the local loop in Budapest.
- Poor performance of CMS transfers to Wigner has been traced to the behaviour of CERN's switches; flow control will be added to the switches on 06-Jun-2016.
- A new configuration of the ALICE->CASTOR system was validated demonstrating stable running conditions. Work is going on to further increase the ability to empty the ALICE DAQ disks to catch up after potential transfer interruptions.
- A new version of ARGUS, supposedly addressing the long-standing stability issues, was found to have introduced two new bugs, and was not deployed. We are awaiting further fixes by the developer.
- The CREAM CEs have shown instabilities that we are trying to address temporarily by using larger hosts. At the same time, a number of potential options to resolve the issue are being investigated.
- A scale test with CMS for writing into EOS is being prepared in order to validate configuration changes intended to increase the transfer rates as needed by the activities of the heavy-ion community.
- The modified planning for the LHC technical stop in June has had an impact on the Tier-0 activity planning. It is now foreseen to deploy a number of security, functional and operating system changes in the DB services pending since the previous technical stop; subject to approval by ATLAS, a minor EOS upgrade (to 0.3.171) may be applied as well.
- The strategy for CPU resources for CMS (both Tier-0 and Tier-2 capacity at CERN) will be discussed further with the CMS computing management.
DB News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High activity on average
- SAM: the critical profile for A/R computations now includes selected MonALISA metrics
- see http://wlcg-sam-alice.cern.ch
- in particular the SE services are now taken into account
- the profile will be used as of May 1st (sic)
- but this first month we are flexible with corrections
- CERN: CREAM job submission incident on Sunday May 22 (GGUS:121699)
- None of the CEs worked
- Most had their sandbox partitions full
- The main usage was due to ALICE jobs, apologies!
- An ad-hoc cleanup was done
- New jobs will by default not make use of output sandboxes
- they are only needed for debugging
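The A/R (availability/reliability) figures that the new critical profile feeds into follow the usual WLCG convention: unknown time is excluded from the denominator, and scheduled downtime additionally counts in the site's favour for reliability. A minimal sketch of the arithmetic (an illustration of the standard definitions, not code from the SAM framework):

```python
# Availability/reliability from per-state durations (hours):
#   availability = time_OK / (time_total - time_unknown)
#   reliability  = time_OK / (time_total - time_unknown - time_scheduled_down)
def availability(ok, unknown, total):
    known = total - unknown
    return ok / known if known else 0.0

def reliability(ok, unknown, sched_down, total):
    known = total - unknown - sched_down
    return ok / known if known else 0.0

# Example month of 720 h: 650 h OK, 30 h unscheduled down,
# 20 h scheduled downtime, 20 h unknown.
a = availability(650, 20, 720)      # 650 / 700
r = reliability(650, 20, 20, 720)   # 650 / 680
```

Reliability is always at least as high as availability for the same period, since scheduled downtime is removed from its denominator.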
ATLAS
- Large scale MC15c digi+reco (MCORE) production campaign is done, ~5B events produced in ~3 months. MC16 simulation is being tested in the system now.
- Derivation production is also done on the reprocessed data and MC15c samples. Total volume of derivations exceeded the budget by ~50%, being discussed. New round of derivation production for data+MC15c will start soon with updated software.
- CERN resources not fully used due to ATLAS Pilot Factory/Cream problem (Condor's CREAM_DELEGATION error seen on some CREAM CEs), followed up at https://its.cern.ch/jira/browse/ASPDA-196, Condor team provided a patched binary for condor 8.4, will be deployed.
- Direct I/O instead of copy-to-scratch mode is being tested with HammerCloud and derivation jobs, the latter being the main customer.
CMS
- data taking resumed on Thursday 26th in the afternoon
- CMS StorageManager team discussing with CERN-IT to optimize transfers from P5 to CERN EOS in order to cope with the rate needed for HI run at the end of the year
- MC DigiReco production for ICHEP continuing
- T1/T2 quotas increased to match 2016 disk pledges
- Enabling multi-core pilots: 36800 T1, 81000 T2 slots enabled
- Will start to check IPV6 dual IP stack with several services
- Myproxy, VOMS, CVMFS, FTS, PhEDEx, Frontier, Xrootd, cmsweb
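A dual-stack check on services like those listed above typically starts by verifying that the service hostname resolves to both A and AAAA records. A minimal sketch using only the Python standard library (the helper names are illustrative, not from a CMS tool):

```python
import socket

def address_families(host):
    """Return the set of families ('IPv4'/'IPv6') that `host` resolves to."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return set()  # name does not resolve at all
    return {names[info[0]] for info in infos if info[0] in names}

def is_dual_stack(host):
    """True if the host publishes both IPv4 and IPv6 addresses."""
    return address_families(host) >= {"IPv4", "IPv6"}

# A literal IPv4 address resolves to IPv4 only:
# address_families("127.0.0.1") -> {"IPv4"}
```

Resolution alone does not prove the service listens on both stacks, so a full check would also attempt a connection over each family.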
- We look forward to a first CRIC trial setup/prototype for CMS
- On Saturday 30th, the TWiki, including the shifter instructions, was not reachable
LHCb
- Still validating workflows for 2016 data processing
- Once validation is finished, we do not expect strain on sites, as files are staged.
- We are however migrating RAW files to tape (i.e. quite some load on Castor and low pressure at Tier1s, typically 150 MB/s) and we shall migrate RDST files when produced.
- Generally low activity levels on the grid
- Infrastructure :
- Commissioning more dual-stack VO-boxes and moving LHCbDirac services to them
- T1 sites:
- RAL: two disk servers went down on Tuesday; they were back on Wednesday afternoon
Ongoing Task Forces and Working Groups
HTTP Deployment TF
- The last meeting on 23rd Mar (https://indico.cern.ch/event/509854) decided to close the Task Force.
- The TF monitoring has passed into experiment responsibility with the move to production of ETF.
- A summary of the TF's activities has been added to the twiki - https://twiki.cern.ch/twiki/bin/view/LCG/HTTPDeployment
- The documentation created by the TF, in particular relating to HTTP support by storage systems and access monitoring, is also available on the twiki.
Information System Evolution
- Evaluation of the capacity view with the REBUS and AGIS developers, to consider whether it can be dropped from REBUS
- Ongoing discussions with EGI and relevant experts (MW Officer, Security) on evaluating the impact of stopping BDII
- Testing deployment of CRIC prototype as well as playing with first CMS data on it. More details in CRIC Evaluation
- WLCG scope tags in GOCDB are in the process of being validated (33 out of 123 tickets still open). Thanks to the sites for their effort. More details in the VO Tags validation twiki. There are ongoing discussions on whether it makes sense to repeat this validation on a regular basis, and whether to compare with the VO feeds and maintain a one-to-one relationship between tags and resources in the VO feed.
- Next TF meeting scheduled for 16.06.2016
IPv6 Validation and Deployment TF
Next week's pre-GDB is devoted to IPv6, as is a two-hour slot in the GDB. The main topics to be discussed are:
- Experiment requirements
- Status of support for IPv6-only CPUs
- Experience on dual-stack services
- Monitoring and IPv6
- Security and IPv6
- Status of WLCG tiers and LHCOPN/LHCONE
Machine/Job Features TF
Working through the mandate's steps:
- Check the completeness of the proposal for machine/job features (The specification was published as HSF-TN-2016-02)
- Coordinate implementations used in WLCG and an interface for its usage to the VOs (MJF for PBS/Torque, HTCondor, and Grid Engine in production)
- Provide means to monitor the correctness of the provided information (Probe for ETF/SAM now on etf-lhcb-preprod)
- Plan and execute the deployment of those implementations at all WLCG resources
MJF deployments on batch systems are passing tests at Cambridge (Torque), GridKa (Grid Engine), Liverpool (HTCondor), Manchester (Torque), PIC (Torque), and RAL PPD (HTCondor). We are aware of other sites currently working on deployment (e.g. RAL Tier-1 and CERN). The UK intends to deploy at all sites.
All Vac/Vcycle VM-based sites have MJF available as part of the VM management system.
A status talk was given at the GDB in May, with more about the specification and implementations.
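Per the HSF-TN-2016-02 specification mentioned above, a job reads machine/job features from the locations pointed to by the $MACHINEFEATURES and $JOBFEATURES environment variables, with one file per key. A minimal reader sketch for the directory-based case (key names such as wall_limit_secs come from the specification; error handling is simplified for illustration):

```python
import os

def read_feature(env_var, key):
    """Read one machine/job feature, e.g.
    read_feature('JOBFEATURES', 'wall_limit_secs').
    Returns the file contents as a string, or None if unavailable."""
    base = os.environ.get(env_var)
    if base is None:
        return None  # site does not publish machine/job features
    try:
        with open(os.path.join(base, key)) as f:
            return f.read().strip()
    except OSError:
        return None  # this particular key is not provided

# Example: remaining wall-clock limit for this job, if published.
wall_limit = read_feature("JOBFEATURES", "wall_limit_secs")
```

The specification also allows HTTP(S) URLs as the base location, which a production client would need to handle as well.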
Middleware Readiness WG
Network and Transfer Metrics WG
- WLCG Network Throughput SU:
- ASGC connectivity: after numerous tests performed in collaboration with ASGC and ESnet (http://etf.cern.ch/perfsonar_asgc.txt), the root cause has been confirmed to be the local N7K router at ASGC. Once the perfSONARs were moved directly to the central router, the measured network performance improved by a factor of 10. Our recommendation is to re-wire all the existing data transfer nodes to bypass the local router as well, and to tune the central router and data transfer nodes to improve their performance for long-path transfers (200 ms+).
- Two new tickets received related to packet loss observed at RAL and SARA
- Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
- North American Throughput meeting held on 1st of June:
- Shawn presented the OSG network area roadmap; the main focus will be on developing notification/alerting, supporting higher-level services (analytics) and preparing for SDN
- Next meeting is on 22nd June - main topic will be perfSONAR 4.0
- WLCG Throughput meeting will be held on 16th of June - main topic is re-organization of the meshes
- perfSONAR 4.0 (formerly 3.6) RC expected end of June
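Tuning data transfer nodes for 200 ms+ paths, as recommended for ASGC above, mostly comes down to sizing TCP buffers to the bandwidth-delay product of the path. A worked example (the link speed and RTT are illustrative, not measured ASGC values):

```python
# Bandwidth-delay product: the TCP window needed to keep a long path full.
def bdp_bytes(bandwidth_bps, rtt_s):
    return int(bandwidth_bps * rtt_s / 8)

# A 10 Gbit/s path at 200 ms RTT needs a ~250 MB window, far above typical
# default TCP buffer limits -- hence the buffer tuning on data transfer nodes.
window = bdp_bytes(10e9, 0.200)  # 250000000 bytes
```

If the kernel's maximum TCP buffer is smaller than this product, a single stream cannot fill the path regardless of the available bandwidth.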
RFC proxies
- EGI and OSG have been asked if RFC can be made the default proxy type
- new version of VOMS clients
- new version of YAIM core for myproxy-init
- both changes are trivial
- EGI will get back soon; OSG has not replied yet
- any remaining compatibility concerns for central services?
- pilot factories?
- the main concern for them was the use of old CREAM code in HTCondor
- that also happens to use MD5 in proxy delegations
- all factories should be OK now?
Squid Monitoring and HTTP Proxy Discovery TFs
IT WLCG Monitoring Consolidation
Action list
Specific actions for experiments
Specific actions for sites
AOB
--
MariaALANDESPRADILLO - 2016-06-01