WLCG Operations Coordination Minutes, May 18th 2017

Highlights

Agenda

Attendance

  • local: Andrea M (MW Officer + data mgmt), Edoardo (networks), Gavin (T0), Giuseppe (CMS), Julia (WLCG), Maarten (WLCG + ALICE), Marcelo (LHCb), Marian (networks)
  • remote: Alessandra D (Napoli), Alessandra F (WLCG + Manchester + ATLAS), Andrea S (WLCG), Christoph (CMS), Di (TRIUMF), Felix (ASGC), Jeremy (GridPP), Max (KIT), Nurcan (ATLAS), Renaud (IN2P3), Simon (TRIUMF), Thomas (DESY), Victor (JINR), Vladimir (CNAF)

  • apologies:

Operations News

  • WLCG workshop registration deadline has been extended to the 31st of May.
    Workshop participants should nevertheless register ASAP
    • Main workshop: Mon early afternoon - Wednesday
    • IPv6 hands on session on Thursday morning
    • Optional visit to Jodrell Bank Observatory Thursday afternoon (max 30 people, first-come, first-served)

  • the next meeting will be held Thu July 6

Report from the network group on the MTU negotiation problem between CERN routable IPs and T1s

The Path MTU Discovery (PMTUD) protocol is not working with remote locations because of: a) strict filtering of ICMP packets on the CERN firewall; b) the use of private addresses on the internal links; c) the use of jumbo frames everywhere on the CERN interconnecting links, except on user services.

The problem with CNAF arose because CNAF has jumbo-enabled servers. Because of c), the jumbo packets from CNAF reached the datacentre router facing RAC52, which discarded them as too big to be delivered to RAC52; at the same time the router sent back an "ICMP fragmentation needed" packet towards CNAF; because of a) and b), that ICMP packet never reached CNAF, making PMTUD fail.

As a temporary workaround, the link facing the CERN firewall has been changed to the normal MTU; in this way an external router now sends back "ICMP fragmentation needed" packets, which makes PMTUD work.

As a long-term solution we need to:

  1. configure public addresses on the links interconnecting the datacentre routers; this task is ongoing
  2. allow "ICMP fragmentation needed" packets through the CERN firewall; this is done
  3. once 1 and 2 are completed, change the external links' MTU back to 9000

(A sketch for probing the path MTU towards a remote host follows the discussion below.)

  • Julia: timeline for jumbo frames to be allowed externally?
  • Edoardo: it will take 1 or 2 years for old routers to be replaced
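
As a reference for site admins, here is a minimal probe sketch of the mechanism described above (Linux-only, Python 3; the target hostname is hypothetical, and the constants are taken from <linux/in.h> since the socket module does not always expose them). It asks the kernel to perform PMTUD and reports the resulting path-MTU estimate:

```python
import socket

# Linux constants from <linux/in.h>; not always exposed by the socket module.
IP_MTU_DISCOVER = getattr(socket, 'IP_MTU_DISCOVER', 10)
IP_PMTUDISC_DO = getattr(socket, 'IP_PMTUDISC_DO', 2)
IP_MTU = getattr(socket, 'IP_MTU', 14)

def probe_path_mtu(host, port=9000):
    """Return the kernel's current path-MTU estimate towards host."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Set the DF bit and let the kernel do PMTU discovery.
        s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        s.connect((host, port))
        try:
            # A jumbo-sized datagram: if the path MTU is smaller, the kernel
            # (fed by "ICMP fragmentation needed") rejects it with EMSGSIZE.
            s.send(b'\x00' * 8972)  # 9000 bytes minus IP and UDP headers
        except OSError:
            pass  # expected on a path with MTU < 9000
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()

print(probe_path_mtu('transfer.example-t1.org'))  # hypothetical endpoint
```

Note that this is exactly what breaks when "ICMP fragmentation needed" packets are filtered: the kernel never learns the smaller MTU, so the probe keeps reporting the local link MTU while oversized packets are silently dropped.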

Middleware News

  • Useful Links:
  • Baselines/News:
  • Issues:
    • VOMS host certificates were renewed at CERN 5 days before their expiration. Some long-lasting VOMS proxies created before the day of the update started to be refused by Grid services; in particular, some of the CMS proxies delegated to FTS were affected (a new VOMS proxy and delegation to FTS were needed). This issue comes from a VOMS bug that is not yet fixed (GGUS:120463)
    • RHEL/SL 6.9 openssl update fallout: openssl 1.0.1e-57 by default prohibits TLS from being used with DH keys smaller than 1024 bits. Java-based services will fail openssl client connections if their version of Java is too old or if their disabled-algorithms configuration is incorrect. Java-based services need to run a sufficiently recent version of Java to avoid such problems; the latest 1.7 and 1.8 releases are OK (a probe sketch follows the site reports below)
    • ATLAS currently, and CMS a few days ago, have been affected by EOS being overloaded by GSI authentication. This problem comes from a bug in the XRootD GSI plugin; the fix is almost ready and will be deployed soon.
    • Heads-up: escalation-of-privilege vulnerability in Intel® Active Management Technology (AMT), Intel® Standard Manageability (ISM) and Intel® Small Business Technology, broadcast by EGI SVG: https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2017-5689
  • T0 and T1 services
    • BNL
      • FTS upgraded to v3.6.8
    • CERN
      • check T0 report
      • FTS upgraded to v3.6.8
      • EOS for LHCb updated to the Citrine version (an issue with the checksum string returned by XRootD was discovered after the update and immediately fixed)
    • IN2P3
      • Replacement of old ALICE XRootD servers ongoing.
      • Migration to dCache 2.16 planned for the June downtime
    • JINR
      • Major dCache upgrade 2.13.51 -> 2.16.31 and PostgreSQL upgrade 9.4.11 -> 9.6.2 on the tape instance for CMS
    • RAL
      • Castor nameserver updated to 2.1.16-13. All data now on T10KD drives/media.
      • All Castor instances will be updated to version 2.1.16-13 over the next few weeks.
      • FTS for ATLAS updated to v3.6.8
    • TRIUMF:
      • Will upgrade production dCache to 2.16 soon
      • 2.16 is under MW Readiness verification, with IPv4/IPv6 dual stack
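
Regarding the openssl DH fallout above, the following hedged probe sketch (Python 3; the hostname is illustrative) forces an ephemeral Diffie-Hellman handshake, so that a service still using a DH key shorter than 1024 bits should fail with an error such as "dh key too small" when the client side is built against the updated openssl:

```python
import socket
import ssl

def check_dh(host, port=443):
    """Attempt an ephemeral-DH-only TLS handshake and report the outcome."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # we only care about the key exchange
    ctx.set_ciphers('EDH:!aNULL')    # EDH = ephemeral Diffie-Hellman suites
    try:
        with socket.create_connection((host, port), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                print(host, 'OK:', tls.cipher())
    except ssl.SSLError as exc:
        # "dh key too small" points at the issue above; a generic handshake
        # failure may just mean the server offers no DHE cipher suites.
        print(host, 'handshake failed:', exc)

check_dh('voms.example.org')  # hypothetical Java-based service
```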

Discussion

  • Renaud: UI/WN readiness for CentOS/EL7?
  • Andrea M:
    • the meta-packages exist in preview repositories
    • the CREAM client has been tested by TRIUMF
    • the WN still needs an HTCondor package clash to be resolved
  • Maarten: the official UMD update has been delayed from May to June

  • Alessandra F: when might we make 1.9 the baseline for DPM?
  • Andrea M: investigating an issue at one site that we want to resolve first
  • Alessandra F:
    • the motivation comes from ATLAS wanting to use a JSON file for storage accounting
    • we do not want every site to invent their own scripts for that
    • the DPM devs can come up with a common solution, but only for version 1.9 and later (a hypothetical sketch of such a JSON record follows)
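
For illustration only, since the actual reporting format still has to be agreed (see the Accounting TF report below), a common DPM tool might publish a per-space-token JSON record along the following lines; all field names here are hypothetical, not an agreed WLCG format:

```python
import json
import time

# Hypothetical storage accounting record; the field names are illustrative.
record = {
    "storageservice": {
        "name": "dpm.example.ac.uk",        # hypothetical DPM endpoint
        "implementation": "DPM",
        "implementationversion": "1.9.0",
        "latestupdate": int(time.time()),   # Unix timestamp of this snapshot
        "storageshares": [
            {
                "name": "ATLASDATADISK",    # space token
                "totalsize": 500 * 10**12,  # bytes
                "usedsize": 420 * 10**12,
                "vos": ["atlas"],
            },
        ],
    },
}
print(json.dumps(record, indent=2))
```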

Tier 0 News

  • Capacity: LSF 550 kHSpec, HTCondor 640 kHSpec, LSF ATLAS T0 200 kHSpec; 340 kHSpec that just arrived will be added to HTCondor.
  • Creating new HTCondor CEs with the aim of moving the remaining Grid capacity to HTCondor ~soon.
  • FTS upgraded to 3.6.
  • Castor / EOS new capacity being added during May.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Activity levels have typically been very high
    • The average was 101k running jobs
  • CERN
    • Some issues with the HTCondor and CREAM services
    • EOS: disk capacity was added to keep up with the high activity, thanks!
    • CASTOR: disk pools were reconfigured for data taking
      • Allowing last parts of 2016 reco to be finished in parallel, thanks!
  • A successful T1-T2 workshop was held in Strasbourg on May 3-5

ATLAS

  • Distributed production is running fine with ~300k running job slots; the reprocessing of data15+data16 has finished; MC16 production is running at full speed; derivation production on the reprocessed data has started with validation runs.
  • Experiencing EOS overload since Friday, when RAW-data skimming jobs were launched for heavy ions; the number of jobs is currently throttled in the system.
  • Tier-0 is doing well with all recent data taking, from cosmics to splashes to first collisions. An extension of capacity to the 2017 pledges has been requested and the details agreed with IT; in progress.

CMS

  • CMS being prepared for LHC beam operation
    • T0: running new P5-T0 transfer system since April 26th, no known issues
    • Main production and processing activities
      • Still at moderate scale, more will come very soon
      • RE-RECO of 2016 data: pilot requests are being processed
      • Phase I MC requests being processed
  • Overloaded EOS with too many I/O-intensive jobs
    • A known limitation in the GSI authentication capacity
    • HLT nodes are now authenticated by IP (not contributing to the GSI load)
    • Preventing too many high-I/O jobs on CERN resources
  • New input datasets for HammerCloud and SAM being distributed to all sites
  • Preparing for CentOS 7 (with Singularity)
  • IPv6 storage checks started with a subset of sites
  • Concluded a round of tape staging tests with the T1 sites
  • Metadata issue at CERN EOS - GGUS:127322
    • Files seem to 'become' corrupt: their contents are not what the path says they should be

LHCb

  • Activity levels very high (~100k running jobs)
    • The HLT farm stopped processing MC simulation and is being prepared for beam operation

  • CERN
    • EOS: download failures - fixed with a new version

Ongoing Task Forces and Working Groups

Accounting TF

The April meeting was dedicated to storage space accounting. Dimitrios Christidis has started to implement the data flow, using the ATLAS storage accounting data for the time being. In parallel, we are working with the WLCG Data Management Steering group to agree on the storage reporting and storage topology description in CRIC.

Information System Evolution TF


  • Julia: we are moving forward with the implementation of CRIC

IPv6 Validation and Deployment TF


  • CERN and almost all Tier-1 sites have IPv6 and at least a fraction
    of the storage in dual stack, or will in a matter of weeks
  • No significant middleware issues
    • A problem reported by the DPM team about the GridFTP redirection in Globus has been fixed (GGUS:127285)
  • Organization of the IPv6 tutorial during the Manchester workshop
    • Note: the introduction is Wed late afternoon
    • ALL site admins (and not only) are encouraged to participate (even remotely)
    • Will cover the basics of deploying IPv6 in the network infrastructure
      and dual-stack Grid services (squid, perfSONAR, storage)
    • Hands-on exercises foreseen

Machine/Job Features TF

Machine/Job Features (MJF) is a means to optimize the interaction between a resource provider (batch system, IaaS) and the payload (jobs) by providing more detailed information about the batch system to the job, and about the job to the batch system. This information can be static (e.g. power of the machine, number of cores, local scratch space) or dynamic (e.g. shutdown time of a VM). It is comprised of a set of text files and Python scripts, and can be considered an "add-on" to the scheduler. https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeatures
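
A minimal sketch of how a payload could consume these values (Python 3): each feature is a small text file under the directory pointed to by $MACHINEFEATURES or $JOBFEATURES; the file names below follow the MJF specification linked above, but treat them as illustrative:

```python
import os

def read_feature(base_env, key, default=None):
    """Read one feature value: $<base_env>/<key> is a small text file."""
    base = os.environ.get(base_env)
    if base is None:
        return default
    try:
        with open(os.path.join(base, key)) as f:
            return f.read().strip()
    except OSError:
        return default

# Static information about the machine, dynamic information about the job/VM.
hs06 = read_feature('MACHINEFEATURES', 'hs06')              # machine power
wall = read_feature('JOBFEATURES', 'wall_limit_secs')       # job wall-clock limit
shutdown = read_feature('MACHINEFEATURES', 'shutdowntime')  # VM shutdown time, if any
print(hs06, wall, shutdown)
```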

This mechanism originates from LHCb (primary expert is Andrew.Mcnab@cern.ch), where it has been used successfully at the T2 level for some time already: https://etf-lhcb-prod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fview_name%3Dallservices%26service_regex%3Dorg.lhcb.WN-mjf-%2Flhcb%2FRole%253dproduction%26site%3D

It is suggested to expand the use of MJF to all experiments at the T2 level, and also to the T1 level. Installation is quite easy: add the repository https://repo.gridpp.ac.uk/machinejobfeatures/mjf-scripts/ and run yum to install the variant of MJF appropriate for the scheduler used. Support requires practically no effort.

The MJF e-group is available to subscribe to: https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=wlcg-ops-coord-tf-machinejobfeatures

  • Julia:
    • last time we agreed to pursue the deployment first at LHCb sites
    • the other experiments are not yet interested
    • that may change with the outcome of the Benchmarking WG w.r.t. the fast benchmark
    • the current deployment campaign is coordinated by Andrew and Victor

Monitoring

MW Readiness WG


This is the status of JIRA ticket updates since the last Ops Coord meeting of 2017-04-06:

  • MWREADY-146 dCache 2.16.34 verification for ATLAS @ TRIUMF, with IPv6 as well - ongoing
  • MWREADY-128 - A new version of the UI bundle has been released to EGI preview with the new CREAM-CLI for C7. Tested successfully at TRIUMF
  • MWREADY-145 - Dependency clash between the WN bundle and the latest HTCondor (classads vs condor-classads). We will most probably remove the LB libs to solve this issue.
  • MWREADY-9 - /cvmfs/grid.cern.ch/Grid is now mirroring the AFS WLCG Grid Applications area. Requested by LHCb

Network and Transfer Metrics WG


  • perfSONAR 4.0 was released on the 17th of April
    • 180 sites have updated so far
    • Some sites reported issues with load after updating, under investigation
  • WLCG/OSG network services
    • The new central mesh configuration interface (MCA) will be deployed to production next week; the transition will be transparent to all sites
      • MCA was developed by OSG and is becoming part of perfSONAR.
    • Monitoring based on ETF is planned to be deployed in ITB
    • OSG collector will be updated to handle multiple backends (datastore, two message buses)
  • LHCOPN grafana dashboards established in collaboration with CERN IT/CS and MONIT team (access restricted to CERN users, public access in the works)
  • Next Throughput call will be on Wed May 24th at 4pm CEST (https://indico.cern.ch/event/640627/)

Squid Monitoring and HTTP Proxy Discovery TFs

  • http://grid-wpad/wpad.dat at CERN is now fully IPv6 compliant. However, both the Frontier and CVMFS clients prefer IPv4 on dual-stack machines, so IPv6 is not yet being exercised. CMS has been asked to change their CERN Frontier client configuration to prefer IPv6 (see the sketch below).
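
A minimal sketch (Python 3) of what "preferring IPv6" amounts to: resolve only AAAA records for grid-wpad and fetch the PAC file over that address, so the v6 path actually gets exercised; apart from the grid-wpad host name, everything below is illustrative:

```python
import http.client
import socket

def fetch_wpad_over_ipv6(host='grid-wpad', path='/wpad.dat'):
    """Resolve AAAA records only, so the fetch cannot fall back to IPv4."""
    infos = socket.getaddrinfo(host, 80, socket.AF_INET6, socket.SOCK_STREAM)
    addr = infos[0][4][0]
    conn = http.client.HTTPConnection(addr, 80, timeout=10)
    try:
        # Keep the original Host header so any virtual hosting still works.
        conn.request('GET', path, headers={'Host': host})
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()

status, pac = fetch_wpad_over_ipv6()
print(status, pac[:200])
```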

Traceability and Isolation WG

Last meeting on May 15th (see https://indico.cern.ch/event/634743/note/):

  • Traceability Challenge ran: all VOs participated
    • Asking VOs to identify a job from hostname + timestamps
    • Issues in the communication channels, being addressed now
    • Planning to run another one in late Autumn
  • Singularity: "SingularityWare, LLC" created by main developer, consequences unknown

Issues with ARC-CEs patching

see the presentation

  • Maarten: we should ensure there is an ARC deployment discussion forum
    • interested WLCG site admins should be able to join
    • the developers ought to take note of the discussions
    • such forums are working fine for dCache, DPM, FTS, ...
  • Max: there exists an ARC mailing list, but not so many WLCG sites are present
    and the developers do not realize the importance or urgency of certain issues
  • Julia: there may be similar concerns for other MW
  • Andrea M: are such problems mentioned elsewhere?
  • Maarten: yes, but not in a consistent way; furthermore, the developers may
    disagree with an RFE by a WLCG site, when nobody in NorduGrid asked for it
  • Julia: we need to ensure the flow of information
  • Andrea M: in the MW Readiness WG ARC is tested, but currently only by CMS and
    there are no stress tests either
    • ergo: some of the reported issues would not have been found
  • Alessandra F: why only CMS?
  • Andrea M: the efforts are voluntary, we neither can force sites nor experiments
  • Alessandra F: we need to avoid repetition of the gfal-utils saga
  • Julia: there are several issues
    • we need to try and get the right things tested
    • we need to ensure information is made available
    • let's follow up in the next meeting
    • followup on the ARC forum is an action on Ops Coordination

Theme: Providing reliable storage - TRIUMF

see the presentation

  • Julia: how did TRIUMF compare to other sites in the ATLAS tape performance exercise?
  • Simon: from our perspective the performance was OK;
    note that our volume is smaller than what various other T1s have

  • Maarten: is tapeguy ATLAS-specific?
  • Simon:
    • it could be generalized; furthermore, its interface is not tied to dCache
    • as there were not so many options in 2006, we started and kept our own development

  • Vladimir: did you try disabling the power management that you suspect?
  • Simon: we tried different things; there is no clear pattern, the freeze occurs once per several months
  • Vladimir: do you have PowerEdge or PowerVault servers?
  • Simon: PowerEdge; the servers access the storage through a SAN

  • Vladimir: isn't the forced queuing time of 1h too much?
  • Simon: no complaints so far
  • Vladimir: so it is acceptable to ATLAS?
  • Simon: we want to increase the number of files per mount, for increased performance;
    requests should come in bulk (an illustrative sketch of such a queuing policy follows this discussion)
  • Vladimir: CMS have also been seen to recall a few files at a time,
    accumulating however to tens of thousands per day;
    is it acceptable to implement a wait time?
  • Maarten: as a site admin you have the right to protect your resources;
    together with the affected experiments some compromise could be agreed
  • Julia: you might even tune the wait time until an experiment complains
  • Vladimir: could we have all T1 "impose" the same wait time?
  • Julia: let's see a few more such presentations and then draw our conclusions
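
For illustration (this is not TRIUMF's tapeguy code), the queuing policy discussed above can be sketched as follows: hold recall requests per tape until either a maximum wait time elapses or enough files have accumulated, so that each mount serves files in bulk. The 1h wait is the value mentioned above; the file threshold is hypothetical:

```python
import time
from collections import defaultdict

MAX_WAIT_SECS = 3600       # the forced 1h queuing time discussed above
MIN_FILES_PER_MOUNT = 500  # hypothetical bulk threshold

class RecallQueue:
    """Accumulate tape recall requests and decide when a mount is worthwhile."""

    def __init__(self):
        self.pending = defaultdict(list)  # tape id -> [(path, arrival time)]

    def add(self, tape, path):
        self.pending[tape].append((path, time.time()))

    def tapes_to_mount(self):
        """Tapes whose queue is large enough, or whose oldest request timed out."""
        now = time.time()
        return [tape for tape, reqs in self.pending.items()
                if len(reqs) >= MIN_FILES_PER_MOUNT
                or now - min(t for _, t in reqs) >= MAX_WAIT_SECS]

    def drain(self, tape):
        """Hand back and clear the queued requests for a tape about to be mounted."""
        return [path for path, _ in self.pending.pop(tape, [])]
```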

Action list

Creation date | Description | Responsible | Status | Comments
01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites e.g. TRIUMF are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress.
  Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting.
  March 2 update: the EMI WN and UI meta-packages are planned for UMD 4.5, to be released in May.
  May 18 update: UMD 4.5 has been delayed to June.
03 Nov 2016 | Review the VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI
26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending | May 18 update: EGI collected feedback from sites and proposes a compromise - 3 days' notice for any scheduled downtime
18 May 2017 | Follow up on the ARC forum for WLCG site admins | WLCG Operations | Pending |
18 May 2017 | Prepare discussion on the strategy for handling middleware patches | Andrea Manzi and WLCG Operations | Pending |

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion

AOB
