WLCG Operations Coordination Minutes, Nov 2nd 2017

Highlights

Agenda

Attendance

  • local: Alessandro Di G (ATLAS), Andrea M (MW Officer + data management), Gavin (T0), Julia (WLCG), Liviu (security), Maarten (WLCG + ALICE), Panos (WLCG)
  • remote: Alessandra D (Napoli), Alessandra F (Manchester + ATLAS), Alessandro C (CNAF), Andrea S (WLCG), Antonio (CNAF), Catherine (LPSC + IN2P3), Christoph (CMS), David Crooks (Glasgow), David M (FNAL), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Frederique (LAPP), Gen (GoeGrid), Giuseppe (CMS), Hiro (BNL), Javier (IFIC), Pepe (PIC), Ron (NLT1), Stefano (CNAF), Stephan (CMS), Ulf (NDGF), Vincenzo (EGI), Vladimir R (LHCb), Vladimir S (CNAF), Xin (BNL)

  • apologies: Marian (networks)

Operations News

  • The next meeting is planned for Dec 7
    • Please let us know if that date would present a major issue

Middleware News

  • Useful Links:
  • Baselines/News
  • Issues:
    • CRITICAL risk vulnerability concerning SLURM ( https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2017-15566 ). A vulnerability in the SLURM SPANK plugin mechanism allows privilege escalation to root via prolog/epilog scripts, regardless of whether any SPANK plugin is actually used. All SLURM installations that use prolog/epilog scripts are vulnerable.
    • High risk Linux kernel local root vulnerability broadcast by EGI SVG (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2017-7184), rated up to CRITICAL for some configurations:
      • Sites running any Linux version derived from RedHat 7.4 that set the kernel boot parameter enabling unprivileged user namespaces will be impacted (a local check for both advisories is sketched after this section)
    • CMS reported a problem with FTS@CERN affecting transfers from dual-stack SEs to the MIT SE. The problem has been fixed in the Globus libraries (a new version is available in EPEL testing), but as a workaround we have forced IPv4 transfers on the affected links for the time being (a client-side illustration follows this section)
    • FTS Optimizer performance issue @CERN: given the high number of links managed by the CERN FTS, the optimizer is not updating the links as fast as it should. We tried to fix the problem in v3.7.6, but unfortunately it is still there. As a workaround we have asked ATLAS and CMS to move some of their links to the other FTS instances, which are much less loaded.
  • T0 and T1 service
    • CERN
      • check T0 report
      • FTS upgraded to v 3.7.6
    • JINR
      • Minor upgrade: dCache 2.16.46->2.16.51; Postgres 9.6.4->9.6.5
    • KIT:
      • Updated dCache for CMS to 2.16 on October 11th and for LHCb on October 25th.
    • PIC:
      • dCache major upgrade to 3.1.15
      • FTS upgrade to 3.6.8 (non WLCG)
    • RAL:
      • FTS upgrade to 3.7.4 (both instances)
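
The two security advisories above lend themselves to a quick local check. Below is a minimal sketch, assuming a standard SLURM installation and a RHEL-derived kernel; it only flags the conditions named in the advisories (prolog/epilog scripts configured, unprivileged user namespaces enabled at boot) and is not a vulnerability scanner.

```python
#!/usr/bin/env python
# Minimal local exposure check for the two advisories above (a sketch, not a
# vulnerability scanner).
import re
import subprocess

def slurm_prolog_epilog_settings():
    """SVG-CVE-2017-15566: installations using prolog/epilog scripts are vulnerable."""
    out = subprocess.check_output(["scontrol", "show", "config"]).decode()
    # Keep Prolog*/Epilog* settings that are actually set to a script path.
    return [line.strip() for line in out.splitlines()
            if re.match(r"\s*\w*(Prolog|Epilog)\w*\s*=", line)
            and not line.rstrip().endswith("(null)")]

def unpriv_userns_enabled():
    """SVG-2017-7184: RHEL 7.4 derivatives booted with unprivileged user namespaces."""
    with open("/proc/cmdline") as f:
        return "user_namespace.enable=1" in f.read()

if __name__ == "__main__":
    try:
        hits = slurm_prolog_epilog_settings()
    except (OSError, subprocess.CalledProcessError):
        hits = []
        print("scontrol not usable here - no local SLURM config to check")
    for line in hits:
        print("SLURM script configured (see SVG-CVE-2017-15566):", line)
    if unpriv_userns_enabled():
        print("unprivileged user namespaces enabled (see SVG-2017-7184)")
```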
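
On the FTS side, the IPv4 workaround mentioned above is applied in the server configuration. As an illustration only, a client can also pin an individual transfer to IPv4 when submitting through the fts-rest client; the endpoint and URLs below are placeholders, and the availability of the --ipv4 flag in the deployed client version is an assumption.

```python
#!/usr/bin/env python
# Sketch: pin a single FTS transfer to IPv4 from the client side, as a per-job
# analogue of the server-side workaround described above. Endpoint and URLs
# are placeholders; the --ipv4 flag is assumed present in the installed
# fts-rest client.
import subprocess

FTS_ENDPOINT = "https://fts3.cern.ch:8446"           # placeholder instance
SRC = "gsiftp://source.example.org/path/file"        # placeholder source SE
DST = "gsiftp://dest.example.org/path/file"          # placeholder destination SE

job_id = subprocess.check_output(
    ["fts-rest-transfer-submit", "-s", FTS_ENDPOINT, "--ipv4", SRC, DST]
).decode().strip()
print("submitted job", job_id)
```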

Discussion

Tier 0 News

  • CREAM/LSF retired for Grid - all Grid work should now be going via HTCondor(CE)
  • HTCondor now at 130k cores

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • The activity levels have been normal to high on average
  • CERN: job failures due to CASTOR overload
    • Partially due to the reduction of gateways to the Ceph pool
      • In addition there is a suspected server-side bug,
        which causes connections to be kept open longer than necessary
    • The CASTOR team will add disk servers borrowed from EOS-ALICE - thanks!
    • That should see ALICE through the rest of this year's run
    • Bigger changes are planned for next year's run
  • CERN: multiple EOS-ALICE crashes in the past 2 weeks
    • A bug appears to be triggered by high load from clients
    • A mitigation was applied while the matter remains under investigation - thanks!

ATLAS

  • Data processing at the Tier-0 is reaching capacity limits, due to the very good LHC performance. Processing times are longer because of the high, levelled pile-up (mu) values in LHC fills.
  • Smooth grid production operations with ~250-300k concurrently running job slots. Smaller HPC contributions compared to last month, which saw peaks of up to 400k jobs.
  • The previously mentioned full derivations campaign has been delayed and is now planned for November. It will reprocess all data AODs, with about 3-4 PB of input and 8-9 PB of output (stored in two copies).
  • A reprocessing campaign from RAW of all data17 is planned for December and January. It will be launched in two parts, the first half to be run before the Christmas break and the second half in January.
  • Suboptimal throughput from EOS to CASTOR. The FTS team is investigating.
    • ADG: copying from EOS to other sites was OK, but not to CASTOR
    • Andrea M: the problem was fixed by adjusting the FTS configuration

CMS

  • Tier-0 reached its limit
    • transfer backlog
    • processing backlog
      • using additional CERN resources for some PromptReco
    • CMSONR ADG delays happened twice last week and again this Monday; running a single instance as a potential workaround; a severity 1 call is open with Oracle
  • Compute systems are busy
    • Monte Carlo samples for 2017 analyses
    • re-reprocessing of early 2017 data
    • Phase-II upgrade simulations
    • 2016 urgent samples for analysis finalization
    • thanks to sites for providing resources above pledge level right now!
  • crashes of the VM used by the central data transfer instance, PhEDEx, are being investigated
  • the upgrade to the new FTS version at CERN uncovered an OSG/Globus GridFTP bug affecting IPv4-only sites using an upgraded dual-stack FTS server
    • sites work around this by switching to Fermilab's FTS server (which is still at the previous version)
    • Brian developed a patch that should go into the next OSG release
  • a bug in the new RedHat 7.4 kernel causes CVMFS mounts to be lost inside Singularity
    • hit at a few sites; overcome by switching back to the old kernel (a quick probe for the symptom is sketched after this list)
  • the removal/unavailability of lxplus-ipv6 took us by surprise; advance communication would have been nice
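
As a quick probe for the CVMFS/Singularity symptom above, something like the following sketch can be run on a worker node; the container image path and repository are illustrative assumptions, not a CMS prescription.

```python
#!/usr/bin/env python
# Quick probe for the symptom above: is /cvmfs still visible inside a
# Singularity container on this kernel? Image and repository paths are
# illustrative assumptions.
import os
import platform
import subprocess

IMAGE = "/cvmfs/singularity.opensciencegrid.org/cmssw/cms:rhel6"  # assumed image
REPO = "/cvmfs/cms.cern.ch"

print("host kernel:", platform.release())
with open(os.devnull, "wb") as devnull:
    rc = subprocess.call(["singularity", "exec", IMAGE, "ls", REPO],
                         stdout=devnull, stderr=devnull)
print("CVMFS visible inside container" if rc == 0
      else "CVMFS NOT visible inside container (rc=%d)" % rc)
```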

LHCb

  • Usual activity, mostly MC and data processing
  • More staging than usual, in preparation for the restripping of Run 2 data
  • Everything is ready for the restripping and its validation has already started. We hope to launch it next week.

Ongoing Task Forces and Working Groups

Accounting TF

  • WLCG Accounting Task Force meeting took place on the 19th of October.
  • Steve Jones from Liverpool University is working on tools that would allow sites to validate APEL accounting information by comparing it with local accounting; more information about the recent developments can be found here
  • Progress of the Storage Space Accounting project will be presented at the November GDB. The current prototype collects ATLAS and LHCb storage space accounting data using SRM. The system is designed so that, once storage resource reporting is deployed at the sites (which implies an alternative way of querying free/used space), it will transparently switch to a non-SRM protocol; a parsing sketch follows this list.
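
For illustration, a minimal sketch of what the non-SRM path could look like on the collector side: fetch a site's storage resource reporting (SRR) JSON and print the per-share totals. The URL is a placeholder and the field names follow our reading of the SRR proposal; the final schema may differ.

```python
#!/usr/bin/env python
# Sketch of the non-SRM path on the collector side: fetch a site's storage
# resource reporting (SRR) JSON and print per-share space totals. The URL is
# a placeholder; field names follow our reading of the SRR proposal and may
# differ in the final schema.
import json
from urllib.request import urlopen

SRR_URL = "https://se.example.org/static/srr.json"  # placeholder endpoint

doc = json.load(urlopen(SRR_URL))
for share in doc["storageservice"]["storageshares"]:
    used = share.get("usedsize", 0)
    total = share.get("totalsize", 0)
    print("%-24s %8.1f / %8.1f TB  vos=%s" % (
        share.get("name", "?"), used / 1e12, total / 1e12,
        ",".join(share.get("vos", []))))
```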

Archival Storage WG

  • We would like to schedule the first phone conference of the WLCG Archival Storage Group (wlcg-data-archival-storage@cern.ch).
  • The purpose of the first meeting will be to establish what actions the group would like to undertake. As a reminder, the two triggers for the group were the following proposals:
    • establish a knowledge-sharing community for those operating archival storage for WLCG
    • understand how to monitor usage of archival systems and optimise their exploitation by experiments
  • Other proposals for topics are welcome.
  • More info is available in the slides presented at the October WLCG operations meeting.
  • Please note that progress will depend on a critical level of interest being demonstrated in the above topics - if you think they are worth pursuing, please join the meeting and make this known!

Information System Evolution TF

  • An Information System Evolution Task Force meeting was held on the 26th of October. The main topic was the support required in GOCDB for Storage Resource Reporting. The Storage Resource Reporting proposal was approved by the October GDB.
  • CRIC development is progressing well. The project was presented at the October GDB. A couple of meetings have been held with CMS to discuss the functionality of the CMS CRIC plugin.

IPv6 Validation and Deployment TF

The text of the ticket to be submitted to all Tier-2 sites is finalised. We are still waiting for a fix in the GGUS interface that will allow a ticket to be submitted to multiple sites. Most probably all UK Tier-2s will be targeted in the first round. A minimal connectivity probe that sites can already run is sketched below.
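
Pending the ticket campaign, a site can already run a basic IPv6 readiness probe against its own endpoints; the sketch below only checks for an AAAA record and a TCP connect over IPv6 (host and port are placeholders).

```python
#!/usr/bin/env python
# Minimal IPv6 readiness probe for a service endpoint: does the name have an
# AAAA record, and does a TCP connect over IPv6 succeed? Host and port are
# placeholders for a site's own SE or CE.
import socket

HOST, PORT = "se.example.org", 2811  # placeholder endpoint (gridftp port)

try:
    infos = socket.getaddrinfo(HOST, PORT, socket.AF_INET6, socket.SOCK_STREAM)
except socket.gaierror:
    print("no AAAA record for", HOST)
else:
    addr = infos[0][4]
    print("AAAA ->", addr[0])
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect(addr)
        print("IPv6 TCP connect OK")
    except OSError as exc:
        print("IPv6 TCP connect failed:", exc)
    finally:
        s.close()
```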

Machine/Job Features TF

Monitoring

MW Readiness WG


This is the status of JIRA ticket updates since the last Ops Coord of 2017-12-07:

  • MWREADY-152 - DPM 1.9.1/1.9.2 tested at GRIF. A regression found in 1.9.1 was fixed and released in v1.9.2.

2017-12-07

NTR

2017-11-02

This is the status of JIRA ticket updates since the last Ops Coord of 2017-10-05:

Network Throughput WG


  • WG update was presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
  • perfSONAR 4.0.2 is planned to be released in November
  • WLCG/OSG network services
  • WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.
  • HNSciCloud meshes are being created (one per provider); they will enable tests between HNSciCloud sites and providers

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Traceability WG

Container WG

  • Next meeting scheduled for mid November

Special topics

Security Operations Center WG Workshop/Hackathon

See the presentation

Review EL7 migration plans

  • CERN:
    • Around 1k cores are currently available for local and Grid use on HTCondor; more are possible, as needed, assuming reasonable utilisation. lxplus7 has been available for some time now.
    • Some test usage for both local and Grid.
    • Strawman schedule (still to be discussed at the next CERN Linux Certification Committee):
      • Jan 2019 (LS2): start migrating 30% of lxbatch to CC7. lxplus.cern.ch alias still pointing at SLC6.
      • June 2019 (LS2 + 6 months): change lxplus.cern.ch alias to point to CC7. Migrate 80% of the lxbatch resources to CC7.
      • June 2020 (LS2 + 18 months): stop lxplus6 and migrate remaining lxbatch6 resources to CC7.
      • Before December 2020 (tbd): CERN end of support for SLC6.
    • "It would be nice" to have Singularity in use by experiments before large-scale migration, as this will make it much easier (because experiment pilots can just instantiate what they need). Local docker deployment for batch compute will also help.
    • Services:
      • HTCondor(+CE) infrastructure nodes on CC7
      • Majority of other services already on CC7
      • Openstack hypervisors all running CC7
  • ASGC
    • Storage
      • New disk servers will come with centos7
      • For the slc6 disk servers there is no concrete migration plan yet, but it should happen some time in 2018.
    • Worker nodes
      • All newly arriving worker nodes will be centos7.
      • slc6 worker nodes will be gradually decommissioned.
  • BNL
    • dCache
      • All door servers and new storage servers are on RHEL7 now.
      • The current plan is to have all storage servers migrated to RHEL7 in early (Jan-Feb) 2018.
    • Compute nodes
      • One rack of SL7 worker nodes has been in production for months, using Singularity to run all jobs inside an SL6 container.
      • New equipment will come with SL7.
      • The current plan is to migrate all compute nodes to SL7 in early (Jan-Feb) 2018.
      • CE nodes will be migrated to SL7 when the farm nodes are migrated.
  • CNAF
    • Testing EL7 on a few VMs and one physical node, configured with UMD middleware. These worker nodes are on a separate queue and have already been given to a selected subset of (non-LHC) experiments running at CNAF for testing.
    • If the initial tests are OK, the plan is to install the next tender, arriving in January, with EL7, Singularity and UMD middleware, and to assign it to the experiments that can support that configuration. In the meantime we will configure some Docker containers to allow experiments not ready for EL7 to run on it as well.
    • It is important to stress that we are currently working on the HTCondor migration, so our effort is mainly on that task and the "unplanned" activity on EL7 support may be delayed.
  • FNAL
    • Have begun working with SL7 for some services
    • In winter 2018 we will work on transitioning compute nodes to an SL7 + Docker/Singularity container model
    • FNAL LPC analysis resources will begin a slow ramp from the current 100% SL6 to eventual SL7 -- a first node for users to do development work has been available for a few months.
    • No concrete plans for further services at this point
  • IN2P3
    • Computing
      • 15% (~5000 cores) of the worker nodes are on CentOS7, but today they are only used by local jobs (no grid). We expect to move some (non-LCG) grid applications very soon.
      • By the end of 2017 we expect CentOS7 to be the default platform for the batch system. An SL6 platform will be kept online to satisfy the pledges due to the experiments that cannot move to CentOS7.
      • We are very interested in the experiments' progress with Singularity, which would simplify the operating system migration.
    • Services:
      • dCache: fully on SL7
      • Computing elements: will be migrated to SL7 after the batch node migration
  • JINR: we will stay on SL6 for the WNs as long as CMS supports it, and slowly migrate grid services to SL7 in 2018.
  • KISTI:
    • Computing: a test-bed with HTCondor-CE and its batch system will be deployed on CentOS7. UMD configuration and integration with Grid services will be tested. After the validation, all services will gradually move to CentOS7.
    • Storage: validation of EOS and CentOS7 will be performed together. Once migration is confirmed to be OK, the disk storage will move first, then the tape buffer and its backend clusters.
  • KIT
  • NDGF:
    • Some dCache nodes are on CentOS 7
    • Only one T2 is on CentOS 7; T1 compute will probably change as new hardware is added
  • NL-T1:
    • SARA has already migrated.
    • NIKHEF expect to do this in the spring of 2018.
      • NIKHEF first want to replace their Quattor version which cannot handle CentOS7.
  • NRC-KI
  • PIC
  • RAL
    • All worker nodes are SL7. Jobs are run in either CentOS6 or CentOS7 Docker containers depending on the requirements of each experiment.
  • TRIUMF
    • TRIUMF is ready to migrate to EL7/SL7; the migration timeline depends on the experiments' requests and the readiness of services.
    • Some services are already running SL7, for example some dCache pools and one ARC CE.
    • Currently one cluster of compute nodes is running SL7, but on top of it we use Docker containers with SL6 inside.
    • We tested SL7 WNs and the ATLAS software validation passed, so we can migrate all WNs to SL7 once ATLAS is able to run production on SL7.
    • We are in the process of migrating all Tier-1 services and capacity to another data centre; all new services at the new location will be SL7.
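
Several of the site reports above (CERN, BNL, RAL, TRIUMF) rely on the same pattern: an EL7 host runs the experiment payload inside an SL6 container, so the experiment still sees its validated OS. A minimal sketch of such a wrapper is given below; the image path and payload are placeholders, not any experiment's actual pilot logic.

```python
#!/usr/bin/env python
# Sketch of the pattern several sites describe above: on an EL7 host, run the
# payload inside an SL6 container so the experiment still sees its validated
# OS. Image path and payload are placeholders, not any experiment's pilot.
import subprocess
import sys

SL6_IMAGE = "/cvmfs/singularity.opensciencegrid.org/library/centos:6"  # assumed image location
PAYLOAD = ["/bin/sh", "-c", "cat /etc/redhat-release"]  # placeholder payload

def host_el_major():
    # Crude EL major-version detection; enough for this sketch.
    with open("/etc/redhat-release") as f:
        return 7 if "release 7" in f.read() else 6

def run_payload():
    if host_el_major() >= 7:
        # EL7 host: wrap the payload in an SL6 container, binding CVMFS through.
        cmd = ["singularity", "exec", "-B", "/cvmfs", SL6_IMAGE] + PAYLOAD
    else:
        cmd = PAYLOAD  # already on an EL6 host, run natively
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(run_payload())
```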

Discussion

  • Christoph, Stephan: CMS is OK with sites going to EL7 plus Singularity
  • Maarten: CMS sites should ensure the Singularity configuration is OK for CMS (a configuration-check sketch follows)
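
As a starting point for such a check, here is a sketch that compares a host's singularity.conf against a set of expected settings. The key names exist in singularity.conf, but which values CMS actually requires is for CMS to document, so the expectations below are placeholders.

```python
#!/usr/bin/env python
# Sketch: check a Singularity installation's singularity.conf for settings a
# VO may depend on. The key names exist in singularity.conf; the expected
# values below are placeholders - the VO defines what it really needs.
CONF = "/etc/singularity/singularity.conf"
EXPECTED = {"allow setuid": "yes", "enable overlay": "yes"}  # placeholder values

settings = {}
with open(CONF) as f:
    for line in f:
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()

for key, want in sorted(EXPECTED.items()):
    have = settings.get(key, "<unset>")
    status = "OK" if have == want else "CHECK"
    print("%-5s %s = %s (expected %s)" % (status, key, have, want))
```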

  • Gavin: HTCondor Docker containers cannot have AFS, which other CERN users still need

  • CNAF: how should the EL7 WN be configured, as YAIM is not supported for that OS?
  • Alessandra F: at Manchester we integrated the WNs in our Puppet configuration. Some of the classes can be shared.
  • CNAF: we will try configurations without YAIM first for the dteam VO
  • CNAF: we are looking into HTCondor for later, we intend to upgrade CREAM + LSF WN first

Theme: Providing reliable storage - BNL

See the presentation

  • ADG: if clients go through proxy gateways depending on their location,
    can that be used to throttle external vs. internal traffic?
  • Ulf: external traffic is indeed capped like that at the Finnish T2
  • Hiro: the distinction is purely based on IP addresses (see the classification sketch below)
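
For illustration, the IP-based internal/external split Hiro describes boils down to a classification step like the following sketch; the site prefixes are placeholders, not BNL's actual ranges.

```python
#!/usr/bin/env python
# Sketch of the IP-based internal/external split described above: classify a
# client address against the site's own prefixes, so external traffic can be
# routed through (and throttled at) the proxy gateways. Prefixes are
# placeholders, not BNL's actual ranges.
import ipaddress

SITE_PREFIXES = [ipaddress.ip_network(p) for p in
                 ("192.168.0.0/16", "2001:db8:1234::/48")]  # placeholder ranges

def is_internal(client_ip):
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in SITE_PREFIXES)

for ip in ("192.168.7.20", "128.142.1.1", "2001:db8:1234::42"):
    print(ip, "->", "internal" if is_internal(ip) else "external (via gateway)")
```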

  • Maarten: do you use the dCache resiliency mechanism for some of the data?
  • Hiro: indeed, we use "archival" hard drives that cannot handle many clients
    at the same time, but can hold extra copies of important data

Action list

Creation date Description Responsible Status Comments
01 Sep 2016 Collect plans from sites to move to EL7 WLCG Operations Ongoing The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments that cannot use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we should not ask a vague question. Andrea M. said the UI bundle is also making progress.
Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting.
March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5 to be released in May
May 18 update: UMD 4.5 has been delayed to June
July 6 update: UMD 4.5 has been delayed to July
Sep 14 update: UMD 4.5 was released on Aug 10, containing the WN; CREAM and the UI are available in the UMD Preview repo; CREAM client tested OK by ALICE
03 Nov 2016 Review VO ID Card documentation and make sure it is suitable for multicore WLCG Operations Pending Jan 26 update: needs to be done in collaboration with EGI
26 Jan 2017 Create long-downtimes proposal v3 and present it to the MB WLCG Operations Pending May 18 update: EGI collected feedback from sites and propose a compromise - 3 days' notice for any scheduled downtime;
Oct 5 update: as both OSG and EGI were not happy with the previous proposals, and as this matter does not look critical, we propose to create best-practice recipes instead and advertise them on a WLCG page
14 Sep 2017 Followup of CVMFS configuration changes, check effects on sites in Asia WLCG Operations Pending

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Comments Deadline Completion

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Comments Deadline Completion

AOB

-- JuliaAndreeva - 2017-10-30
