WLCG Operations Coordination Minutes, September 1st 2016

Highlights

  • The new accounting portal with the changes made for the WLCG view is ready for validation: https://accounting-next.egi.eu/. People are encouraged to start using it and provide their feedback or report any problems to wlcg-accounting-portal@cern.ch.
  • Regular reports from the Monitoring Consolidation and Traceability WGs will be given at Operations Coordination meetings from now on.
  • EL7 plans from experiments were presented at today's meeting and summarised in the minutes twiki. Site plans to be collected by WLCG Operations.

Agenda

Attendance

  • local: Maria Alandes (minutes), Alberto Aimar, Stefan Roiser, Luca Tomassetti, Vincent Brillault, Thomas Oulevey, Maarten Litmaath, Julia Andreeva, Andrea Sciaba, Helge Meinhard, David Cameron, Andrea Manzi, Maria Dimou, Marian Babik
  • remote: Pepe Flix (chair), Nurcan Ozturk, Ulf Tigerstedt, Felix Lee, Di Qing, Javier Sanchez, Christoph Wissing, Olivier Devroede, Antonio Maria Perez Calero Yzquierdo, Kyle Gross, Massimo Sgaravatto, Shkelzen Rugovac, Victor Zhiltsov, Dave Mason, A. Cavalli, Hironori Ito, Dave Dykstra, Ron Trompert, Vincenzo Spinoso, Eygene Ryabinkin, Tim Bell, Andrea Valassi

Operations News

Middleware News

  • Baselines/News:
  • Issues:
    • GGUS:123613 reports crashes of openldap with ARC-CE on C7 (a similar problem also affected SL6 and has been fixed). The supposedly fixed rpms are available in RHEL 7.3 beta 1, but a first test showed that the issue is still there. We have contacted the ARC developers to do more testing.
    • GGUS:120586 concerns an issue with glite-ce* clients and dual-stack CEs. No news yet
  • T0 and T1 services
    • CERN
      • All CASTOR instances upgraded to xrootd 4.4
      • ARGUS upgraded to 1.7.0 (C7)
      • FTS upgraded to v 3.4.7
    • IN2P3
      • Minor dCache upgrade planned (2.13.32->2.13.41)
    • JINR-T1
      • Minor updates: dCache 2.10.59->2.10.61, Postgres 9.4.8->9.4.9
      • dCache 2.13.x in late November
    • NDGF
      • Upgraded to dCache 2.16.10 last week and moved to HA messaging
      • Switch to a fully HA setup for database and SRM within September
    • RAL
      • FTS upgraded to v 3.4.7

Tier 0 News

Highlights

  • Network traffic out of CERN peaked at 253Gbps in July, mostly driven by increasing volumes of LHC data being exported. To cope with the increased export rates, links to some Tier-1 centres have had their capacity increased.
  • Traffic volumes have also increased over LHCONE, leading GÉANT to deploy additional capacity. An estimate of likely traffic in 2017 will be required.
  • Due to record LHC performance, in July and August, about 10 PB and about 6 PB were recorded into Castor, respectively.
  • A couple of CERN-Wigner link instabilities were impacting EOS performance (a reconfiguration recipe is now in place to cope with these situations). Other transfer issues previously reported (mainly affecting ALICE and CMS) have been solved.
  • Minor upgrades (maintenance) have been completed on CASTOR (v. 2.1.16-6) and are still continuing on EOS (moving all instances to 0.3.193) in agreement with the experiments. The hardware retirement for Q3 is complete. Smaller retirements are expected in Q4 or beginning of 2017.
  • Obsolete WLCG-related Oracle accounts (for example SAM and GridView) are being deleted. Another major clean-up will take place next year after the migration of experiment dashboards to Hadoop.
  • Work is going on with CMS on using Spark on the CERN Hadoop services for CMS' use cases of Big Data Analytics and Deep Learning. There is a series of tutorials on Hadoop and related technologies to help users get started with their data processing use cases.
  • The Oracle security upgrade of July was not deployed, as it contains nothing relevant for the CERN installation.
  • The LSF instance for public batch has reached its supported limit of 5'000 nodes. Further capacity increases are implemented in the HTCondor pool.
  • The main HTCondor pool has reached 21'000 cores; users are starting to test local submissions. A significant number of worker nodes are being regularly defragmented in order to support multi-core as well as single-core grid submissions.
  • The external cloud resources (some 4'000 cores) are now processing jobs from all LHC experiments.
  • ARGUS has been upgraded to the 1.7 release on CentOS 7, which supposedly fixes all known bugs. Since the upgrade, the service has run with good stability; it hence appears that the previous issue of failing glexec probes, especially for CMS, is now resolved.
  • New resources have been added to the public batch pools and the ATLAS Tier-0 instance.

Issues

  • There have been performance problems in the CMS offline database (CMSR) caused by misbehaving applications. Work is going on with CMS to improve the applications concerned.

Highlights (for information only, not strictly Tier-0 related)

Tier 1 Feedback

Pepe invites Ulf to summarise the information he sent to the Ops Coord co-chairs in a mail thread before the meeting, describing NDGF plans to move to EL7. Ulf explains that some NDGF sites are indeed going to deploy new HW in early 2017 and that it would be good to understand the experiments' plans, to know whether EL7 can be used or not.

Tier 2 Feedback

  • T2_BE_IIHE: network issues with Russian T1/T2
    • Core problem: Russian sites not accessible via Géant (hence hitting restrictions on our commercial traffic)
    • Main identified impacting use case: Phedex dataset transfers coming from T1_RU_JINR
    • Investigated solution: LHCONE (was not a solution)
    • Is it clearly acknowledged? What do other Tier 2 sites do? What is the recommendation?

RU-VRF (Eygene Ryabinkin) answer: we (.ru LHCONE) aren't currently connected to GEANT and I don't see any progress here. Regarding this very site (reaching the network via Belnet), we can try to peer with Belnet's LHCONE space directly; we're present at AMS-IX, where Belnet is also present, so that is technically solvable. We had already offered this option to Belnet, but to no avail. We have direct peering with Belnet itself (in Amsterdam), but Belnet probably wants some money from T2_BE_IIHE for this kind of connectivity.

It is agreed that the affected Belgian sites get in touch with both Belnet and Edoardo to understand how traffic from Russia could stop coming through the commercial route and use LHCONE links instead. For these cases there's no straightforward solution and every case needs to be investigated independently since each country has its own connectivity.

LHCb mentions that they experienced some problems last week downloading data from Russia into the T-Systems cloud resources, which could very well have a similar explanation.

Experiments Reports

ALICE

  • Typically high to very high activity
    • New record: 110636 concurrent jobs
  • CERN: as of July 28 jobs are also being submitted to the T-Systems external cloud resources
    • up to 3k+ cores have been used in parallel
    • job success rates have been good
    • job types include simulation, reco and user analysis
      • analysis trains were excluded to reduce the load on the 10 Gbps link to CERN
    • use of the DPM has been postponed, pending a fix to allow the disk servers to be monitored in MonALISA
      • big UDP messages are not correctly handled by the T-Systems NAT setup

ATLAS

  • Smooth operations towards ICHEP, end user data delivered on time.
  • The grid is now full with Monte Carlo production from the new Sherpa samples. There was an occasional lack of payload and the grid was not fully utilised in August, but it is back to normal in the last week.
  • To help with the T0 backlog, 3 runs were spilled over to the grid and ran successfully. More automation is underway.
  • A new T0 workflow called RAWtoALL (skipping production of intermediate ESDs) was deployed at the T0 (15% gain in CPU); it was also tested on the grid for spill-over, and more tests are upcoming to better understand and control the high memory usage.
  • Slots unused by T0 processing are fully used by the grid.
  • Waiting for news from CERN-IT on the new monitoring based on the new infrastructure, we would like to have our ADC live pages monitored there and provide feedback.

CMS

General CMS Activities

  • Successful data taking with a rather high logging rate (see also below)
  • High production activity in July to prepare for ICHEP
  • Larger disk cleaning campaign finished
    • We likely need another one to host upcoming data and 2017 MC
  • Tape cleaning campaign is ongoing
    • Actual deletion requests to be sent to the sites in September
    • Allow sites to re-pack (if needed) in October
  • New MC Campaign started
    • Jobs are now multi-threaded (4 threads)
    • Relies on remote data access for reading premixed pileup data

Various Operational Items

  • Wigner network link
    • Redundancy was limited several times over the last few weeks
    • On July 25th, P5 to Tier-0 transfers were badly affected
    • No trouble reports in August

  • Data export from CERN
    • A multi-petabyte backlog of data to be exported had developed
    • Transfer speed out of CERN greatly improved in July
      • Additional GridFTP doors were deployed in the CERN EOS instance
      • Faulty network components were identified and fixed
    • CMS reduced the volume that needs to be transferred out of CERN
    • Present performance expected to be sufficient for remaining pp run

  • Monitoring infrastructure
    • Observed various Kibana instabilities: July 15th, Aug 9th, between Aug 24th-29th
    • Access issues for Dashboard metrics

LHCb

  • High activity
    • Data taking continues, transfers from the PIT are ok
    • Reconstruction, Stripping, MC and users jobs running "fine"
      • Reco is a little behind, waiting for data transfers, but catching up
  • T-Systems in use: download and upload problems from everywhere (~20% of the time) due to the network connection. IT people are aware.

  • CERN lbvobox108 was inaccessible because its hypervisor died (GGUS:123484). An alarm ticket was submitted on 18 Aug at 19:17. Fixed on 19 Aug, with the VM reachable at 11:40 (motherboard replaced).

Ongoing Task Forces and Working Groups

Accounting TF

Dave asks whether there are OSG people involved in the TF; Julia answers that Tanya Levshina is indeed part of the TF and recently presented the status of OSG accounting at one of its meetings.

Pepe asks how people can give feedback on the new portal. Julia says there is an e-group for this: wlcg-accounting-portal@cern.ch.

Information System Evolution


  • At the last MB, a proposal to adopt CRIC as the new Information System was approved. A new project associate will join the development team in the coming weeks.
  • The next IS TF meeting is scheduled for 22 September. Information sources and the main functionality of the central CRIC will be discussed.

IPv6 Validation and Deployment TF


Machine/Job Features TF

Working through the last step of the mandate: "Plan and execute the deployment of those implementations at all WLCG resources"

MJF deployments on batch systems are passing tests at Cambridge (Torque), GridKa (Grid Engine), Liverpool (HTCondor), Manchester (Torque), PIC (Torque), RAL Tier-1 (HTCondor) and RAL PPD (HTCondor). We are aware of other sites currently working on deployment (e.g. CERN, CC-IN2P3). The UK intends to deploy at all sites.

All Vac/Vcycle VM-based sites have MJF available as part of the VM management system.
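
As an illustration of what an MJF deployment provides to jobs, here is a minimal sketch (in Python, not taken from any official MJF client) assuming the MJF convention that the MACHINEFEATURES and JOBFEATURES environment variables point to directories containing one small file per key; the key names used below are illustrative examples.

    # Minimal sketch of reading Machine/Job Features values from a payload job.
    import os

    def read_mjf(env_var, key):
        """Return the value of one MJF key as a string, or None if not published."""
        base = os.environ.get(env_var)
        if not base:
            return None  # MJF not deployed on this resource
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except (IOError, OSError):
            return None  # this particular key is not provided

    print("Machine HS06  :", read_mjf("MACHINEFEATURES", "hs06"))
    print("Job slots     :", read_mjf("MACHINEFEATURES", "jobslots"))
    print("Wall limit (s):", read_mjf("JOBFEATURES", "wall_limit_secs"))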

Pepe will get in touch with Andrew to report some outdated links in the TF twiki.

Monitoring Consolidation

Link: http://monit.cern.ch (open to all CERN authenticated users)

  • Will start reporting on the progress of Monitoring at this meeting.

  • Progressing with adding data to the new portal based on Kibana and ElasticSearch (FTS, Xrootd, Job Monitoring, SAM/ETF)
  • Prototyping plotting and reporting using the data stored in HDFS.

  • The service (not yet in production) has been less stable in the last few weeks.
  • Increasing and restructuring the ES (ElasticSearch) infrastructure in order to be more scalable and independent of other ES projects.
  • Participating in the benchmarking of ES resources.

On a question whether sites are also participating in the project, Alberto answers that this is not a WLCG project, so in principle sites are not involved, but of course useful feedback from sites can be taken into account. Maarten asks how sys admins could report possible issues; Alberto answers that there will be means to do this through GGUS/SNOW.

Pepe also mentions that in any case it would be very good to hear about CERN's experience in this project, as it could be useful for other sites. Alberto is happy to give regular updates on the status of this.

MW Readiness WG


  • Not so many activities during the summer holidays
  • MWREADY-133 FTS 3.4.7 verification for ATLAS & CMS completed at CERN
  • MWREADY-136 FTS 3.5.0 for ATLAS & CMS at CERN - on-going
  • MWREADY-135 WN bundle for C7 (in preparation, most probably ready by the end of the month)
  • MWREADY-137 ARC-CE 5.1.1 on C7 to be tested by Raul at Brunel. The fix for openldap on C7 is breaking the ARC-CE InfoSys
  • Due to summer holidays, WLCG Workshop & CHEP preparations and aftermath, the proposed date for the next meeting is Wed Nov. 2nd @ 4pm CET. Please email the e-group of the WG for comments.
  • We very much miss LHCb participation in the WG meetings and email exchanges. E.g. we know nothing about their CentOS7 plans, despite 2 reminders during the preparation of this meeting.

LHCb will provide the information about EL7 requested by the TF at today's meeting.

Network and Transfer Metrics WG


  • Network session is planned at the WLCG workshop covering IPv6, LHCOPN/LHCONE status and LHC network evolution
  • LHCOPN/LHCONE workshop will be held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
  • pre-GDB on networking focusing on the long-term network evolution postponed to January
  • Throughput meetings were held on July 27 and August 16:
    • Mark Feit from Internet2 presented pScheduler (new test scheduler in perfSONAR 4.0)
    • Xinran Wang and Ilija Vukotic from Univ. of Chicago presented their Network Analytics work
  • The OSG datastore and collector have been experiencing problems since the upgrade last week; the issue is being followed up by OSG
  • A plan for the migration to the new perfSONAR 4.0 configuration was drafted and will be followed up with OSG
  • We are now using a new mailing list, wlcg-network-throughput-wg@cern.ch - a joint mailing list for the European and NA throughput meetings
  • WLCG Network Throughput Support Unit: New cases were reported and are being followed up, see twiki for details.

On a question whether the WG should follow up on the issues reported by the Belgian sites, Marian replies that since the issue is more architectural, it is better to follow up with LHCONE.

RFC proxies

  • OSG have made RFC proxies the default in their latest release (3.3.15)
    • they submitted a voms-proxy-init patch to the VOMS developers
  • an "official" new version of the VOMS client rpm should be released soon (GGUS:122269)
  • a new version of YAIM core for myproxy-init on the UI is available
    • temporary location: here
  • the intention is for the UMD to be updated when both rpms are available

Squid Monitoring and HTTP Proxy Discovery TFs

  • http://wlcg-wpad.cern.ch/wpad.dat now returns proxies based on the squids registered in ATLAS AGIS. http://wlcg-squid-monitor.cern.ch/worker-proxies.json contains the input to the service. Quite a number of sites are disabled because of overlapping GeoIP organizations.
  • Inclusion of squids from CMS SITECONF has had its initial development done, and it needs more refinement. It still needs to map CMS site names to those registered in GOCDB/OIM. Also, CMS SITECONF allows the client to do load balancing between squids, which the WPAD format does not natively support. The format is a subset of JavaScript, however, so the plan is to distribute the load based on the lowest-order bits of the client's IP address (see the sketch below).
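
As a rough illustration of that plan (a minimal sketch only, not the actual wlcg-wpad implementation; the real output is a JavaScript PAC function, and the squid URLs below are made up), the selection logic amounts to the following Python:

    # Sketch: spread clients across a site's squids using the lowest-order bits
    # of the client IP address, so that adjacent IPs land on different squids.
    import ipaddress

    def pick_squid(client_ip, squids):
        """Pick one registered squid deterministically from the client's IP."""
        low_bits = int(ipaddress.ip_address(client_ip)) & 0xFF  # lowest 8 bits
        return squids[low_bits % len(squids)]

    squids = ["http://squid1.example.site:3128", "http://squid2.example.site:3128"]
    print(pick_squid("10.0.17.42", squids))  # first squid
    print(pick_squid("10.0.17.43", squids))  # second squid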

Traceability and Isolation WG

More details on the WG website

  • Traceability:
    • There was an agreement that we should build upon each existing VO logging infrastructure, which seems to already cover the expected logs. We might seek minor adjustments, but no global rewrite
    • VOs invited to do a self-assessment: identify the cost and bottleneck for identifying a user and the associated payload from a timestamp and an IP address/workernode name.
  • Isolation:
    • There are missing components for full unprivileged containers on RedHat 7.2. OSG and CERN have open tickets/feature requests. RedHat has been expressing security concerns...
    • Brian Bockelman has identified an open-source software solution aimed at unprivileged "containers" (with no daemon) and has been working on building a RHEL6 job environment from CVMFS, to be presented at the next meeting.
  • Next meeting: September 14 at 1600 (after the GDB)

Theme: CentOS7 - Experiment plans and info for sites

Challenges of CentOS7 migration

Thomas presents the timeline of EL7. Now is the right time to propose changes, as we are in the middle of the first production phase; afterwards it will be more difficult to add new features and bug fixes. It is good to discover issues as soon as possible. It has to be taken into account that the release process at RedHat can be very long.

On a question whether RedHat 8 is already planned, Thomas says there are no dates yet.

Pepe asks when the missing MW for EL7 is planned to be added to the EGI repositories. Maarten replies that the missing packages are expected to be made available this autumn.

Thomas mentions that the unprivileged mount namespace feature needed by WLCG has been rejected for CentOS 7.3 because of security concerns. It will be requested again for CentOS 7.4 next year. Maarten explains that this is a very important feature for WLCG, as it would allow pilot jobs to create fully functional SL6 containers on CentOS 7 hosts, which would be the straightforward solution if experiments still need to run SL6 software while sites need to move to CentOS 7 for new HW.
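
To illustrate what the requested feature would enable (a minimal Python sketch only, not an endorsed recipe; it only works on kernels where unprivileged user namespaces are enabled, which is precisely what is being asked for in CentOS 7):

    # Sketch: an unprivileged process creating user + mount namespaces, the
    # building block that would let a pilot set up an SL6 root (e.g. from CVMFS)
    # on a CentOS 7 host without being root.
    import ctypes, os

    CLONE_NEWNS = 0x00020000    # new mount namespace
    CLONE_NEWUSER = 0x10000000  # new user namespace

    libc = ctypes.CDLL(None, use_errno=True)
    uid, gid = os.getuid(), os.getgid()

    if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0:
        # On kernels with unprivileged user namespaces disabled, it fails here.
        raise OSError(ctypes.get_errno(), "unshare() failed")

    # Map our unprivileged uid/gid to root inside the new namespace.
    with open("/proc/self/uid_map", "w") as f:
        f.write("0 %d 1" % uid)
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    with open("/proc/self/gid_map", "w") as f:
        f.write("0 %d 1" % gid)

    # From here a pilot could bind-mount an SL6 root from CVMFS and chroot into it.
    print("namespaces created; euid inside the namespace:", os.geteuid())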

ALICE

  • the ALICE offline packages in CVMFS are still built on SLC5
    • used on SL6 (at most sites) and a few other OS flavors
  • an official validation has been run on CC7
  • the results were identical to those obtained on SLC6
  • conclusion: ALICE looks ready to run on CentOS/EL7

Tim asks whether there are any plans to stop building on SL5, as it is coming to its end of life. Maarten replies that this will be discussed very soon.

ATLAS

See the ATLAS plans in the slides attached to the agenda. There are no plans to move ATLAS central services and WNs until LS2. If sites need to move to EL7 WNs, feedback is very welcome. ATLAS SW built on SL6 is being tested on EL7. Builds for EL7 are available and are being tested at EL7 sites. lcg-utils is not available on EL7, but it is being phased out in ATLAS, and sites that need to move to EL7 can already use either gfal-copy or xrdcp. Sites that want to test ATLAS SW on EL7 should contact Alessandra.

CMS

See the CMS plans in the slides attached to the agenda. It is not possible to run SL6 binaries on EL7. SW is built on EL7 but not tested yet. CMS needs feedback from WLCG on migration plans to EL7, so that CMS can understand whether or not a container solution needs to be envisaged to meet the timeline.

Stefan mentions that the official timeline to move to EL7 announced by WLCG is until the end of LS2. Maria mentions that this needs to be revisited, as sites now seem to have earlier time constraints to install EL7, and WLCG would need to acknowledge this and work to cope with the new situation, finding solutions and recipes to be able to run SL6 on EL7 machines. Sites that have the effort to deploy VM or container solutions are OK, but sites that lack this effort and deploy EL7 on new HW present the experiments with a problem.

It is mentioned that there is a new (setuid) tool called Singularity that seems to offer an easy container solution. Brian Bockelman will give more details at the next Traceability WG meeting.

Tim asks whether the EL7 timeline refers to the end or the start of LS2. Helge replies that SL6 was supposed to be the official OS until the end of Run 2.

It is agreed that WLCG could assess the current situation by collecting from sites their plans to move to EL7 or stay with SL6, together with their current knowledge of and available effort for deploying VM or container solutions.

LHCb

  • Simulation workflows with slc6 binaries have been executed on EL7 worker nodes successfully
    • No firm confirmation yet for slc5 binaries (very few simulation workflows) but seems possible
  • What are the plans for moving the WLCG middleware, the VOBOX infrastructure and lxplus to EL7?
    • The lxplus alias should be moved to the EL7 cluster as the last action in the migration procedure
  • What are the future plans for the HEP_OSlibs meta-package? LHCb is interested in continuing to use it.

Stefan asks how long SL6 will be supported. Tim replies: as long as it is needed. Tim adds that for VOboxes sys admins can choose to deploy SL6 or EL7 images. The lxplus migration needs to be coordinated with the experiments; lxplus7 is already available. Stefan mentions that the alias change should only happen at the very end. Maarten adds that HEP_OSlibs is already available for EL7 in the WLCG repo and that EGI wants to offer the complete MW stack on EL7 this autumn / by the end of the year.

Helge mentions that performance indicators for EL7 show a 5% gain in performance.

Regarding UI and WN, there is already a testing folder for the EL7 UI and there will be one for the WN too.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01.09.2016 | Collect plans from sites to move to EL7 | WLCG Operations | NEW | |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify HTCondor CE type name in experiments VO feeds | all | - | Proposal to use HTCONDOR-CE. Waiting only for ALICE and LHCb now. | | Ongoing |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

AOB

-- MariaALANDESPRADILLO - 2016-08-30
