WLCG Operations Coordination Minutes, December 1st 2016

Highlights

  • The Network and Transfer Metrics WG asks for experiments' participation at the dedicated Jan. 10th pre-GDB.
  • The RFC Proxies TF completed its mission and will be closed.
  • CMS is campaigning for its sites to migrate to the FTS3 client by the end of the year, provided all required PhEDEx changes are completed.
  • The Lightweight Site survey http://wlcg-survey.web.cern.ch/survey/lightweight-sites is still open for contributions.
  • Sites and Experiments, please provide feedback for the Advanced Warning of Long Shutdowns proposal available HERE before Jan 5th by sending a mail to wlcg-ops-coord-chairpeople at cern dot ch.

Agenda

Attendance

  • local: Maria Alandes (chairperson), Maria Dimou (notes), Alberto Aimar, Maarten Litmaath, Julia Andreeva, Marian Babik, Jérôme Belleman, Marcelo Soares, Andrea Manzi, Alexander Kryukov, Mark Slater, Andrew McNab.
  • remote: John Gordon, Alessandra Doria, Catherine Biscarat, Stephan Lammel, Di Qing, Dave Mason, Kyle Gross, Thomas Hartmann, Ulf Tigerstedt, Frédérique Chollet, Alessandra Forti, Christoph Wissing, Massimo Sgaravatto, Oliver Keeble, Renaud Vernet, Antonio Perez-Calero Yzquierdo, Carlos Acosta, Hung-te Lee, Robert Ball, Pepe Flix, Elena Korolkova.
  • Apologies: Nurcan Ozturk

Operations News

  • Next WLCG Ops Coord meeting will be on 12th January 2017 (exceptionally skipping the 1st Thursday of the month rule!)
  • There will be a WLCG Workshop in 2017 (End June/Beginning July). Sites interested in hosting the workshop, please, get in touch with wlcg-ops-coord-chairpeople@cern.ch. More details coming soon.

Middleware News

  • Useful Links:
  • Baselines/News:
    • As broadcast by EGI, sites should start using the UMD4 repo instead of UMD3
    • UMD 4.3.0 has been released: http://repository.egi.eu/2016/11/10/release-umd-4-3-0/ .
    • UMD 4.3.1 also released:
      • CentOS7: lcas-lcmaps-gt4-interface 0.3.1, lcmaps 1.6.6, lcas 1.3.19, glexec-wn 1.3.0, lcmaps-plugins 1.7.1 etc.
      • SL6: ARGUS 1.7
    • We suggest sites move to ARGUS 1.7 as it fixes instabilities (to be added as baseline at some point)
    • New versions of xrootd-server-atlas-n2n-plugin and dcache-xrootd-n2n-plugin pushed to the WLCG repo.
    • glite-yaim-core-5.1.4-1 released to WLCG repo (and under integration in UMD)
      • It fixes GGUS:125108 (affecting CREAM) and adds EL7 support
    • glite-yaim-clients-5.2.1-1 released to WLCG repo. It fixes some issues related to WLCG VOBOX on EL7
  • Issues:
    • NTR

Maarten reminded sites that the EMI3 repository is no longer maintained. Please move to UMD4 a.s.a.p.

  • T0 and T1 services
    • BNL
      • FTS upgraded to v 3.5.7
    • CERN
      • check T0 report
      • FTS upgraded to v 3.5.7; SOAP disabled on the Pilot cluster
    • INFN
      • If ATLAS does not recover the file protocol, they need to increase the number of GridFTP servers
    • JINR
      • dCache minor upgrade 2.13.48 -> 2.13.49
    • KIT
      • Update of dCache for CMS to 2.13.48, which fixes issues with dccp pre-staging. Created cmsxrootd-2.gridka.de as a new CMS AAA redirector, though no feedback yet.
      • FAX services will be decommissioned at the next opportunity at KIT, though no date is fixed as of now
    • RAL
      • FTS upgraded to v 3.5.7
    • RRC-KI
      • dCache for ATLAS and LHCb disk storage upgraded to 2.16.18-1
      • Upgrade dCache for tape instance to 2.16.18-1 planned

Tier 0 News

  • Some CC7-based capacity has been made available in HTCondor for local submission.
  • An issue with local job submission to HTCondor has been acknowledged by the Condor team; a fix is being prepared. Meanwhile a workaround has been deployed at CERN.
  • We are re-publishing data into APEL from April 2016 on. The new accounting summaries originate from the new system, and are hence expected to be more accurate.
  • During November about 5 PB have been recorded into CASTOR. For the p-Pb run, CASTOR ALICE is now using a Ceph-based pool in addition to the standard disk resources. Progressively, all the (re-)processing traffic has been moved to Ceph, completely replacing a 1.5 PB staging area.
  • Some instabilities in EOS have been observed and handled.
  • FTS has been upgraded to 3.5.7.
  • The network intervention on November 2nd did not cause any major disruption on the Tier-0 services.
  • A block corruption affected a table in COSMIC@LCG database system, which caused a downtime of the LCG database. Within four hours it was fully restored.
  • Backup is being rolled out for applications hosted in the Hadoop service.
  • An issue leading to long delays replicating the CMSONR database from P5 to the computer centre has been worked around by enabling compression.
  • A development instance of Apex 5 has been made available.
  • Progress is being made on the link between the GEANT PoP in Budapest and the Wigner data centre; we can thus expect the 3rd link to be commissioned soon.
  • We received an unexpected request to postpone maintenance work on the Wigner link until after the last week of heavy ion running in case it led to problems with data recording. Experiments should not be relying on the availability of this link to take data, as we cannot exclude situations where traffic would not be able to reach above 100 Gbps.
  • (For information - not a Tier-0 topic) Orange have completed optimisation of their network in the Pays de Gex and have good to very good coverage for 2G/3G/4G services. Some residual issues are being worked on.

Tier 1 Feedback

  • NDGF-T1 recovered all but 8 of the 1.2 million files (35+ TB) from a broken RAID set (3 disks lost from a RAID-6). The real cause was a SAS cable or disk backplane fault, so the controller simply lost contact with the 3 drives. Parts replaced, everything working again.

Tier 2 Feedback

Experiments Reports

ALICE

  • HI data taking progressing as planned
    • Recorded data is being calibrated and quality-checked quasi-online
  • High to very high grid activity on average
  • Several incidents with job failures related to the Stratum-1 at ASGC:
    • Experiment package updates may be absent or late, presumably due to network issues
    • Also observed by ATLAS
    • The biggest ALICE sites in Asia have bypassed the Stratum-1 at ASGC to avoid random job failures
    • The CVMFS developers intend to make the client yet more robust in failing over to other
      instances when the default Stratum-1 cannot deliver a requested file for whatever reason
    • They also provided test commands that a pilot could run to check the health
      of the local CVMFS setup before committing itself to a user payload
    • The Stratum-0 hosts have been put on the OPN for better network connectivity
      • That appears to have improved the situation for ASGC
  • Intermittent staleness has also been observed for other Stratum-1 instances
    • In particular for alice-ocdb.cern.ch which is updated many times per day
    • Will be followed up further

Maarten highlighted the issues met with Stratum-1. Stratum-0 hosts are now on the OPN. Performance should be better.
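
The pilot-side health check mentioned in the ALICE report could look roughly like the sketch below. The actual test commands supplied by the CVMFS developers are not recorded in these minutes, so this is only an assumption-based stand-in: it merely checks that a repository directory can be listed and is not empty. The function name `cvmfs_repo_looks_healthy` is hypothetical.

```python
import os


def cvmfs_repo_looks_healthy(repo_path):
    """Return True if the repository directory can be listed and is not empty.

    This is a deliberately simple stand-in for a real CVMFS health check:
    it does not detect stale catalogs or a lagging Stratum-1.
    """
    try:
        entries = os.listdir(repo_path)
    except OSError:
        # Covers a missing mount point, permission errors and I/O errors.
        return False
    return len(entries) > 0
```

A pilot could call this with a path such as /cvmfs/alice.cern.ch before accepting a user payload and decline the payload when it returns False. Detecting staleness, as reported for alice-ocdb.cern.ch, would need an additional check that this sketch does not attempt.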

ATLAS

  • Smooth data taking and processing for Heavy Ion run.
  • Many competing high-priority activities on the grid: MC production, derivation production for Moriond, Upgrade simulation, BPhys_delayed stream processing (ran successfully from tape without pre-staging), Release 21 validation.
  • Tier0 is running beam spot reprocessing.
  • Task force to study derivation production throughput was formed; CPU efficiency, memory usage, job settings, output merging options (slow vs fast) are being studied.
  • Migration of group datasets to tape ongoing. 2.7PB to be put on tape.
  • CVMFS service at TAIWAN was regularly falling behind in updating to the latest version of the contents; it is running fine this week.

Nobody present or connected from ATLAS.

CMS

  • Heavy Ion Run
    • No major issues on the computing side
    • Recording data close to the DAQ limit in order to collect maximum number of events
  • Monte Carlo production
    • DigiReco jobs that in part also use the premixing technique
    • Moved T2 fair-share target temporarily from 50%:50% to 75%:25% to favor Moriond 2017 MC production over analysis
    • The DigiReco part of the Moriond 2017 production uses the pileup premixing technique

  • Requesting sites to adjust the cleaning procedure of /store/unmerged to accommodate long-lasting workflows
  • Mandatory upgrade of PhEDEx agents to version 4.2 is ongoing
  • GlideinWMS: two dedicated high-performance VMs provided by CERN, being evaluated in the Global Pool scalability test

  • CMSWEB: scaling issues being investigated
  • One bad worker node at GRIF produced corrupted files. A rather peculiar problem, under deep investigation by CMS (GGUS:125142)
  • DashBoard outage two weeks ago (GGUS:125037, GGUS:125089)

Andrea M. asked if CMS is able to move to the new FTS3 client. Stephan Lammel confirmed that all CMS sites are being asked to move to FTS3 before the end of the year. Christoph said a PhEDEx upgrade is needed first. Open Action: CMS will open GGUS tickets to sites. Check-point in January.

LHCb

  • All activities progressing well
    • Data taking continues, transfers from the PIT are ok
    • Heavy Ion processing ramping up
    • Some MC jobs running, though waiting on more requests from the collaboration

Ongoing Task Forces and Working Groups

Accounting TF

  • Task force proposal for changes in the accounting portal and accounting reports has been approved by the MB
  • Changes will be implemented and validated during next week
  • For November two sets of Reports will be sent around, current ones and new ones (generated by the portal).
  • In January we should be able to switch to the new reports

Julia emphasised that Ivan Diaz, developer, is progressing well. The November reports will be sent in December in two versions.

Information System Evolution


  • VOfeed management and documentation discussed at the last IS TF meeting.
    • Working on a twiki page where VOfeed structure will be documented and discussed at the next meeting.
    • It was decided that VOfeed changes and general strategy will be discussed at the IS TF from now on.
  • Ongoing discussions on which syntax based on GLUE 2 should be used to enter more info in GOCDB.
  • New person selected to work on the CRIC project. Contract procedure ongoing. Likely to start in January.
  • Next IS TF meeting will take place on 8th December:
    • GOCDB writeable API
    • VOfeed Documentation
    • Proposal to introduce extended attributes in GOCDB
  • Reminder on REBUS known issues regarding capacity numbers. Please check them before opening any GGUS ticket or spending time trying to understand capacity numbers.

Maria A. highlighted the point on REBUS.

IPv6 Validation and Deployment TF


No report.

Machine/Job Features TF

  • DB12 benchmark included in mjf-scripts distribution for all batch platforms

Monitoring

Full report later today.

MW Readiness WG


This is the status of Actions from the 19th MW Readiness WG meeting of 20161102.

  • 20161102-05: Christoph to investigate EL7 UI testing by CMS. Keep Andrea S. informed as maintainer of the workflow twiki.
  • 20161102-04: Andrea M. to update the pakiti documentation.
  • 20161102-03: Maria to remove the out-of-date Tasks overview from the WG twiki. DONE twiki up-to-date and announced on 20161201.
  • 20161102-02: Stefan to appoint a LHCb member to join the WG. DONE Marcello is appointed.
  • 20161102-01: Andrea S. to update the CMS workflow twiki.
  • 20160518-02: EL7 experiments' intentions Done via Ops Coord on Sep 1st - see details on the agenda

This is the status of jira ticket updates since the last Ops Coord of 20161103:

  • MWREADY-104 DPM 1.9.0 SRM-less verification for ATLAS at LAPP Annecy - on-going
  • MWREADY-139 FTS 3.5.7 for ATLAS & CMS at CERN - completed
  • MWREADY-140 ARC-CE 5.2.0 on EL7 for CMS at Brunel - on-going
  • MWREADY-30 ARGUS for CMS at Brunel - no update for a month - status unknown

Maria Dimou is changing duties in CERN IT and leaves this WG in the competent hands of the MW Officer Andrea Manzi. Many thanks to the Volunteer Sites and the experiment contacts for their excellent work!

Network and Transfer Metrics WG


  • pre-GDB on NETWORKING will take place on 10th of January, preliminary agenda now available at https://indico.cern.ch/event/571501/
    • If you plan to attend, please register at https://indico.cern.ch/event/571501/ to help us with logistics
    • Please let us know what you would like to see come out from this meeting. If there are additional topics you would like to see in the agenda or modifications to existing items, please let us know.
    • Invitation was sent to all four experiments
  • Next throughput meeting planned 14th of Dec
    • Focus on perfSONAR RC validation
  • perfSONAR team announced that they plan to release 4.0 RC3, which will push final release to next year
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Marian highlighted the 10 Jan 2017 pre-GDB. Please send comments to the WG. LHCb and CMS to announce participants. Invitation was sent to the experiment computing coordinators. CMS participation is highly desired.

RFC proxies

  • The next major version of ARC will not support legacy proxies!

  • RFC proxy failures for some services and some CAs (GGUS:124650)
    • certificates of affected CAs have the non-repudiation flag set
      • so far only GridCanada certificates were seen to be affected
      • there exist more such CAs
    • affected services are dCache SRM < 2.14 and BeStMan/EOS SRM
    • the consensus is that the fault lies in JGlobus, used by both
    • JGlobus is not officially supported by anyone these days
    • the dCache team and OSG are looking into "private" builds with an easy fix

  • VOMS clients 3.0.7 and YAIM core 5.1.3 are fully available now
    • UMD-3 updated Nov 7
    • UMD-4 updated Nov 10 for SL6
    • EPEL 7 has VOMS for EL7
    • WLCG repo has YAIM core for EL7

  • RQF:0675511 has been opened to get lxplus fixed (not yet done):
    • latest VOMS clients
    • correct GT_PROXY_MODE environment variable

  • Experiments may want to check where legacy proxies are still being used
    • and switch those areas to RFC proxies at their convenience
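
As a rough aid for spotting legacy proxies, the sketch below classifies a proxy by the last CN component of its subject DN. It relies on the convention that legacy Globus proxies append "CN=proxy" or "CN=limited proxy" to the subject, while RFC 3820 proxies append a numeric CN. This is a heuristic only: pre-RFC draft proxies also use a numeric CN, so a definitive check has to inspect the certificate extensions. The function name and the example DNs are hypothetical.

```python
def classify_proxy_subject(subject_dn):
    """Classify a slash-separated subject DN by its last CN component.

    Returns "legacy" for Globus legacy proxies, "rfc-style" for a numeric
    last CN (RFC 3820 or pre-RFC draft proxies), and "not-a-proxy" otherwise.
    """
    last = subject_dn.rstrip("/").rsplit("/", 1)[-1]
    if not last.startswith("CN="):
        return "not-a-proxy"
    value = last[len("CN="):]
    if value in ("proxy", "limited proxy"):
        return "legacy"
    if value.isdigit():
        return "rfc-style"
    return "not-a-proxy"


# Hypothetical subject DNs, for illustration only.
print(classify_proxy_subject("/DC=org/DC=example/CN=Some User/CN=proxy"))
print(classify_proxy_subject("/DC=org/DC=example/CN=Some User/CN=1234567890"))
```

Applied to the subject DNs found in service logs, a check like this could give a first indication of which workflows still submit with legacy proxies.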

Maarten emphasised the Nov. presentation by Mattias. The JGlobus issue described in the GGUS ticket is still pending but is being followed up by the dCache developers and OSG. He also mentioned the VOMS client news and the ticket requesting that lxplus be updated.

This TF can be closed now.

Squid Monitoring and HTTP Proxy Discovery TFs

  • CMS is now using the WLCG WPAD service in production, at about half a dozen opportunistic sites
  • The service was slightly expanded to cover all the organizations that have a squid in a multi-organization site (such as the ATLAS Midwest Tier2), with the closest squid first

Dave not connected.

Traceability and Isolation WG

Theme: Lightweight sites' survey results by Maarten

Maarten walked through the slides, which are linked from the agenda, with the key dates of the survey and reminders. 51 sites have responded so far. Two sites responded twice with different contents each time. Two T1s also covered the T2s they represent. Some T3s and/or sites with potential participation in WLCG ("tourist" sites) also replied. The questionnaire is still open and will remain so for a few more weeks. There is at least one site that has a very different configuration from the others, e.g. no CE, only based on VMs. Some questions may have been misunderstood by the sites, though each question came with some explanatory context. Kudos to the listed sites for taking the survey! The details of individual answers will not be publicly disclosed. Puppet is not used everywhere, e.g. some sites prefer Ansible. Both Marias said the WLCG Ops portal can be used (comment on slide 7 about 'Improve documentation') to carry links to the existing locations (twikis or other pages). The removal of obsolete documentation is on Maarten's virtual to-do list. NB! The previous survey led to a request from sites for an entry point to documentation, which is why we developed the wlcg-ops portal. See https://twiki.cern.ch/LCG/WLCGSiteSurvey

Theme: HTCondor Accounting by John Gordon

John walked through the slides, which are linked from the agenda. Julia asked who coordinates the different activities. John said nobody. It would be good to form a Task Force for following this. Julia and Maria A suggested this issue should be discussed offline as this is not only an Accounting issue but a more general HTCondor deployment follow-up. Pepe asked when this work will start so that PIC can adopt the same solution. They will be in touch offline as well.

Theme: Sites data in the new monitoring (MONIT) by Alberto

Alberto walked us through the slides, which are linked from the agenda. What is shown in the Unified Monitoring slides is available already or, for some parts, will be very soon. The variety of tools adopted (kibana, grafana etc.) was needed because they complement each other in functionality. John asked if the FTS data come from sites other than the FTS server at CERN. The answer is yes: the data come from all the FTS servers and cover all the WLCG transfers. A CERN login is required for access. Alberto asked for volunteer site managers to help make such dashboards. Alessandra asked for a simple view for sites so that kibana knowledge is not needed. Maarten suggested that ATLAS and CMS, the main dashboard users, ask a couple of their sites to assist Alberto in configuring standard dashboards to spare other users the kibana internals. Stephan asked for some standard views, which would be useful for CMS. Julia said the Site Status Board requires a lot of data aggregation. Alberto said SAM aggregations will be started towards the end of Q1 2017.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01.09.2016 | Collect plans from sites to move to EL7 | WLCG Operations | On-going | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF will stay on SL6 for now but they plan to go directly to EL7 early 2017. Other ATLAS sites e.g. Triumf are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. |
| 03.11.2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Searching for the doc location and editor. Is it in the EGI wiki? |
| 03.11.2016 | Discuss internally on how to follow up long term strategy on experiments data management as raised by ATLAS | WLCG Operations | Pending |  |
| 03.11.2016 | Check status, action items and reporting channels of the Data Management Working Group | WLCG Operations | Pending | Julia gives an update on behalf of Oliver |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify HTCondor CE type name in experiments VOfeeds | all | - | Proposal to use HTCONDOR-CE. In progress for ALICE. Raja will ask the status for LHCb. |  | Ongoing |
| 03.11.2016 | Proposal for advance warning of long site downtimes | All | - | Proposal from WLCG Ops ready. Please, check and give feedback | 5th January 2017 | Proposal DONE. Waiting for feedback from Experiments |
| 01.12.2016 | Open tickets to sites for moving to FTS3 client |  |  | There are PhEDEx prerequisites | Year End 2016 | January 2017 |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 01.12.2016 | Proposal for advance warning of long site downtimes | All | - | Please, give feedback to this proposal | 5th January 2017 | In progress |

AOB

-- MariaDimou - 2016-11-17

Topic attachments

  • LongShutdowns.pptx (r1, 1678.3 K, 2016-12-01 15:32, MariaALANDESPRADILLO)
Topic revision: r48 - 2018-02-28 - MaartenLitmaath