WLCG Operations Coordination Minutes, July 7th 2016

Highlights

  • Good development work completed by the Squids TF (details). A collaboration with the Info Sys TF is starting in order to discuss CRIC - Squid functionality issues.
  • IPv6 can now be recommended for deployment by the sites. The next MB will mandate a specific deployment action plan. Clear instructions will be included in the document being finalised by the HEPiX IPv6 WG in consultation with all services.

Agenda

Attendance

  • local: Maria Dimou (minutes), Maria Alandes (chair), Maarten Litmaath, Jerôme Belleman, Andrea Manzi, Marian Babik, Andrea Sciabà, Renato Santana.
  • remote: Ulf Tigerstedt, Javier Sanchez, Hung-Te Lee, Almudena (UAM - Madrid ATLAS T2), Jose del Peso, Nurcan Ozturk, Di Qing, Stephan Lammel, Thomas Hartmann, Christoph Wissing, Antonio Maria Perez Calero Yzquierdo, Dave Mason, Aresh, Alessandra Doria, Gareth Smith, Rob Quick, Pepe Flix, Renaud Vernet, Andreas Petzold, Catherine Biscarat, Dave Dykstra.

Operations News

  • No meeting in August
  • Next meetings:
    • 1st September
    • 29th September (As 6th October is too close to CHEP)
  • WLCG workshop taking place on 8-9 October, co-located with CHEP 2016. Draft Agenda under discussion. More details in September
    • Very likely to dedicate the topic discussion of the next Ops Coord meetings to prepare for the workshop

Middleware News

  • Useful Links:
  • Baselines/News:
  • Issues:
    • A series of issues affected CMS transfers out of the T0. Apart from operational issues, on the middleware side a problem has been discovered in FTS, which does not correctly fill a specific SRM put request parameter needed by dCache (so the default value of 1 hour is used). This has caused problems with long-lasting transfers. For now a configuration change has fixed the issue at CERN and RAL (BNL upgraded as well, since the same problem could affect ATLAS). FNAL cannot apply the same configuration change as they are running a very old version of FTS + gfal2. An upgrade of the FNAL FTS should be scheduled.
    • GGUS:120586 concerning an issue with glite-ce* clients with dual stack CEs. A fix is going to be provided tomorrow by CESNet.
  • T0 and T1 services
    • BNL
      • FTS upgraded to v 3.4.6
    • CERN
      • Top BDII upgraded to SLC 6.8
      • FTS upgraded to v 3.4.6 ( upgrade to 3.4.7 planned next week)
    • CNAF
      • Updated gpfs version (to 4.1.1-7) on cms cluster and gridftp/xrootd server reinstalled for OS upgrade
    • KISTI
      • Xrootd upgraded from v 3 to v4.
      • xrootd for the tape buffer to be upgraded to v4.3.0 (an issue concerning frm_purge is preventing this)
    • NDGF
      • dCache upgraded to 2.16.4 last week
    • RAL
      • FTS upgraded to v 3.4.6
    • TRIUMF
      • dCache upgraded to 2.13.31

Tier 0 News

Highlights

Traffic between CERN and Wigner continues to be above 100 Gb/s, occasionally reaching the 200 Gb/s limit. The local loop for the third link is still being clarified.

Experiments are sending unprecedented volumes of data into the Tier-0; in June the data from the LHC experiments amounted to 10 PB.

Large-scale disk retirement (with target dates between mid July and the end of 2016) is continuing. So far the hardware for Hadoop and many (relatively) small pools in CASTOR and EOS was consolidated. For the largest retirement (ALICE CASTOR), replacement hardware is being commissioned; a few weeks of co-existence with the nodes to be retired will automatically drain out the latter.

A new, less invasive, procedure for deploying OS updates on nodes hosting Oracle databases was introduced. The patching takes place while the databases are fully functional; rolling service interruptions are limited to the duration of machine reboot only. As a result of this new procedure, the intervention time will be shortened from an hour to 15 minutes (on average) per database node.

On CMS' request, 3'000 cores of their pledge have been transferred to the CMS Compute Cloud for their Tier-0 processing. 1'000 cores have been added to the ATLAS Tier-0 (LSF-based) facility, which is entirely based on nodes with SMT disabled due to their memory requirements.

The procurement of commercial IaaS resources had met major issues with the APIs the supplier had proposed. Following CERN's feedback, the supplier promptly and successfully worked on resolving the issues. 4'000 cores are now being deployed and will be used for three months.

Issues

A batch of fibre connection interfaces has been identified as causing very low rates of packet loss, which could affect the data transfers between CERN and Wigner. These interfaces are being replaced.

Some instabilities have been observed on ATLAS storage, which requires deploying existing patches (they are already in use for the other experiments).

Long-lasting data transfers for CMS data export flowing through the gridftp doors were experiencing issues that became worse due to increased data taking rates. The reason has been found, and fixes have been deployed. The accumulated data backlog to be exported to the Tier-1s will start to be absorbed.

Plans

A set of Oracle security patches is expected to be available on July 19th, the relevance of which will be assessed.

40'000 cores are in the process of being added as compute capacity. 2'500 cores will go to HTCondor, the rest to LSF. The capacity is partly used to replace kit to be retired; partly it serves to provide additional flexibility. CERN's pledges for 2016 remain unchanged.

Local job submission to HTCondor is being tested and documented; users will soon be invited to test.

Maria A. asked why the new HTCondor nodes are so few if this is the choice for the future. Answer by Jerôme in email: 2'500 additional cores for HTCondor is a conservative estimate. Checks on grid versus local submission, occupation and efficiency are currently going on; as a result, we expect that more resources may be moved into HTCondor.

Tier 1 Feedback

PIC/IFAE

(1)
https://indico.cern.ch/event/438205/contributions/2202194/attachments/1292213/1926345/2016-06-16_Herding_the_protocol_zoo.pdf
https://docs.google.com/spreadsheets/d/1NIi2OWvQzDzATmv3RNkXbQ-Yw9pFtTC1HvMkzyzbonU/edit#gid=0
https://ggus.eu/index.php?mode=ticket_info&ticket_id=122054
https://its.cern.ch/jira/browse/FAX-146

(2)
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FAXdCacheN2Nstorage
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/GlobalLogicalFileName
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/UsingFAXforEndUsers
https://twiki.cern.ch/twiki/bin/viewauth/AtlasComputing/FaxOperationsGuide

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • CERN: intermittent CASTOR errors were addressed by the CASTOR team

ATLAS

  • The Tier-0 is experiencing a backlog due to unexpectedly high data taking rate. One run was reco’ed on the grid successfully (spillover). A second run is being processed now.
  • File transfers from CERN EOS are currently slow, affecting the spillover run; this is being checked with the EOS team.
  • EOS namespace crashed and restarted twice in the week of June 21st.
  • RRC-KI-T1 lost storage DB on June 8th : 37k unique files lost out of 131k lost files. Will be regenerated.
  • Release 21.0.2 is being validated (currently has a software bug). MC16 simulation will run with Release 21.

CMS

  • Data taking continues at full scale, putting the infrastructure under some stress because the LHC uptime is larger than initially estimated (the LHC had stable beams for 59% of last week)
  • Data transfer backlog at CERN
    • about 2.7 PB of backlog to be transferred out
    • backlog occupies space on CERN EOS, reducing available space
    • we are deleting old datasets (mostly at Fermilab) and trying to tune FTS parameters to increase transfer rates
  • EOS Tier-0 transfer issue
    • What CERN sees is saturated network interfaces on the 4 gridftp doors for the T0 EOS instance (40 Gbps = 5 GB/s), although CMS sees 1 GB/s max
    • discrepancy is very likely due to 'lost' bandwidth in failed transfers
    • CERN added two additional gridftp doors as temporary solution
    • Identified and increased internal timeout settings in EOS, FTS, and dCache
    • suspended reconstruction output in favour of analysis dataset transfers
    • TOR switch at Wigner dropping packets at high load identified this morning
  • Started RE-RECO of 2016 data
  • Thanks to help from CERN IT the VO-feed is updated switching the type of Condor CEs, as agreed among the experiments.
  • CNAF: needed to declare a downtime for the CMS SRM interface but could not (technically) also set a downtime for the CMS queues; it is not foreseen that a service used by multiple VOs can be set into downtime for a single VO
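
The bandwidth and backlog figures quoted above can be put in perspective with a little arithmetic. The following is only a rough sketch: it assumes decimal units (1 PB = 1e15 bytes), a constant export rate, and no new data arriving while the backlog drains, none of which is stated in the minutes. The function names are illustrative.

```python
# Back-of-the-envelope figures for the Tier-0 export backlog.
# Assumptions (not from the minutes): decimal units, a constant export
# rate, and no new data arriving while the backlog drains.

BITS_PER_BYTE = 8
SECONDS_PER_DAY = 86_400
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1e6 GB


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert a link rate in Gbit/s to GB/s."""
    return gbps / BITS_PER_BYTE


def drain_days(backlog_pb: float, rate_gbytes_per_s: float) -> float:
    """Days needed to export backlog_pb petabytes at a constant rate."""
    return backlog_pb * GB_PER_PB / rate_gbytes_per_s / SECONDS_PER_DAY


# 40 Gbit/s aggregated over the 4 gridftp doors, as quoted in the minutes.
door_capacity = gbps_to_gbytes_per_s(40)

if __name__ == "__main__":
    print(f"door capacity: {door_capacity:.1f} GB/s")
    print(f"2.7 PB at 1 GB/s: {drain_days(2.7, 1.0):.1f} days")
    print(f"2.7 PB at door capacity: {drain_days(2.7, door_capacity):.1f} days")
```

Under these assumptions the 2.7 PB backlog would take roughly a month to drain at the 1 GB/s that CMS observed, and roughly six days if the full 5 GB/s of the doors were usable for successful transfers.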

ATLAS reported also having been affected by the TOR switch in Wigner.

LHCb

  • "Smooth" Data taking, even more than expected. Stress in the system, without major problems.
  • Reco, Stripping, MC and users jobs running "fine" in the grid.

The old DIRAC portal was still referred to for the LHCb VO feed, but that had already been corrected by the time of the meeting. The HTCondor CEs seemed not to be correctly included yet.

During the meeting a celebratory email came in reporting a record LHCb luminosity of 500 pb-1!

Ongoing Task Forces and Working Groups

Accounting TF

Julia being on holiday, this report will appear in these minutes on her return.

Information System Evolution


  • An IS TF meeting took place on 16.06.2016:
    • EGI presented their main motivations to keep on relying on the BDII. There were discussions on which areas would be affected if WLCG stops relying on the BDII. Some of them are: MW upgrades and EGI 2nd line support. If WLCG finally decides to stop BDII, the impact of this needs to be better understood.
    • There was a proposal to drop capacity views from REBUS. There was agreement within the TF to do this. It was decided to present this at the MB for official green light. However, after discussions with I. Bird, it was decided not to do anything for the time being until the new IS is in place. It was agreed that since REBUS capacity known issues have been documented, if sites open tickets complaining about wrong values, they won't be fixed for the time being and sites will be pointed to the known issues page.
  • CRIC evaluation for CMS was over and CMS decided to engage further in the project. Developers and CMS people are now discussing the next steps.
  • A GDB presentation to report about the status of the TF and CRIC is scheduled for the 13.07.2016.

IPv6 Validation and Deployment TF


This was today's dedicated theme. See notes and presentations further down.

Machine/Job Features TF

Middleware Readiness WG


The 18th meeting of the WG took place yesterday. Full minutes here. Summary:

  • Old and inactive tickets in jira will be closed by the MW Officer after last verification with Product owners and/or Volunteer site managers. See here which ones.
  • We need to know about upcoming MW product releases. We also have a permanent poll open for new Volunteer sites, especially for MW Readiness verification on CentOS7. See here what we know so far for the near future.
  • The agenda topic on WG Mandate review and meeting frequency discussion was postponed due to lack of participants in this meeting. For more people joining this effort, we hope to get some publicity at the WLCG Workshop in October.
  • Due to summer holidays, WLCG Workshop & CHEP preparations and aftermath, the proposed date for the next meeting is Wed Nov. 2nd @ 4pm CET. Please email the e-group of the WG for comments.

Network and Transfer Metrics WG


  • WG update presented at ATLAS TIM including discussion on the mid-long term network evolution
  • pre-GDB on networking focusing on the mid-long term network evolution will be held in December
  • North American Throughput meeting held on 22nd of June:
    • Andy Lake presented new features planned in perfSONAR 4.0
    • Next meeting end of July, main topic: pScheduler (replaces bwctl)
  • WLCG Throughput meeting held 16th of June:
    • Main topic was re-organization of the meshes, the proposal was agreed and implemented
    • New experiment-based meshes were introduced and are now in effect in the production dashboard (see twiki)
    • Next meeting in Sept. (co-located with LHCOPN/LHCONE)
  • WLCG Network Throughput Support: Several new cases were reported and are being followed, see twiki for details.
  • perfSONAR 4.0 (formerly 3.5) RC to become available end of August, WLCG validation and deployment campaign will follow.
    • Introduces several major changes such as new configuration management and interface as well as migration from BWCTL to pScheduler

RFC proxies

  • EGI and OSG have agreed to the default proxy type being made RFC
    • a new version of the VOMS client rpm will be provided (GGUS:122269)
    • a new version of YAIM core for myproxy-init is available
      • temporary location: here
    • the intention is for the UMD to be updated when both are available

Squid Monitoring and HTTP Proxy Discovery TFs

  • Much progress since last WLCG Ops meeting
    • Squid monitoring now has a fully functional exceptions specification file on the wlcg-squid-monitor.cern.ch server, enabling the MRTG plots to stably monitor round-robin addresses even at CERN where individual servers taken out of service are removed from the round-robin.
      • The list of squids generated from GOCDB/OIM registrations plus the exceptions now includes IP addresses that stay stable until they get replaced by a new machine in a round-robin.
      • The only remaining item to do is to preserve old entries when a DNS failure happens
    • There is now a http://wlcg-wpad.cern.ch/wpad.dat service based on the above list of squids mapped to GeoIP organizations, but it turned out to be of very low quality. There are many overlapping sites in the same GeoIP organization with registered squids (see the site "altnames" in the GeoIP-org-sorted list of squids for details), but more importantly we came to realize that we can't by default make registered squids known for opportunistic use by CMS or ATLAS for Frontier, because they may be engineered for the relatively light traffic of CVMFS and we wouldn't want to overload them.
      • Instead, the plan now is to generate the auto discovery list from those squids registered for Frontier separately by ATLAS (in AGIS) and CMS (in SITECONF files) and to cross-check them against the GOCDB/OIM registrations to look for inconsistencies.
      • Multi-site GeoIP organizations are going to have to also specify a range of IP addresses applicable to each site, probably in the ATLAS & CMS frontier squid registrations.
    • Fermilab now has a pair of 10 Gbit/s servers for external proxy services in production just like CERN's. They serve four purposes:
      1. CMS Frontier backup proxies -- for CMS sites with failing squids -- monitored for failover
      2. CMS Opportunistic squids -- for CMS to use at opportunistic sites that do not have usable squids of their own
      3. CVMFS backup proxies -- for all CVMFS uses with failing squids -- one day to be monitored for failover
      4. LHC@home proxies -- for all LHC@Home sites
    • In addition to http://lhchomeproxy.cern.ch/wpad.dat listing geo-sorted LHC@home proxies there is now a http://cmsopsquid.cern.ch/wpad.dat listing geo-sorted cmsopsquid proxies. The latter is for CMS to use if they don't find any usable squids in http://wlcg-wpad.cern.ch/wpad.dat.
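
The fallback order described in the last item can be sketched as follows. This is a minimal illustration, not the TF's implementation: the two URLs are those quoted above, while `parse_pac_result`, `discover_proxies` and the `lookup` callable are hypothetical names. Since a wpad.dat file is a JavaScript PAC script, the sketch assumes the caller supplies a `lookup` function that fetches and evaluates it and returns the usual PAC result string (for example `PROXY host:3128; DIRECT`).

```python
from typing import Callable, List, Optional

# URLs quoted from the minutes; the order encodes the fallback described there.
WPAD_URLS = [
    "http://wlcg-wpad.cern.ch/wpad.dat",
    "http://cmsopsquid.cern.ch/wpad.dat",
]


def parse_pac_result(result: str) -> List[str]:
    """Extract 'host:port' entries from a PAC-style result string."""
    proxies = []
    for item in result.split(";"):
        parts = item.split()
        # Keep only 'PROXY host:port' entries; skip 'DIRECT' and anything else.
        if len(parts) == 2 and parts[0].upper() == "PROXY":
            proxies.append(parts[1])
    return proxies


def discover_proxies(lookup: Callable[[str], Optional[str]]) -> List[str]:
    """Return the proxies from the first source that yields any.

    lookup(url) is a hypothetical caller-supplied function that fetches and
    evaluates the wpad.dat at url, returning the PAC result string or None.
    """
    for url in WPAD_URLS:
        result = lookup(url)
        if result:
            proxies = parse_pac_result(result)
            if proxies:
                return proxies
    return []


if __name__ == "__main__":
    # Simulated answers: the first source knows no usable squid for this client.
    canned = {
        WPAD_URLS[0]: "DIRECT",
        WPAD_URLS[1]: "PROXY squid1.example.org:3128; PROXY squid2.example.org:3128",
    }
    print(discover_proxies(canned.get))
```

Injecting `lookup` keeps the ordering logic testable without network access or a JavaScript engine; the example host names are placeholders.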

Dave said that obtaining data from French sites and CNAF is problematic. The TF also needs to have an update on experiment-specific Info Systems and the upcoming CRIC. Will wpad be needed in the future? Maarten advised caution with the publicity of the squids, to avoid them getting overloaded. Marian shared a similar experience with the GeoIP proximity service in perfSONAR in the past, which led the network WG to the decision not to use it. Dave replied that GeoIP is ok for CVMFS as there are few of them.
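
The cross-check planned by the TF, between squids registered for Frontier by the experiments and those registered in GOCDB/OIM, amounts to comparing two sets of host names. A minimal sketch, assuming both lists have already been extracted (how they are obtained from AGIS, SITECONF, GOCDB or OIM is not shown, and the function and key names are hypothetical):

```python
from typing import Dict, Set


def cross_check(frontier_squids: Set[str], registered_squids: Set[str]) -> Dict[str, Set[str]]:
    """Compare experiment-side Frontier squids with GOCDB/OIM registrations.

    Only the first category is necessarily an inconsistency: a squid that is
    registered but absent from the experiment lists may simply be a
    CVMFS-only squid.
    """
    return {
        "missing_from_gocdb_oim": frontier_squids - registered_squids,
        "not_used_for_frontier": registered_squids - frontier_squids,
        "consistent": frontier_squids & registered_squids,
    }


if __name__ == "__main__":
    # Placeholder host names, for illustration only.
    experiment_side = {"squid1.example.org", "squid2.example.org"}
    registry_side = {"squid2.example.org", "squid3.example.org"}
    for category, hosts in cross_check(experiment_side, registry_side).items():
        print(category, sorted(hosts))
```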

Dave asked for the dates of the Ops Coord meetings to be known well in advance. They are listed at http://indico.cern.ch/category/4372/. The usual rule is the first Thursday of the month, with occasional exceptions for practical reasons. He also asked for the Traceability WG to report to these meetings.

Theme: IPv6

Andrea S. presented the document being discussed at the same time in a dedicated meeting (slides by Andrea), and Pepe presented the relevant deployment status at PIC (slides by Pepe), an experience that can be useful to other sites as well. NDGF is also advanced with IPv6 and will give a presentation next time. The IPv6 TF will assemble these reports in their web pages.

During the discussion Maarten emphasised the need to keep IPv4 on the WNs as long as there are storage systems not yet running IPv6. The sites can easily maintain a dual stack for a long time. He also said, during Pepe's presentation, that Torque is still supported, so it should be fully IPv6-compliant at some point, but Maui is not, so if it doesn't work in a dual-stack environment, this could be a problem for the sites. Pepe said the batch server (which runs Maui) might be kept on IPv4 while the WNs are dual stack. Maarten said the WN IPv4 addresses may be private if the site cannot or does not want to use public addresses for them (as has been the case for many years). For PIC this potential problem goes away as they plan to move to HTCondor. The IPv6 TF needs to give guidelines to other sites which run Torque with Maui.
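
As a small illustration of what "dual stack" means for a service name, the sketch below checks which address families a host name resolves to, using the standard `socket.getaddrinfo` call. The helper names are hypothetical, and the check is deliberately limited: a name resolving over both IPv4 and IPv6 does not prove that the service actually listens on both.

```python
import socket
from typing import Callable, Set

FAMILY_NAMES = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}


def address_families(host: str, resolver: Callable = socket.getaddrinfo) -> Set[str]:
    """Return the set of address families ('IPv4', 'IPv6') a name resolves to."""
    try:
        infos = resolver(host, None)
    except socket.gaierror:
        return set()  # name could not be resolved at all
    # Each entry is (family, type, proto, canonname, sockaddr).
    return {FAMILY_NAMES[info[0]] for info in infos if info[0] in FAMILY_NAMES}


def is_dual_stack(host: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """True if the name resolves to both an IPv4 and an IPv6 address."""
    return address_families(host, resolver) == {"IPv4", "IPv6"}


if __name__ == "__main__":
    # 'localhost' is only a stand-in; substitute a storage or CE host name.
    print(address_families("localhost"))
```

The `resolver` parameter exists only so the logic can be exercised with canned answers instead of live DNS.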

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify HTCondor CE type name in experiments VOfeeds | all | - | Proposal to use HTCONDOR-CE. Waiting only for ALICE and LHCb now. | | Ongoing |

Specific actions for sites

Creation date Description Affected VO Affected TF Comments Deadline Completion

AOB

Maria D. reported the recent availability of short video tutorials on LHC@home (BOINC installation and configuration). See the videos for each OS on http://lhcathome.web.cern.ch/join-us or the CERN Edutech page https://twiki.cern.ch/Edutech

-- MariaDimou - 2016-07-05

Topic revision: r44 - 2018-02-28 - MaartenLitmaath
 