WLCG Operations Coordination Minutes - March 5th, 2015

Agenda

Attendance

  • local:

  • remote:

Operations News

Middleware News

  • Baselines:
    • FTS 3.2.32 : Important fixes for Activity Shares
    • dCache 2.6.40, 2.10.18, 2.11.9 : various bug fixes and a possible vulnerability on FTS doors
      • dCache 2.6.x end of support is 06/2015

  • T0 and T1 services
    • CERN
      • FTS upgraded to v 3.2.32
      • CASTORALICE upgraded to v 2.1.15
    • BNL
      • dCache upgraded to 2.10.20 from 2.6.18
      • Xrootd upgraded to 4.1.1 and dev version
      • FTS upgraded to v 3.2.32
    • FNAL
      • StoRM upgraded to 1.11.7 for ATLAS
      • XRootd upgraded to 4.1.1 for ATLAS
    • IN2P3
      • dCache upgraded to 2.10.18
      • Xrootd upgraded to 4.1.1 for FAX
    • JINR-T1
      • FTS upgraded to v 3.2.32
    • NDGF
      • dCache upgraded to 2.12.0 (early adopter, not yet officially released)
    • PIC
      • dCache upgrade to 2.10.20 (or the latest available version) planned for the 10th of March
    • RAL
      • FTS upgraded to v 3.2.32

  • TarBall WN and UI status and Maintenance, regular installation in CVMFS
    • We would like to install the latest version of the tarball UI and WN in CVMFS
    • The status of the support is unclear, as is who is working on assembling the tarballs
    • IN2P3 is interested in testing new versions of the tarball when available

Tier 0 News

  • LHCb LFC migration: Done on Monday.
    • the only LFC instance left is the "shared" one; we will be monitoring its usage in order to retire it at the same time as the LHCb one and fully decommission the LFC service at CERN
  • VOMRS decommissioning and replacement by VOMS-admin: done on Monday March 2nd.
    • The intervention went more or less well; a few certificates went missing during the migration. We have recovered most of them but it's still work in progress.
    • The experiments are still adapting to the new interface so they report lots of things/issues that we need to explain to them.
    • Overall, things look OK and are getting better each day.
  • FTS3 upgraded to 3.2.32-1 on March 3rd

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
  • successful T1-T2 Workshop in Torino, Feb 23-25
    • thanks to the organizers and the participants!
    • many useful presentations by and/or for ALICE site admins!
  • VOMRS to VOMS-Admin migration
    • ALICE can work, but some issues were observed and reported in the long ticket (GGUS:110227)
    • we thank Alberto and Andrea C. for their continued efforts in this matter!

ATLAS

  • Cosmic data taking (M8) ongoing smoothly.
  • P1 to EOS and Castor tested. Various iterations done, situation now is good.
  • MC15 just started: simul and then (digi+reco). 4B events (1PB of total disk space), approx 2 months of workload for the whole grid.
    • ATLAS considers this the beginning of Run2. Sites, please plan your maintenance carefully.
  • Networking: we experienced issues with the transatlantic connection. It is useful to have such information propagated to the experiments; this is being discussed with the networking and transfers task force.

CMS

  • Ongoing activities
    • Cosmics with main magnet off (CRUZET)
    • Mainly production of Upgrade MC
    • Moderate load in the system
  • Tape staging test at Tier-1 sites
    • Finished successfully: CNAF, PIC, CCIN2P3, RAL
    • Almost finished: FNAL
    • KIT: the data set used for the test was on very old equipment
      • Had some issues
      • Will be repeated with data on recent equipment
  • VOMRS Migration
    • Feature to delegate approval to national/regional representatives not yet active (configured)
  • Migration to a single global Condor pool for Analysis and Production done
    • Tier-2 sites have basically stopped receiving jobs with VOMS role production
    • 80% of the fair share should be allocated to VOMS role pilot (will serve analysis and production)
    • 10% for VOMS role production for some legacy use
    • 10% for any other roles/groups
    • Updated CMS Policies twiki and VO card accordingly
    • No changes in the Tier-1 fair share configuration yet!
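The Tier-2 fair-share targets above can be written down as a small table and sanity-checked. This is only a sketch: the dictionary keys and the helper names are hypothetical labels chosen for illustration, not batch-system or VOMS syntax, and each site would translate the percentages into its own scheduler configuration.

```python
# Hypothetical representation of the Tier-2 fair-share targets listed above.
# Keys are illustrative labels only, not real batch-system syntax.
TIER2_FAIR_SHARE = {
    "role=pilot": 80,       # serves both analysis and production
    "role=production": 10,  # kept for legacy use
    "other": 10,            # any other roles/groups
}


def check_shares(shares):
    """Return True if the percentage shares add up to exactly 100."""
    return sum(shares.values()) == 100


def share_for(role, shares=TIER2_FAIR_SHARE):
    """Look up the target share for a VOMS role, falling back to 'other'."""
    return shares.get(f"role={role}", shares["other"])
```

For example, `share_for("pilot")` gives the pilot share, while any role without its own entry falls into the "other" bucket.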

The CMS tape staging test was discussed in the last meeting. Here are some numbers; one should note, though:

  • Test done with reading - assumes same performance for writing
  • Share is approximately the site's fraction of the WLCG tape pledge
    • Actual distribution depends on assignment of primary datasets
  • Assume no big interference with other VOs
  • CMS plans only moderate staging from tape during data taking

Site    Pledged Tape (PB)   Share   Expected Rate (MB/s)
FNAL    30                  43%     650
CNAF    10                  14%     210
JINR    5                   10%     150
KIT     7.5                 10%     150
RAL     6                   9%      135
IN2P3   5.6                 9%      135
PIC     3.8                 5%      75
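The expected rates in the table appear to scale linearly with the share. A minimal sketch of that relation follows, assuming an aggregate rate of about 1500 MB/s; that figure is inferred from the rate/share ratio of the rows and is not stated in the minutes.

```python
# Share of the tape pledge per Tier-1, copied from the table above.
SHARES = {
    "FNAL": 0.43, "CNAF": 0.14, "JINR": 0.10, "KIT": 0.10,
    "RAL": 0.09, "IN2P3": 0.09, "PIC": 0.05,
}

# ASSUMPTION: aggregate rate inferred from the table (rate / share is
# roughly 1500 MB/s in each row); it is not quoted in the minutes.
TOTAL_RATE_MB_S = 1500


def expected_rate(share, total=TOTAL_RATE_MB_S):
    """Expected staging rate in MB/s for a site holding `share` of the pledge."""
    return share * total


for site, share in SHARES.items():
    print(f"{site:6s} {share:4.0%} {expected_rate(share):6.0f} MB/s")
```

Under this assumption the sketch reproduces the table for all sites except FNAL, where it yields 645 MB/s against the quoted 650, presumably a rounding difference.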

LHCb

  • Operations
    • "Run1 Legacy stripping" finished; Data validation
  • VOMRS migration. Some problems under investigation.
  • Other
    • FC Migration from LFC to DFC (DIRAC File Catalog) finished successfully. Small issues should be fixed in the coming patch release.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign is covering 63 sites so far (+9)
      • including almost all of the big sites in EGI, plus BNL in OSG
      • a few sites are being debugged, the rest are OK

RFC proxies

  • there will be a presentation about RFC proxies in next week's GDB
  • all MW should have supported them for a long time already
    • we got that "for free" with the SHA-2 deployment
  • what is the status of RFC proxy usage per experiment?
    • ALICE are switching WLCG VOBOXes at their sites to RFC proxies; should soon be done
    • ATLAS are using RFC proxies to some extent (?)
    • CMS users have been using RFC proxies for months
    • LHCb ?
  • SAM-Nagios needs an easy fix in the proxy renewal sensor code to support RFC proxies
  • what else?
  • the idea is to make RFC proxies the default later this year
  • to create RFC proxies today:
    • voms-proxy-init -rfc .....
    • myproxy-init needs GT_PROXY_MODE=rfc in its environment
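The two recipes above could be wrapped for scripted use roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, only the `-rfc` flag and the `GT_PROXY_MODE=rfc` setting come from the minutes, and any further arguments (elided in the minutes) are left to the caller.

```python
import os
import subprocess


def voms_proxy_init_cmd(extra_args=()):
    """Command line for an RFC proxy; -rfc is the flag quoted in the minutes.

    Further arguments are supplied by the caller and not guessed here.
    """
    return ["voms-proxy-init", "-rfc", *extra_args]


def myproxy_env(base_env=None):
    """Copy of the environment with GT_PROXY_MODE=rfc set for myproxy-init."""
    env = dict(os.environ if base_env is None else base_env)
    env["GT_PROXY_MODE"] = "rfc"
    return env


def run_myproxy_init(extra_args=()):
    """Run myproxy-init in RFC mode; requires the client to be installed."""
    return subprocess.run(
        ["myproxy-init", *extra_args], env=myproxy_env(), check=True
    )
```

Setting the variable only in the child process environment, rather than exporting it globally, keeps the change limited to the one command that needs it.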

Machine/Job Features

Middleware Readiness WG

Multicore Deployment

The first objective of the TF, that is, the understanding of the principles for a successful shared use of the common resources in multicore mode by ATLAS and CMS has been achieved. The initial deployment to the involved sites (T1s for CMS, T1s+T2s for ATLAS) has also been successful and the most popular batch systems' capabilities concerning the use of multicore jobs have been discussed as well.

However, at this point both experiments are working independently on their respective infrastructures. We therefore propose to keep the TF open in "passive mode" while this is ongoing, in order to review the status once both experiments have advanced in their respective models, and in case common matters need to be discussed during this period.

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

  • No updates to report

Network and Transfer Metrics WG

  • WG meeting was held on 18th of February (https://indico.cern.ch/event/372546/)
  • All sites should be running perfSONAR 3.4.1; the final deadline was the 16th of February, 5 sites received tickets (2 of them responded)
  • Follow-up campaign to bring all perfSONARs to the correct configuration ongoing, started with LHCOPN/LHCONE instances, several issues found and reported
  • Testbed established to evaluate/test 3.4.2rc (release candidate), which was released last week. Several issues that we reported during the LHCOPN/LHCONE configuration campaign have been fixed. One new issue found and reported to the development team.
  • New meshes: IPv6/IPv4 dual stack (led by Duncan Rand), Latin America (led by Renato Santana, Pedro Diniz)
  • Testing and evaluation of the pilot instances for esmond/maddash ongoing (psds.grid.iu.edu, psmad.grid.iu.edu)
  • Production instance of the infrastructure monitoring (psomd.grid.iu.edu) updated with new tests that check completeness/freshness of data in the local measurement archives (high level functional test)

  • Integration of the network and transfer metrics: two pilot projects proposed in the last WG meeting
    • LHCb pilot project to provide experiment agnostic prototype to access central datastore (esmond) and publish available metrics to messaging
    • Extending ATLAS FTS performance study to CMS and LHCb

  • Networking degradation between SARA and AGLT2 under investigation - to be followed up at the next WG meeting
    • Original issue noted when many large file transfers SARA->AGLT2 failed. The cause was an FTS timeout, since files of 2-6GB were moving at 10-100s of Kbytes/sec. Problem reported to this working group.
    • perfSONAR regular tests between T2 and T1 had been paused, so manual perfSONAR tests were done, showing poor performance (200-500 Kbytes/sec).
    • Saul Youssef's examination of FTS logs indicated that a possibly problematic trans-Atlantic link was involved. Additional reports of poor performance between CERN EOS and MWT2 involved the same link.
    • Recommended procedure (by LHCONE/LHCOPN working group) is to have either end-site contact their R&E network provider to open a ticket. AGLT2 contacted Internet2 and opened a ticket (ISSUE=2688 PROJ=144)
    • Temporary debug mesh set up to test paths between SARA, CERN and AGLT2, MWT2. See https://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=Debug%20Mesh%20(temp)
    • Internet2 has opened a ticket with GEANT (TT#2015022734000453) and the issue is actively being pursued.
      • Work underway getting suitable intermediate perfSONAR instances onto LHCONE to help localize the issue.
  • Next WG meeting will be on 18th of March (https://indico.cern.ch/event/379017/)
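To see why the SARA->AGLT2 transfers described above ran into timeouts, it helps to work out how long a file takes at the reported rates. The sketch below is plain arithmetic with decimal units (1 GB = 10^6 KB); the function name is hypothetical and the actual FTS timeout value is not given in the minutes, so none is assumed.

```python
def transfer_hours(size_gb, rate_kb_s):
    """Hours needed to move size_gb gigabytes at rate_kb_s kilobytes/second.

    Uses decimal units: 1 GB = 1e6 KB.
    """
    return size_gb * 1e6 / rate_kb_s / 3600


# File sizes and rates spanning the ranges quoted in the minutes.
for size_gb in (2, 6):
    for rate in (10, 100, 500):
        hours = transfer_hours(size_gb, rate)
        print(f"{size_gb} GB at {rate} KB/s: {hours:.1f} h")
```

Even the most favourable case in the original report, a 2 GB file at 100 KB/s, needs about 5.6 hours, so failures on timeout are unsurprising at these rates.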

Action list

  • ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing HTCondor-CE is increasing. For ATLAS, Alessandro announced that, together with the USATLAS experts, a solution has been found to publish the HTCondor CEs in the BDII and OIM in a way that satisfies both ATLAS and SAM needs. It is not a long-term solution, but it should be good for the next six months. Before closing the action, though, we need confirmation from Marian.
    • Ongoing discussions on publication in AGIS for ATLAS.
  • ONGOING on experiment representatives - report on voms-admin test feedback
    • Experiment feedback and feature requests collected in GGUS:110227
  • NEW on Oliver and the experiment representatives - HTTP deployment task force
    • The experiments should state their position on the need for, and the mandate of, the task force

AOB

  • The next meeting is on March 19th.

-- AndreaSciaba - 2015-03-03

Topic revision: r17 - 2015-03-05 - ChristophWissing
 