WLCG Operations Coordination Minutes - March 5th, 2015

Agenda

Attendance

  • local: Andrea Sciabà (minutes), Alessandro Di Girolamo (ATLAS), Andrea Manzi (IT-SDC), Stefan Roiser (LHCb), Maarten Litmaath (ALICE), Prasanth Kothuri (IT-DB), Oliver Keeble (IT-SDC)

  • remote: Pepe Flix (chair), Thomas Hartmann (KIT), Michael Ernst (BNL), David Cameron (ATLAS), Renaud Vernet (IN2P3-CC), Ulf Tigerstedt (NDGF), Vladimir Romanovskiy (LHCb), Alessandra Forti (Manchester), Christoph Wissing (CMS), Maite Barroso (Tier-0), Alessandra Doria (Italian T2s), Antonio Perez Calero Yzquierdo (CMS), Di Qing (TRIUMF), David Mason (FNAL), Jean-Michel Barbet (IN2P3), Rob Quick (OSG), Shawn McKee, Jeremy Coles (GridPP), Gareth Smith (RAL), Catherine Biscarat (France)

Operations News

  • The survey data are being analysed; preliminary results will be shown at the GDB and the final results will be presented at Okinawa. Site confidentiality will of course always be ensured.

Middleware News

  • Baselines:
    • FTS 3.2.32 : Important fixes for Activity Shares; all sites but one have already upgraded
    • dCache 2.6.40, 2.10.18, 2.11.9 : various bug fixes and a possible vulnerability in the FTP doors
      • dCache 2.6.x end of support is 06/2015

  • T0 and T1 services
    • CERN
      • FTS upgraded to v 3.2.32
      • CASTORALICE upgraded to v 2.1.15
    • BNL
      • dCache upgraded to 2.10.20 from 2.6.18
      • Xrootd upgraded to 4.1.1 and dev version
      • FTS upgraded v.3.2.32
    • FNAL
      • StoRM upgraded to 1.11.7 for ATLAS
      • XRootd upgraded to 4.1.1 for ATLAS
    • IN2P3
      • dCache upgraded to 2.10.18
      • Xrootd upgraded to 4.1.1 for FAX
    • JINR-T1
      • FTS upgraded to v 3.2.32
    • NDGF
      • dCache upgraded to 2.12.0 (early adopter; not yet officially released)
    • PIC
      • dCache planned to be upgraded to 2.10.20 (or latest available) on the 10th of March
    • RAL
      • FTS upgraded v.3.2.32

  • Tarball WN and UI status and maintenance, regular installation in CVMFS
    • We would like to install the latest versions of the tarball UI and WN in CVMFS
    • The status of the support is unclear, as is who is working on assembling the tarballs
    • IN2P3 is interested in testing new versions of the tarball when available

After the meeting Jeremy sent a clarification on the tarball WN version in CVMFS. Matt Doidge, in a UK meeting this morning, said:

The latest version of the WN hadn't made its way into CVMFS yet - it used to be that the CVMFS developers unpacked the tarball into the grid.cern.ch CVMFS repository for me but recently Jakob gave me access to the repository so I can do it all in-house. I was going to wait until after this week's workshop before trying to upload anything though.

Therefore, a new version should be uploaded within the next week.

Tier 0 News

  • LHCb LFC migration: Done on Monday.
    • the only LFC instance left is the "shared" one; its usage will be monitored so that it can be retired at the same time as the LHCb one, after which the LFC service at CERN will be fully decommissioned
  • VOMRS decommissioning and replacement by VOMS-admin: done on Monday March 2nd.
    • The intervention went reasonably well; a few certificates went missing during the migration. Most of them have been recovered, but this is still work in progress.
    • The experiments are still adapting to the new interface, so they report many issues that we need to explain to them.
    • Overall, things look OK and the situation is improving every day.
  • FTS3 upgraded to 3.2.32-1 on March 3rd

Stefan added that the LFC-to-DIRAC migration was complex but went very well and asked to keep the LFC around for a few months to make 100% sure that everything is fine; Maite agreed to it.

Maarten asked Maite to contact EGI operations to understand their needs about the "shared" LFC.

Alessandro suggested providing user documentation for VOMS-Admin covering the most important use cases for the LHC VOs.

Tier 1 Feedback

IN2P3-CC

  • The downtime at IN2P3-CC scheduled last Tuesday had to be extended to this morning because the batch system was not working correctly and had to be rolled back.
  • The dCache PostgreSQL database will have to be upgraded to solve a problem with the space manager; this will happen next week, the downtime will last half a day and the experiments will be contacted.

Tier 2 Feedback

Experiments Reports

ALICE

  • high activity
  • successful T1-T2 Workshop in Torino, Feb 23-25
    • thanks to the organizers and the participants!
    • many useful presentations by and/or for ALICE site admins!
  • VOMRS to VOMS-Admin migration
    • ALICE can work, but some issues were observed and reported in the long ticket (GGUS:110227)
    • we thank Alberto and Andrea C. for their continued efforts in this matter!

ATLAS

  • Cosmic data taking (M8) ongoing smoothly.
  • P1 to EOS and CASTOR tested. Various interactions took place; the situation is now good.
  • MC15 has just started: simulation first, then digitisation+reconstruction. 4B events (1 PB of total disk space), approximately 2 months of workload for the whole grid.
    • ATLAS considers this the beginning of Run 2. Sites, please plan your maintenance carefully.
  • Networking: issues were experienced with a transatlantic connection. It is useful to have such information propagated to the experiments; this is being discussed with the Network and Transfer Metrics WG.

Pepe asked how ATLAS discovered the network issues: Alessandro answered that it was from the transfer performance monitoring.

CMS

  • Ongoing activities
    • Cosmics with main magnet off (CRUZET)
    • Mainly production of Upgrade MC
    • Moderate load in the system
  • Tape staging test at Tier-1 sites
    • Finished successfully: CNAF, PIC, CCIN2P3, RAL
    • Almost finished: FNAL
    • KIT: Used data set for the test was on very old equipment
      • Had some issues
      • Will be repeated with data on recent equipment
  • VOMRS Migration
    • Feature to delegate approval to national/regional representatives not yet active (configured)
  • Migration to a single global Condor pool for Analysis and Production done
    • Tier-2 sites basically stopped receiving jobs with the VOMS role production
    • 80% of the fair share should be allocated to the VOMS role pilot (which will serve analysis and production)
    • 10% for VOMS role production for some legacy
    • 10% for any other roles/groups
    • Updated CMS Policies twiki and VO card accordingly
    • No changes in the Tier-1 fair share configuration yet!

The CMS tape staging test was discussed in the last meeting. Here are some numbers; one should note though:

  • The test was done with reading; the same performance is assumed for writing
  • Each site's fraction roughly corresponds to its fraction of the WLCG tape pledge
    • The actual distribution depends on the assignment of primary datasets
  • No big interference with other VOs is assumed
  • CMS plans only moderate staging from tape during data taking

| Site  | Pledged Tape (PB) | Share | Expected Rate (MB/s) |
| FNAL  | 30  | 43% | 650 |
| CNAF  | 10  | 14% | 210 |
| JINR  | 5   | 10% | 150 |
| KIT   | 7.5 | 10% | 150 |
| RAL   | 6   |  9% | 135 |
| IN2P3 | 5.6 |  9% | 135 |
| PIC   | 3.8 |  5% |  75 |

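For reference, the "Expected Rate" column is each site's share applied to an aggregate staging target; the numbers above are consistent with a total of roughly 1500 MB/s, which is an inference from the table rather than a figure stated in the minutes. A minimal sketch of the arithmetic (the 1500 MB/s target is an assumption):

    #!/bin/bash
    # Hypothetical cross-check of the table above: expected rate = share * aggregate target.
    # The 1500 MB/s aggregate is inferred from the table (e.g. 0.14 * 1500 = 210 MB/s for
    # CNAF), not a number stated in the minutes.
    total_rate=1500   # MB/s, assumed aggregate staging target
    for entry in FNAL:0.43 CNAF:0.14 JINR:0.10 KIT:0.10 RAL:0.09 IN2P3:0.09 PIC:0.05; do
        site=${entry%%:*}
        share=${entry##*:}
        awk -v n="$site" -v s="$share" -v t="$total_rate" \
            'BEGIN { printf "%-6s %4.0f MB/s\n", n, s*t }'
    done

This reproduces the column to within rounding (FNAL comes out at 645 MB/s rather than 650).
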
Pepe asked if CMS could add the observed rates to the table; Christoph will consider it. In any case, so far the sites have met their targets.

LHCb

  • Operations
    • "Run1 Legacy stripping" finished; Data validation
  • VOMRS migration. Some problems under investigation.
  • Other
    • FC migration from the LFC to the DFC (DIRAC File Catalog) finished successfully. Small issues should be fixed in a coming patch release.

Feedback on proposal for an HTTP deployment task force.

Stefan announced LHCb's support for the initiative and that a member of LHCb has been appointed to participate.

Christoph said that the initiative is interesting for CMS, though CMS's needs are not as advanced as those of the other experiments. He thinks that a working group would be more appropriate. Nobody has been appointed yet, but a candidate exists who should be able to start working on it from June.

Maarten said ALICE are not interested.

Alessandro said that ATLAS is already contributing in practice to this kind of activity and the person involved will continue. If needed, more names can be found; ATLAS supports the initiative.

Michael and Pepe expressed the interest of BNL and PIC in participating. Oliver clarified that there should be participation from sites, as the verification of the HTTP setup is a very typical question.

Given the overall positive feedback, Oliver will start the process for the HTTP deployment task force creation and will prepare the mandate, the goals, and a mailing list.

The general agreement is to have a task force rather than a working group to be able to work with what exists today in a reasonably short time scale.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • gLExec in PanDA:
    • testing campaign is covering 63 sites so far (+9)
      • including almost all of the big sites in EGI, plus BNL in OSG
      • a few sites are being debugged, the rest are OK

Even though there is no strong deadline, it would be good to finish the testing campaign by spring.

RFC proxies

  • there will be a presentation about RFC proxies in next week's GDB
  • all MW should have supported them for a long time already
    • we got that "for free" with the SHA-2 deployment
  • what is the status of RFC proxy usage per experiment?
    • ALICE are switching WLCG VOBOXes at their sites to RFC proxies; should soon be done
    • ATLAS are using RFC proxies to some extent (?)
    • CMS users have been using RFC proxies for months
    • LHCb ?
  • SAM-Nagios needs an easy fix in the proxy renewal sensor code to support RFC proxies
  • what else?
  • the idea is to make RFC proxies the default later this year
  • to create RFC proxies today:
    • voms-proxy-init -rfc .....
    • myproxy-init needs GT_PROXY_MODE=rfc in its environment
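
As a minimal illustration of the two items above (assuming the standard VOMS and MyProxy command-line clients; the VO name "atlas" and the myproxy-init options are only examples, not prescriptions from the meeting):

    # request an RFC 3820 proxy with VOMS attributes
    voms-proxy-init -voms atlas -rfc

    # make myproxy-init generate an RFC proxy as well
    export GT_PROXY_MODE=rfc
    myproxy-init -d -n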

Alessandro will check if the issues which existed in the past and made ATLAS keep using legacy proxies are still there. Similarly, Stefan will check whether DIRAC is compatible with RFC proxies.

Maarten will follow up with the experiments; his goal would be to be able to say by the end of the year that WLCG does not need legacy proxies at all.

Machine/Job Features

Middleware Readiness WG

  • Verified Middleware since the previous meeting:
    • StoRM 1.11.6 for ATLAS
    • dCache 2.11.8 for ATLAS
    • DPM-Xrootd 3.5.2 both for ATLAS and CMS
  • Ongoing verifications:
    • CREAM-CE 1.16.5 for CMS
    • StoRM 1.11.7 for ATLAS
    • dCache 2.10.18 for ATLAS (possible issue found)
    • dCache 2.12.0 for ATLAS
  • Tests being set up:
    • ARC-CE 5.0.0 (under release) for CMS
  • Remember our next meeting, Wed March 18th at 4pm CET. Please check the Action List.

Pepe asked if xrootd4 is validated, given that some experiments (like ATLAS) are pushing for its deployment. Andrea M. replied that this is the case for DPM, but not for the standalone server or for other SE implementations. Alessandro clarified that ATLAS has not set a strong deadline and it is not mandatory, but the improvements are useful. Maarten also mentioned IPv6 support as an important new feature.

If experiments feel that xrootd4 is validated in their experience, they should communicate it to the MW Readiness group. ALICE has been encouraging its sites to install it since last year, and ALICE sites should schedule the upgrade.

Andrea M. agreed to add xrootd to the baseline table and will set version 4 as the baseline once enough evidence has been collected that it is good enough; this should happen within a few weeks.

Christoph asked if, in case a site deployed xrootd4 on a certain SE with success, it should report it: the answer is positive, as it is a legitimate way to validate it.

Multicore Deployment

The first objective of the TF, that is, understanding the principles for a successful shared use of the common resources in multicore mode by ATLAS and CMS, has been achieved. The initial deployment to the involved sites (T1s for CMS, T1s+T2s for ATLAS) has also been successful, and the capabilities of the most popular batch systems concerning multicore jobs have been discussed as well.

However, at this point both experiments are working independently on their respective infrastructures. We therefore propose to keep the TF open in "passive mode" while this is ongoing, in order to review the status once both experiments have advanced in their respective models, and in case common matters need to be discussed during this period.

Pepe asked if CMS expressed any interest in the CE parameter passing mechanism; Alessandra replied that so far it did not happen, so it should be considered an ATLAS-only request.

IPv6 Validation and Deployment TF

Andrea S. mentioned that the setup of the FTS3 IPv6 testbed is progressing and the CERN CVMFS Stratum1 is working fine in dual-stack.

Squid Monitoring and HTTP Proxy Discovery TFs

  • No updates to report

Network and Transfer Metrics WG

  • WG meeting was held on 18th of February (https://indico.cern.ch/event/372546/)
  • All sites should be running perfSONAR 3.4.1; the final deadline was the 16th of February; 5 sites received tickets (2 of them have responded)
  • Follow-up campaign to bring all perfSONAR instances to the correct configuration is ongoing; it started with the LHCOPN/LHCONE instances, where several issues were found and reported
  • A testbed has been established to evaluate/test 3.4.2rc (release candidate), which was released last week. Several issues we reported during the LHCOPN/LHCONE configuration campaign have been fixed. One new issue was found and reported to the development team.
  • New meshes: IPv6/IPv4 dual stack (led by Duncan Rand), Latin America (led by Renato Santana and Pedro Diniz)
  • Testing and evaluation of the pilot instances for esmond/maddash ongoing (psds.grid.iu.edu, psmad.grid.iu.edu)
  • Production instance of the infrastructure monitoring (psomd.grid.iu.edu) updated with new tests that check completeness/freshness of data in the local measurement archives (high level functional test)

  • Integration of the network and transfer metrics: two pilot projects proposed in the last WG meeting
  • LHCb pilot project to provide experiment agnostic prototype to access central datastore (esmond) and publish available metrics to messaging
  • Extending ATLAS FTS performance study to CMS and LHCb

  • Networking degradation between SARA and AGLT2 under investigation - to be followed up at the next WG meeting
    • The original issue was noted when many large file transfers SARA->AGLT2 failed. The cause was an FTS timeout, since files of 2-6 GB were moving at 10s-100s of kB/s. The problem was reported to this working group.
    • Regular perfSONAR tests between the T2 and the T1 had been paused, so manual perfSONAR tests were run, showing poor performance (200-500 kB/s).
    • Saul Youssef's examination of the FTS logs indicated that a possibly problematic trans-Atlantic link was involved. Additional reports of poor performance between CERN EOS and MWT2 involved the same link.
    • The recommended procedure (from the LHCONE/LHCOPN working group) is for either end site to contact its R&E network provider to open a ticket. AGLT2 contacted Internet2 and opened a ticket (ISSUE=2688 PROJ=144)
    • A temporary debug mesh was set up to test the paths between SARA, CERN and AGLT2, MWT2. See https://maddash.aglt2.org/maddash-webui/index.cgi?dashboard=Debug%20Mesh%20(temp)
    • Internet2 has opened a ticket with GEANT (TT#2015022734000453) and the issue is being actively pursued.
      • Work underway getting suitable intermediate perfSONAR instances onto LHCONE to help localize the issue.
  • Next WG meeting will be on 18th of March (https://indico.cern.ch/event/379017/)

Alessandro asked how to find out that there are known issues with the network; it is agreed that a mailing list with a web archive is an acceptable solution (it might be the WG mailing list).

Shawn will invite Edoardo Martelli to join the WG (in general, an expert for LHCOPN and LHCONE).

It is agreed that the WG will internally discuss a sustainable solution to the problem of how to declare existing issues, or announce actions (e.g. blocking traffic to a problematic site); even if the example given did not happen in reality, procedures should be foreseen before problems arise.

Action list

  • ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). The number of CMS sites publishing HTCondor-CE is increasing. For ATLAS, Alessandro announced that with the USATLAS experts a solution has been found to publish the HTCondor CEs in the BDII and OIM in a way that satisfies both ATLAS and SAM needs. It is not a long-term solution, but it should be good for the next six months. Before closing the action, though, we need a confirmation from Marian.
    • Ongoing discussions on publication in AGIS for ATLAS.
  • CLOSED on experiment representatives - report on voms-admin test feedback
    • Experiment feedback and feature requests collected in GGUS:110227
  • CLOSED on Oliver and the experiment representatives - HTTP deployment task force
    • The experiments should give their position on the need for and the mandate of the task force

AOB

  • The next meeting is on March 19th.

-- AndreaSciaba - 2015-03-03
