WLCG Operations Coordination Minutes, April 7th 2016

Highlights

Sites should review the WLCG scope tags for their services in GOCDB
New version of Argus fixing instability issues is available for CentOS/EL7 in the EGI preview repository (not yet ready for release)
Experiments have summarised their use of Accounting Portal and Accounting Reports and reported about existing issues with Accounting. This summary will be presented next week at the pre-GDB.
Next Ops Coord meeting on 28th April will be dedicated to Networking and Metrics WG review

Agenda

https://indico.cern.ch/event/514077/

Attendance

local: Maria Alandes (minutes), Maarten Litmaath, Stefan Roiser, Oliver Keeble, Jerome Belleman, Andrea Manzi, Julia Andreeva, Ale Di Girolamo
remote: Alessandra Forti (chair), David Bouvet, Di Qing, Eric Lancon, Javier Sanchez, Stefano Belforte, Thomas Hartmann, Ulf Tigerstedt, Gareth Smith, Alessandra Doria, Alessandro Cavalli, Kyle Gross, Xavier Mol, Stuart Pullinger, Philippe Charpentier, Nurcan Ozturk, Stephan Lammel, Felix Lee
apologies: Catherine Biscarat, Vincenzo Spinoso

Operations News

Next week's pre-GDB is on accounting. Today's meeting is dedicated to collecting feedback for the discussion that will be held at the pre-GDB.
Sites should have received a broadcast asking them to review the new WLCG tags in the GOCDB portal https://goc.egi.eu/portal
Reminder: reports still have to be written, but they will no longer be read out loud. There will be a round table for comments or questions, which should not exceed 30 minutes. Most of the meeting should be dedicated to the main topic, which today is accounting. For this to work, people should write the reports in advance and give fellow WOC'ers a chance to parse them.

Maarten explains that WLCG scope tags were bootstrapped by WLCG experiment liaisons who gave the list of sites per LHC VO. Sites can modify the tags associated to their services, but not the tags associated to the sites. Maria adds that it is planned to validate WLCG scope tags using the experiment VOfeeds. Stefan explains that LHCb VOfeeds do not include T3s, so this needs to be complemented with something else. Maria explains that experiments will be contacted in any case when this validation takes place, so this will be followed up offline.
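The planned validation could be sketched as a set comparison between the sites that carry a VO's scope tag in GOCDB and the sites listed in that VO's feed. This is only an illustration: the function name, the site names and the input shapes are assumptions, not the actual GOCDB or VOfeed formats.

```python
def compare_scope_tags(tagged_sites, vofeed_sites):
    """Compare sites tagged for a VO in GOCDB with sites in the VO feed.

    Both arguments are plain collections of site names (an assumption;
    the real GOCDB and VOfeed data would have to be parsed first).
    """
    tagged = set(tagged_sites)
    in_feed = set(vofeed_sites)
    return {
        # Tagged in GOCDB but absent from the feed (e.g. T3s for LHCb)
        "tagged_not_in_feed": sorted(tagged - in_feed),
        # Listed in the feed but missing the scope tag in GOCDB
        "in_feed_not_tagged": sorted(in_feed - tagged),
    }


# Hypothetical site names, for illustration only
result = compare_scope_tags(
    tagged_sites=["SITE-T1", "SITE-T2", "SITE-T3"],
    vofeed_sites=["SITE-T1", "SITE-T2", "SITE-NEW"],
)
print(result)
```

Sites reported as tagged but not in the feed are not necessarily wrong; as noted for LHCb T3s, they would need to be checked against another source.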

Middleware News

Useful Links:
Baselines/News:
- EGI preview repo update for CentOS7 https://wiki.egi.eu/wiki/Preview_2.0.0. It includes a new version of Argus (1.7.0), which fixes some instability problems. This version is under testing at CERN.
Issues:
- Big issues last week affecting ATLAS, CMS and LHCb due to a CERN VOMS problem
  - CMS (ASO), ATLAS and LHCb FTS jobs submitted via REST were assigned to the wrong VO because the VOMS extensions could not be extracted from the proxies. The issue was fixed by re-delegating the proxies, plus manual DB changes by the FTS admins.
  - CMS PhEDEx transfers could not be submitted to FTS. This was solved by a WLCG broadcast asking sites to update their proxies.
  - ATLAS pilot job submission to CREAM CEs failed; a proxy update was needed.
  - All issues have been analysed and 2 GGUS tickets have been opened for the VOMS and gridsite developers (GGUS:120463 and GGUS:120618). In short, the VOMS proxies were generated with a VOMS AC whose validity extended beyond the expiration of the VOMS server certificate. This caused gridsite-based services (FTS REST and CREAM) not to recognise that AC, and other services (FTS SOAP) to stop accepting those proxies once the VOMS server certificate had expired.
  - The service managers are implementing an automatic method for renewal of the special host certificates needed by a few grid services (VOMS, Argus).
    - Those certificates have the service alias instead of the hostname in the subject DN.
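The condition behind the VOMS incident above, an AC that remains valid after the certificate that signed it has expired, can be expressed as a simple date comparison. The sketch below is illustrative only: it does not parse real proxies or certificates, and the function name and timestamps are hypothetical.

```python
from datetime import datetime, timezone


def ac_outlives_certificate(ac_not_after, cert_not_after):
    """Return True if the AC is valid beyond the signing certificate's expiry."""
    return ac_not_after > cert_not_after


# Hypothetical timestamps, for illustration only
cert_expiry = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
ac_expiry = datetime(2020, 1, 3, 12, 0, tzinfo=timezone.utc)

if ac_outlives_certificate(ac_expiry, cert_expiry):
    print("AC validity extends past the server certificate expiry")
```

A check of this kind could in principle be applied when an AC is issued or when a proxy is accepted; where it belongs is a question for the VOMS and gridsite developers.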
T0 and T1 services
- CERN
  - CASTOR v2.1.16 / SRM 2.1.16 for all LHC instances
- JINR-T1
  - dCache upgraded to v 2.10.58.
- NDGF
  - dCache upgraded to v 2.15.3, possible upgrade to 2.15.4 next week
- NT-T1
  - SURFsara did a major dCache Upgrade to 2.13.29 and Postgres to 9.5 (which should fix a deadlock issue: spacemanager/internal timeout from user point of view)
  - SURFsara will move all grid hardware to a new datacenter in October 2016. We expect to be down for 2 weeks. Single copy disk data may be at risk
- RAL
  - All CASTOR disk servers on SL6. Started writing Atlas data to T10KD drives. Update SRMs to version 2.14 planned.
  - Migration to having a load balancer (HAProxy, Keepalived) in front of FTS3 in progress

Tier 0 News

CERN's pledges for 2016 are available to the experiments. The additional request for CPU resources for the ATLAS Tier-0 facility has been addressed by ensuring better efficiency.
In order to better understand losses of efficiency, we will soon run profiling tools on a small subset of jobs in CERN's batch farm, which may cause these jobs to run somewhat slower.
The forthcoming installations (~May) will be made available with SMT enabled.

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

High activity on average
- New record 99376 running jobs on March 29
CASTOR
- The disk pools are being prepared for data-taking mode and subsequent tests
- HW plans for summer will be discussed shortly
EOS-ALICE: new head nodes with 512 GB memory were included transparently
- That should prevent further out-of-memory incidents, thanks!
VOMS-Admin
- All the stale duplicate user records have been removed
- All experiments contributed to a set of RFEs being collected by ATLAS

ATLAS

Large scale MC15c Monte-Carlo reconstruction campaign is ongoing.
Heavy Ion bulk processing is being planned. The target is to use 20k slots and complete production by the end of May. High-memory slots (24-32 GB) are needed, and these resources are limited.
Upgrade related production is running on single-core.
Production was not at full capacity during Easter, running at ~60%-70% due to a few operational issues: FTS server issue at BNL, VOMS server certificate issue (services fail to extract FQANs from the proxy), INFN-T1 downtime, aCT issue due to problems on the underlying AFS volume.
The T0 export test from EOS to CASTOR yesterday was successful: the nominal rate is 1 GB/s and 3.5 GB/s was reached.

CMS

General Status
- CMS Detector ready for 2016 running
- First splashes/collisions observed in late March
MC Production
- The Spring16 DigiReco production with latest CMSSW 80x has started
- more than 2.5B evts requested, and growing. Progressing well, first couple of million events produced
- campaign started in single-core mode, enabling Tier-2 sites for multicore pilots in progress
PhEDEx update: almost all Tier2 moved to Phedex 4.1.7 (recommended version)
- Sites not running the required version yet received a GGUS ticket
SAM and dashboard issues observed in the last month
SAM test infrastructure will switch from Nagios to ETF by mid April
Recent authentication problem, manifested in two ways:
- ASO not communicating with FTS (March 28th, GGUS:120454), needed restarting of proxy on ASO services
- PhEDEx at sites not communicating with FTS (April 1st, GGUS:120536), needed broadcast to sites to renew proxies

LHCb

High activity with ~70K jobs running last two weeks
Incremental Stripping stopped due to its large CPU consumption
Turbo(Cal) reprocessing is ongoing
Productions submitted to recover lost files at CNAF
MC Sim09 final tests are almost ready

Ongoing Task Forces and Working Groups

HTTP Deployment TF

Meeting on 23rd March - https://indico.cern.ch/event/509854 (minutes attached to agenda)
- Decided to conclude the TF once ETF has moved to production and the monitoring falls under experiment responsibility.
  - Will continue to track and ticket sites in the meantime
  - A couple of updates to the Nagios probe are also scheduled
- No further iterations on the TF documentation, apart from cleanup for final publication
- No support from the TF on the proposal to extend the storage provider collaboration any further

Alessandra asks about the meaning of the last point and how definitive it is. Julia says that WLCG Operations believes that a storage provider collaboration is needed. Oliver says that there has been no expression of interest within the TF from the storage providers, although experiment representatives also believe that this is a good idea. Julia proposes to think of a mandate of the WG and present this to the MB. Oliver suggests to discuss this mandate with the storage providers first. It is agreed to organise a dedicated meeting with storage providers to discuss this.

Information System Evolution

The targeted date for OSG BDII decommissioning is March 31st, 2017. It may remain available in an unsupported manner beyond that date, but as things stand that will be the last day it is run as a production OSG operational service.
An IS TF meeting took place on 31st March:
- Medium term plans were discussed
- A feasibility study to stop dependencies on the BDII is being carried out.
  - A first test with pic and storage was carried out in the past weeks and this shows SAM dependencies on SAM OPS tests that need to be discussed with EGI.
  - Discussions with Marian to understand current BDII dependencies on VOfeed generation.
- Discussions with GOCDB/OIM developers to understand how to add more static information
- WLCG scope tags in GOCDB need to be validated

IPv6 Validation and Deployment TF

Machine/Job Features TF

Implementations development: https://twiki.cern.ch/twiki/bin/view/LCG/MachineJobFeaturesImplementations
PBS/Torque implementation done and deployed at Manchester, Cambridge, and PIC. Many thanks for useful feedback.
HTCondor implementation still being worked on.

Middleware Readiness WG

Tier-1s were invited, at the MW Readiness and Ops Coord meetings of 16 and 17 March, to tell the e-group wlcg-ops-coord-wg-middleware at cern.ch whether they agree to install the pakiti client on their production service nodes, so that the MW versions run at the site are known to authorised DNs (site managers taken from GOCDB and expert operations supporters). The developer, Lionel Cons, stops further work on the tool. Site replies on their intention to expand the use of the pakiti client can be found here. Maria D. will put 2 reminders to sites in the twikis of the April 7th and 28th Ops Coord meetings, as part of the MW Readiness WG report. All the input received will be concatenated and attached to the MW Readiness WG agenda of May 18th. Then we'll conclude on the issues.

Network and Transfer Metrics WG

WLCG Network Throughput SU:
- CBPF connectivity (https://ggus.eu/index.php?mode=ticket_info&ticket_id=120081) - resolved
- ASGC connectivity (https://ggus.eu/index.php?mode=ticket_info&ticket_id=119820) - ongoing
- Packet loss and high latency for certain packets (queuing issue ?) reported by perfSONAR on ASGC to CERN, but not confirmed by the counters
- Narrowed down to the StarLight to ASGC segment, but unfortunately there are very few sonars in Asia with very limited peering, which will impact further investigation
- Throughput tests show peaks of 400Mbit/s (200Mbit/s usual) with frequent retransmissions occurring in bunches, we'll try to run tcpdump to understand the root cause
Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard)
Throughput meeting held on 6th April:
- Jason Zurawski from ESNet presented the art of debugging network issues with perfSONAR
- Next meeting is 14th of April (https://indico.cern.ch/event/517373/)
Update on WG will be presented at HEPiX in DESY
WG review will take place at the next WLCG ops coordination on 28th April
Added section with useful links to the WG homepage https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics

RFC proxies

Waiting for the transition to ETF, which will use RFC proxies for all SAM tests

Squid Monitoring and HTTP Proxy Discovery TFs

Nothing to report

Accounting Review

Introduction

Maria explains that at next week's pre-GDB, WLCG Operations will present a summary of the feedback provided by experiments today. WLCG Operations also has several concerns related to accounting, such as:

Relationship of accounting and benchmarking. Very little progress of the benchmarking WG
Multicore accounting in the development portal to be released to production ASAP
Clear definition of capacities and of what is needed by experiments (Installed vs Available vs Pledges)

During the session, many things were discussed. In the next sections there is a summary of the input presented by experiments. Some highlights of the discussion are also included in the list below:

It is important to have the same information everywhere. Accounting portal and accounting reports show different information, since accounting reports in REBUS currently let T1s modify accounting portal numbers manually.
Many experiments are requesting pledges per site and not per federation (this has also been discussed at the IS TF and was already brought to the MB in September 2015, where it was agreed to give these values a different name than pledges, and to come up with a proposal to the MB on how to collect them).
Since the IS TF is working on a prototype for a new WLCG IS, most of the feedback given for REBUS concerns features that are planned for the new IS. If the new IS is approved, it is very likely that REBUS will not be updated with new features. In any case, the feedback given is useful for one service or the other.
Most of the use of accounting portal/reports done by experiments is to generate CRSG reports twice per year. Sites make a more regular use of the information and also generate reports that are shown to the funding agencies.

ALICE

For more details, please check slides

Use of Accounting Portal? NO
Use of Accounting Reports? YES
Multicore Accounting: No multicore in ALICE
Opportunistic resources Accounting: Not interested in doing this centrally, currently done in Monalisa
Space Accounting: Direct accounting for xrootd/EOS in Alien. Indirect accounting for DPM and dCache
REBUS:
- Used for pledges and VO requirements
- Pledges for T2s per sites and not per federation
- Include non WLCG resources

ATLAS

For more details, please check slides

Use of Accounting Portal? YES, to generate the CRSG reports, but this is very painful: the portal is not easy to use (as shown by Alessandro), the numbers are incorrect, and ATLAS eventually uses its own dashboard numbers
Use of Accounting Reports? NO
Multicore Accounting: YES but numbers from the portal can't be trusted
Opportunistic resources Accounting: YES if WLCG overview is needed but ATLAS not particularly interested in central accounting for this
Space Accounting: gfal clients
REBUS:
- Used for pledges and core power
- Nice to have available capacities
- Include opportunistic resources

CMS

For more details, please check slides

Disclaimer: Note that the info below is based on Maria's notes taken during the CMS Offline and Computing Week that is still going on. For this reason CMS hasn't had the time to process all this information yet.

Use of Accounting Portal? YES to prepare CRSG reports. However numbers should be validated wrt CMS dashboard since there are differences.
Use of Accounting Reports? NO
Multicore Accounting: T1s multicore numbers better with development portal that should be moved to production asap.
Opportunistic resources Accounting: YES, all CMS usage should be reported in Acc portal
Space Accounting: Special presentation with all details during pre-GDB
REBUS:
- Used for pledges
- Nice to have available capacities
- New feature: notifications if information changes

LHCb

For more details, please check slides

Use of Accounting Portal? YES to create CRSG reports
Use of Accounting Reports? NO
Multicore Accounting: deployment of multicore is in progress
Opportunistic resources Accounting: DIRAC is used, and this is also reported to CRSG
Space Accounting: srm
REBUS:
- Used for pledged storage
- Not clear whether LHCb would like pledges per T2s instead of federation (Stefan should clarify this)

Sites

Accounting portal

The feedback for the old portal (below) refers to the current portal. However, the link to the new, redesigned portal [1] was distributed yesterday, and I assume comments on the current portal are now obsolete. I've opened a ticket in the EGI RT for the new portal with first impressions [2].
According to the announcement, the new portal should be released at the end of April, but it wasn't circulated beforehand and I don't think it covers WLCG needs yet. In particular, the HS06 pages hang and, most importantly, it doesn't seem to multiply the walltime by the number of cores (i.e. multicore support is not there).
An accounting TF is recommended to review the new portal, make a list of requirements and interact with EGI.

Feedback for the old portal:

Discrepancies among different views, for example EMI3(WLCG) vs EGI. Multicore support has been added to the EMI3(WLCG), T1 and T2 views but not to the others, making it really difficult to navigate from one view to another.
- Efficiencies seem displaced between EGI and EMI3(WLCG). It might just be the order of the columns, but wrong efficiencies appear in different columns. Other views may have the same problem.
Difficult to relate the numbers to the site and VO manager views (where detailed user usage is displayed); fewer selection possibilities
API (csv_export.php) not well documented, difficult to check numbers programmatically
- Need to create the view first and manipulate the URL after.
- REST API mentioned in the past but not there
Probably a lot of this would be solved by the (old) development portal but also the (old) development portal shows discrepancies
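The workflow described for the API above (create a view in the portal, then manipulate its URL) can be sketched with the Python standard library. The host and the parameter names below are hypothetical: the real csv_export.php parameters are not documented in these minutes, so this only shows the URL manipulation itself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_params(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical host and parameter names, for illustration only
view_url = "https://portal.example.org/csv_export.php?site=SITE-A&month=3"
print(with_query_params(view_url, month="4"))
```

A documented REST API, as mentioned in the feedback, would remove the need for this kind of URL rewriting.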

[1] https://accounting-devel-next.egi.cesga.es/
[2] https://rt.egi.eu/rt/Ticket/Display.html?id=10875&results=8e3a964f1d3def64d8bd502d39f79c17

Accounting Reports

Already discussed at length after the WLCG workshop. A presentation was given at the February MB [1] requesting
- To use walltime and compare walltime with walltime. At the moment cputime is still used and cpu efficiency factors were removed 2 years ago
- In addition T1 and T2 monthly reports show the CPU accounting values (both CPUtime and WALLtime) using the same units: HEPSPEC06-days
An accounting TF was recommended for other topics listed in the presentation conclusions.
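To illustrate what reporting CPU time and walltime in the same units means, here is a minimal conversion to HEPSPEC06-days. It assumes the usual normalisation (time multiplied by the HS06 rating of the core, divided by the seconds in a day); the function name and the job values are hypothetical and not taken from the reports.

```python
SECONDS_PER_DAY = 86400


def to_hs06_days(core_seconds, hs06_per_core):
    """Convert a time in core-seconds to HEPSPEC06-days."""
    return core_seconds * hs06_per_core / SECONDS_PER_DAY


# Hypothetical single-core job on a core rated at 10 HS06
cpu_hs06_days = to_hs06_days(core_seconds=43200, hs06_per_core=10.0)
wall_hs06_days = to_hs06_days(core_seconds=86400, hs06_per_core=10.0)
print(cpu_hs06_days, wall_hs06_days)
```

With both values in HEPSPEC06-days, CPU time and walltime figures from different sites can be compared directly.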

[1] https://indico.cern.ch/event/459359/contribution/4/attachments/1229392/1801384/20160216_Accounting_Tier1s.pdf

Multicore accounting

All sites - bar a handful - are accounting for multicore walltime. The only remaining problem is displaying it correctly in all the views of the accounting portal. As noted under the accounting portal item, the new portal has not taken this into consideration either.
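Multicore walltime here means the elapsed walltime multiplied by the number of cores allocated to the job, as mentioned in the portal feedback. A minimal sketch, with hypothetical job records:

```python
def multicore_walltime(wall_seconds, cores):
    """Walltime scaled by the number of cores allocated to the job."""
    return wall_seconds * cores


# Hypothetical jobs: (elapsed wall seconds, allocated cores)
jobs = [(3600, 1), (3600, 8)]
total = sum(multicore_walltime(wall, cores) for wall, cores in jobs)
print(total)  # 32400
```

A view that sums the raw walltime instead would report 7200 seconds for the same two jobs, which is the kind of discrepancy between views described above.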

Opportunistic resources

Virtualised WN can report in a similar way as the normal WNs
Non EGI cloud resources and HPC resources support is not there and needs agreement on how to benchmark them. This is another point that needs a TF to agree on how to proceed. It can be the already existing benchmark TF or a new accounting TF or both in cooperation.

Space accounting

Storage accounting is still in the early stages of deployment, and has adopted a schema from cloud resource accounting. It is not clear what the requirements of the accounting are, and more engagement with the TF is recommended.

REBUS

Reports pledges correctly as they are inserted manually
Reported capacity depends very much on what sites publish in the BDII and it is not always correct nor validated by sites.
- There is a discussion about replacing "installed capacity" with "available capacity", which is considered more accurate, and getting the sys admins to validate it manually every month. This may be OK for T1s; for T2s it needs to be thought out IMO. It is also not clear whether this requirement comes from experiment ops teams, which may want this information in advance to plan production campaigns, or whether it is an accounting requirement, in which case it is similar to validating afterwards the numbers in the reports circulated by the WLCG office, which are generated from REBUS.

In general there are many loose threads that should be discussed in a forum, and an accounting TF is recommended to go through them.

Action list

Creation date	Description	Responsible	Status	Comments
17.03.2016	The Operations' team to check with ALICE, ATLAS and LHCb whether the gLExec monitoring can be stopped.	Maarten	Ongoing

Specific actions for experiments

Creation date	Description	Affected VO	Affected TF	Comments	Deadline	Completion

Specific actions for sites

Creation date	Description	Affected VO	Affected TF	Comments	Deadline	Completion
17.03.2016	The Operations' team to follow-up ASGC's progress on the networking issues experienced there and report progress at the Ops Coord meeting	ATLAS	network metrics

AOB

-- MariaALANDESPRADILLO - 2016-04-05

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
pdf	2016_April_ATLAS-and-Accounting.pdf	r1	manage	1675.9 K	2016-04-07 - 16:05	AleDiGGi

Topic revision: r25 - 2018-02-28 - MaartenLitmaath
