WLCG Operations Coordination Minutes - November 6th, 2014

Agenda

https://indico.cern.ch/event/341736/

Attendance

local: Nicolò Magini (secretary, CMS), Andrea Sciabà, Alessandro Di Girolamo (ATLAS), Andrea Manzi (Middleware officer), Maarten Litmaath (ALICE), Pepe Flix (PIC), Antonio Perez-Calero Yzquierdo (PIC), Nathalie Rauschmayr, Marian Babik, Felix Lee (ASGC), Marcin Blaszczyk, Dave Dykstra

remote: Alessandra Forti (chair), Yury Lazin (RRC-KI-T1), Andrej Filipcic (ATLAS), Massimo Sgaravatto, Michael Ernst (BNL), Jeremy Coles (GridPP), Rob Quick (OSG), Thomas Hartmann (KIT), Ulf Tigerstedt (NDGF), Di Qing, Alessandro Cavalli (INFN-T1), Renaud Vernet (CCIN2P3), Burt Holzman (FNAL)

Operations News

Reminder to look at the WLCG site operations survey and send feedback before the GDB. The survey will be made public afterwards.

Middleware News

The WLCG repository is planned to be signed starting next week
- the change will be announced to all WLCG sites
- it is backward-compatible: old configurations ignore the added signatures
- new repository files will be available in rpms that also provide the public key of the repository
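
The backward compatibility noted above can be illustrated with a minimal sketch, assuming a yum-style `.repo` file. The repository id, URL and key path below are hypothetical placeholders, not the actual WLCG repository settings. An old definition with `gpgcheck=0` keeps working and ignores signatures, while a new definition enables the check and points to the public key.

```python
import configparser

# Hypothetical repo definitions: repo id, baseurl and key path are placeholders,
# not the real WLCG repository configuration.
OLD_REPO = """
[example-wlcg]
name=Example WLCG repository
baseurl=http://repo.example.org/wlcg/sl6/$basearch
enabled=1
gpgcheck=0
"""

NEW_REPO = """
[example-wlcg]
name=Example WLCG repository
baseurl=http://repo.example.org/wlcg/sl6/$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-example
"""


def signature_checked(repo_text):
    """Return {repo_id: bool}: True if the stanza enables gpgcheck and names a key."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    result = {}
    for section in parser.sections():
        # A missing gpgcheck option is treated as "off" in this sketch.
        enabled = parser.get(section, "gpgcheck", fallback="0").strip() == "1"
        has_key = parser.has_option(section, "gpgkey")
        result[section] = enabled and has_key
    return result


if __name__ == "__main__":
    print("old config verifies signatures:", signature_checked(OLD_REPO))
    print("new config verifies signatures:", signature_checked(NEW_REPO))
```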

Useful Links:

Baselines:
- gfal/Lcgutil is unsupported as of 1 November (removed from the baseline table). The replacement is gfal2/gfal2-util, already available in the EMI repos and to be added to UMD this week (together with new metapackages for UI and WN).
- New reinstallation instructions for perfSONAR are available. All information has been broadcast by the Network and Transfer Metrics WG.
MW Issues:
- nothing to report
T0 and T1 services
- CERN
  - Castor upgrade to 2.1.14-15 by the end of November
- IN2P3
  - dCache upgrade to 2.6.36 on pools. xrootd configuration for FAX and AAA to respect EU privacy policy
  - FTS2 decommissioned
  - FTS3 upgraded to 3.2.29
- BNL
  - FTS3 upgraded to 3.2.29
- JINR-T1
  - srm-cms-mss.jinr-t1.ru upgrade to 2.10 on 07-11-2014 (dCache tape)
- KIT
  - Update of dCache for CMS and LHCb to 2.6.35. FAX configuration updated to meet EU privacy policy.
  - Update xrootd configuration for AAA to respect EU privacy policy pending
- NL-T1
  - dCache upgraded to 2.6.37
- RRC-KI-T1
  - Tape dCache upgraded to 2.6, LHCb dCache upgraded to 2.6.34
  - planning to upgrade disk dCache for ATLAS to 2.6 and tape dCache to 2.10
- Triumf
  - Tape System upgrade, adding more tape capacity on Nov. 12th

Pepe Flix adds that PIC is planning the upgrade to dCache 2.10 in December, pending the validation of the needed fix for the N2N plugin for the FAX federation.
NDGF is planning an upgrade of dCache from 2.10.9 to 2.10.10, and 2.11 is in testing. NDGF is not planning to use N2N.
Agreed to test integration of federation plugins with new storage versions as part of the middleware readiness activities.

Oracle Deployment

IT-DB is installing new hardware in the CERN computer centre and at Wigner.
Timeline: testing in October, move to production by the end of 2014. The schedule will be updated accordingly.
The following table includes only those DB services that concern WLCG.

Database	Comment	Destination	Upgrade plans, dates
ATONR	Data Guard for Atlas Online	Wigner	Done
ATLR	Data Guard for Atlas Offline	Wigner	Done
ADCR	Data Guard for ADC	Wigner	10-15 Nov
CMSR	Data Guard for CMS offline	Wigner	Done
LHCBR	Data Guard for LHCB Offline	Wigner
LCGR	Data Guard for WLCG	Wigner
CMSONR	Data Guard for CMS Online	Wigner	Done
LHCBONR	Data Guard for LHCB Online	Wigner
CASTORNS	Data Guard for Castor Nameserver	Wigner	Done
ATONR	Active Data Guard for Atlas Online	CERN CC	Done
ALIONR	Active Data Guard for Alice Online	CERN CC	10-15 Nov
ADCR	Active Data Guard for ADC	CERN CC	10-15 Nov
CERNDB1	CERNDB1/EDMSDB/ACCDB	CERN CC	Tuesday, 18th November 18:00-19:00

Upgrade Plans Table from Q1&Q2 2014

The CMSONR Data Guard in Wigner is not an active instance; it is used for backup. Dave Dykstra suggests also having a CMSONR Active Data Guard instance in Wigner, for redundancy with the Meyrin CMSONR ADG that serves conditions through Frontier. Marcin to follow up with the DB group.
Alessandro Di Girolamo gives a heads-up to the DB team about the upcoming migration from DQ2 to Rucio, which could change load patterns on ADCR.

Tier 0 News

Post mortems from recent ALARM tickets:
- VOMS server not providing proxy anymore (GGUS:109418): https://twiki.cern.ch/twiki/bin/view/DB/PostMortem18Oct14 (DBoDM)
- CVMFS installation server is down (GGUS:109500): https://twiki.cern.ch/twiki/bin/view/PESgroup/IncidentCVMFSRelaseNodes

Alessandro Di Girolamo points out that the experiments will review the criticality of services like VOMS, but the criticality of underlying services, e.g. DB on Demand, is the responsibility of the sites. Maite agrees that it needs internal discussion.

AFS UI: statistics after closing lxplus5. Some clients that showed up in random 3-second sampling: vocms228, vocms06, lx1b21.physik.rwth-aachen.de, vocms237, vocms31, it-hammercloud-submit-atlas-01

Full statistics on current AFS UI usage to be provided when possible.
Andrea Sciabà comments that vocms06/228 are going to be retired in a few days.

SNOW intervention this Saturday morning, GGUS will be available: a new major release of Service-now, called the Eureka release, will be applied at CERN next Saturday, November 8. The planned intervention and time schedule are described at: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0015527
During the first part of the intervention, approximately from 06:00 to 09:00, the system will be completely unavailable. During the second part, approximately from 09:00 to 12:00, performance may be degraded. *During the intervention, it is recommended to use GGUS to respond to WLCG tickets that originated in GGUS (instead of answering through SNOW)*. The ticket transfer from GGUS to SNOW will be interrupted, hence some tickets may get out of sync. This will be fixed after the upgrade. WLCG alarm tickets should not be impacted by the SNOW intervention.

voms-admin test cluster is again available here: https://voms-testing.cern.ch:8443
- We have opened a SNOW REQ to gather feedback, RQF:0386494
- feedback received from Alice

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

very high activity during the last 11 days
- 69k concurrent jobs reached Nov 5
recent changes caused each pilot (job agent) to exit after running a single task
- now it again runs as many tasks as it can fit
CERN: the batch team found a bug that caused the CEs to publish bad job numbers occasionally
- ALICE job submissions depend on such numbers being reasonably accurate
RAL: the ARC CEs now handle almost all ALICE jobs
- the SAM tests still need some work, should be fine soon
the new VOMS servers have been in use across WLCG for a few weeks

Maarten Litmaath comments that it's not yet clear if the bug in CE publication is CERN-specific or potentially affecting other LSF sites.

ATLAS

Workload on the Grid: over the past weeks there were not many jobs. This week the Grid is full, but we foresee a shortage of jobs again in a few days. MCORE slots are also affected by this lack of workload. MC15 will likely start at the end of the year.
Critical Services: we were contacted by WLCG ops coordination and will update the table, but some points (such as the underlying services on which the critical services rely) are outside this review.
Many stability issues with Panda and Rucio servers due to package upgrades (nss, openssl): disable automatic upgrades for critical services and apply them in a controlled manner on preprod/test boxes.
Rucio, Prodsys2 progress: migration to Rucio starting mid-November
Prodsys2 used for 10% of MC tasks, group production, analysis
- testing the MC production full chain in Prodsys2/Rucio ongoing, not fully completed yet
Multicore: still waiting for sites to enable it everywhere
- resource limits passing to the batch queues:
  - need an agreement on glue parameters
  - need example implementation for all the systems
  - APF and aCT ready

Andrea Sciabà asks that updates about service criticality be sent to him instead of being entered directly in the twiki.
Andrea will officially present the review of critical services at the GDB/MB once the information from all experiments is collected; sites can get a preview if interested.

CMS

This week is CMS Offline and Computing week (at CERN)
- Many people are even more busy
Processing overview:
- New production campaign of MINIAOD (Tier-1 and Tier-2 sites)
- Preparations for new campaigns (PHYS14 and another round of Upgrade) ongoing
List of critical WLCG services reviewed by CMS this week
- Computing coordination send update to Andrea
- Still to be understood from CMS side
  - CVMFS reliability when a Stratum-1 goes down (or falls behind in terms of status)
  - CVMFS Stratum-0 needs to include the machine used by CMS for publishing
Service Migrations
- AFS-UI
  - Were the access statistics redone after the closing of lxplus5?
- Testing of new VOMS server infrastructure
  - Little progress, no problems seen
    - CERN PhEDEx Debug transfers now running with proxy from voms2
  - Will ramp up to meet the known deadline (end of November)
Operational items
- Reconfiguration campaign of xrootd for European sites started beginning of this week
Data transfer test T0->Tier-1 and Tape staging exercise planned for November/December
- Details will be communicated to sites via CMS contacts (and relevant CMS HN)

LHCb

apologies from LHCb, due to the ongoing LHCb Computing Workshop we cannot attend the meeting today

Stripping 21 - "legacy Run1 data set" stripping
- pre-staging of input data is progressing well (~ 1/3 of 2012 data is staged)
- verification of the production is ongoing - start of the campaign estimated during next week
IPv6 - fixes done for the DIRAC framework which should enable it to run in IPv6 mode
- dedicated dual-stack certification node has been requested for larger scale testing
Firewalls for the new VOMS servers have been opened world-wide for LHCb.
- The upcoming LHCbDIRAC patch release will use both old and new servers
- If all goes well the release after will only use the new servers
voms-admin tests to be done

Ongoing Task Forces and Working Groups

gLExec Deployment TF

over the last few months the use of gLExec by PanDA has been completely revised and improved by Eddie Karavakis
for each site the usage can be configured:
- never
- always (mainly for testing)
- only when it works (as determined on the spot by the pilot)
a deployment campaign already covers the T1 sites and 17 T2
- to gain insight in the effects on stability and success rates across WLCG

Maarten and Alessandro clarify that the deployment campaign involves creating new Panda queues with gLExec enabled, so it doesn't require additional action from the sites. ATLAS will decide on the migration procedure based on the experience.
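
The three per-site settings listed above (never, always, only when it works) can be sketched as a small decision function. This is an illustration only, not the actual PanDA pilot code: the names `GlexecMode`, `use_glexec` and the `probe` callable are hypothetical, and the probe stands in for whatever on-the-spot check the pilot performs.

```python
from enum import Enum


class GlexecMode(Enum):
    """Hypothetical names for the three per-site settings in the minutes."""
    NEVER = "never"
    ALWAYS = "always"
    WHEN_WORKING = "only when it works"


def use_glexec(mode, probe):
    """Decide whether a pilot should run the payload through gLExec.

    `probe` is a callable returning True if a gLExec test succeeds on the
    worker node; it is only invoked in the WHEN_WORKING mode.
    """
    if mode is GlexecMode.NEVER:
        return False
    if mode is GlexecMode.ALWAYS:
        return True
    try:
        return bool(probe())
    except Exception:
        # A probe that crashes is treated the same as a failed probe.
        return False


if __name__ == "__main__":
    print(use_glexec(GlexecMode.WHEN_WORKING, lambda: True))
    print(use_glexec(GlexecMode.WHEN_WORKING, lambda: False))
```

In this sketch the "always" mode skips the probe entirely, so a broken gLExec installation would surface as job failures, which matches its stated use for testing.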

Machine/Job Features

Multicore Deployment

Passing parameters to the batch system: discussion started within the TF; so far only a few sites have participated.
ATLAS:
- 40% of the production resources in use have been occupied by multicore since the beginning of September.
- Still 37 sites to move. Work is needed on both the site and ATLAS sides to set up the queues.
- ATLAS is very interested in the outcome of the parameter-passing discussion. Setting them up on the ATLAS side will not take long once the parameters are decided.
CMS:
- Multithreaded reconstruction application ready; first test workflows run at PIC, now moving to submission tests at T1 scale
- Deployment to T2s: multicore pilot submission tests to be started with some candidate sites
Multicore TF presentation next week at the GDB

In reply to Alessandro's question, Andrea Sciabà and Alessandra Forti recall that the task force mandate includes deployment, so the TF itself can follow up with sites on the deployment of its recommendations on job attributes.
Alessandro asks if Panda can start sending job attributes. Alessandra comments that this will have no bad effects, since sites do not currently use them. Pepe and Antonio suggest waiting until the TF recommendations are more mature.
Alessandro points out that some recommended attributes are in the Glue2 schema, which is not deployed in OSG. Rob Quick comments that currently there is no use case for Glue2 in OSG; if experiments have a request it will be evaluated. Rob Quick will be involved in the TF for the Glue 1.3 <-> Glue 2 decision.
Rob Quick will involve the HTCondor-CE developers in the task force, to understand whether HTCondor-CE supports passing job attributes.

Middleware Readiness WG

Pepe Flix reminds that the PIC dCache test instance is not dimensioned for stress tests; this is OK for the MWR WG activities.

SHA-2 Migration TF

introduction of the new VOMS servers
- Ops and ALICE are done
- LHCb date to be agreed
- CMS - any news?
- ATLAS - any news?
- OSG will do a release on Nov 11 with the new servers added to the vomses files for ATLAS and CMS (GGUS:109265)
  - voms-proxy-init may then try the new servers first
  - it would be best if by that time the firewall is open also for those VOs!
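
Why the firewall matters once the new servers appear in the vomses files can be sketched as follows. This is a simplified illustration under stated assumptions: the hostnames, DNs and VO name are hypothetical, the `reachable` callable stands in for a real connection attempt, and servers are tried in listed order, which the real voms-proxy-init need not do.

```python
import shlex

# Hypothetical vomses entries (alias, host, port, server DN, VO name);
# the hosts and DNs are placeholders, not the real VOMS servers.
VOMSES = '''
"examplevo" "voms-new.example.org" "15001" "/DC=org/DC=example/CN=voms-new.example.org" "examplevo"
"examplevo" "voms-old.example.org" "15001" "/DC=org/DC=example/CN=voms-old.example.org" "examplevo"
'''


def parse_vomses(text):
    """Parse vomses-style lines of five quoted fields into dictionaries."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        alias, host, port, dn, vo = shlex.split(line)[:5]
        servers.append({"alias": alias, "host": host, "port": int(port),
                        "dn": dn, "vo": vo})
    return servers


def first_reachable(servers, reachable):
    """Return the first server for which reachable(host, port) is True, else None."""
    for server in servers:
        if reachable(server["host"], server["port"]):
            return server
    return None


if __name__ == "__main__":
    servers = parse_vomses(VOMSES)
    # Simulate a site whose firewall still blocks the new server.
    blocked_new = lambda host, port: host != "voms-new.example.org"
    print(first_reachable(servers, blocked_new)["host"])
```

With the new server listed first but blocked, the client in this sketch only succeeds after falling back to the old server; in practice each failed attempt may also cost a connection timeout.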

IPv6 Validation and Deployment TF

Squid Monitoring and HTTP Proxy Discovery TFs

OIM is now ready, as are the WLCG Squid Registration instructions
Waiting on Alastair Dewhurst to make a few changes to the automated MRTG monitor
Plan to start registration campaign later this month

Network and Transfer Metrics WG

WLCG OPS broadcast sent with new install/update instructions for perfSONAR 3.4+, requesting ALL sites to update before 8th of December
New perfSONAR configuration system in production, announced as part of the WLCG OPS broadcast
Two major milestones close to being finalized (T2.1 perfSONAR Commissioning/Operations, T2.3 perfSONAR Configuration)
Testing and validation of the perfSONAR data store ongoing - production date to be announced
Next perfSONAR operations meeting to be held next week (http://doodle.com/qydib32fkv48er2r)
Metrics area meeting to be announced

Marian to check read permissions on Jira issues.
Alessandro asks if experiments can start to use perfSONAR monitoring in operations. Marian answers that with the upcoming deployment the metrics should be usable; the documentation will include a troubleshooting guide for experiments at all levels: site perfSONARs, data store, maddash.

Action list

ONGOING on the experiments: check the usage statistics of the AFS UI, report on the use cases at the next meeting.
- Maite to provide full updated statistics after lxplus5 closure.
ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled on SAM CMS pre-production; the plan is to go to production next week - all sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM).
- Ongoing discussions on publication in AGIS for ATLAS.
ONGOING on experiment representatives - report on voms-admin test feedback
ONGOING on Andrea Sciaba - review the critical services table
CLOSED on MW officer - bridge communication between MW Readiness WG and EGI UMD Staged Rollout. The MW Officer is regularly attending UMD meetings; in the recent case of the dCache upgrade, the sites took the update directly from dCache.org rather than UMD to fix a critical bug.
ONGOING on Alessandro Di Girolamo - suggest to the monitoring team that it organize a common meeting with experiments to collect requirements for new SLS dashboards.
CLOSED on Marian Babik - organize meeting to define details of IPv6 testing activities with SAM and perfSONAR. Status: perfSONAR in dual-stack recommended to ALL sites in the current install/update instructions for perfSONAR 3.4+. SAM coordination meeting will be held tomorrow (IPv6 is on the agenda).

AOB

Rob Quick comments that OSG is ready to send multicore accounting data to APEL. Rob Quick and Alessandra Forti to contact APEL team to ask if APEL is ready to receive it.
Burt Holzman asks to review the information flow to OSG for the VOMS incident. Rob Quick suggests notifying the OSG GOC at goc@openscienceNOSPAMPLEASE.org, after which OSG will take care of broadcasting to its users. Agreed - mail to goc@openscienceNOSPAMPLEASE.org to be added to CERN procedures in case of outage of 'global' services like VOMS or myproxy.

Next meeting on November 20th

-- NicoloMagini - 2014-11-03

Topic revision: r25 - 2018-02-28 - MaartenLitmaath
