(2018-02-28, MaartenLitmaath)
---+!! WLCG Operations Coordination Minutes, December 1st 2016

%TOC{depth="4"}%

---++ Highlights
   * The Network and Transfer Metrics WG asks for experiments' participation at the dedicated Jan. 10th pre-GDB.
   * The RFC Proxies TF completed its mission and will be closed.
   * CMS is asking its sites to migrate to the FTS3 client by the end of the year, provided all needed !PhEDEx changes are completed.
   * The Lightweight Site survey http://wlcg-survey.web.cern.ch/survey/lightweight-sites is still open for contributions.
   * Sites and Experiments, please provide feedback for the Advance Warning of Long Shutdowns proposal [[https://twiki.cern.ch/twiki/pub/LCG/WLCGOpsMinutes161201/LongShutdowns.pptx][available HERE]] before Jan 5th by sending a mail to wlcg-ops-coord-chairpeople at cern dot ch.

---++ Agenda
   * https://indico.cern.ch/event/540424/

---++ Attendance
   * local: Maria Alandes (chairperson), Maria Dimou (notes), Alberto Aimar, Maarten Litmaath, Julia Andreeva, Marian Babik, Jérôme Belleman, Marcelo Soares, Andrea Manzi, Alexander Kryukov, Mark Slater, Andrew !McNab
   * remote: John Gordon, Alessandra Doria, Catherine Biscarat, Stephan Lammel, Di Qing, Dave Mason, Kyle Gross, Thomas Hartmann, Ulf Tigerstedt, Frédérique Chollet, Alessandra Forti, Christoph Wissing, Massimo Sgaravatto, Oliver Keeble, Renaud Vernet, Antonio Perez-Calero Yzquierdo, Carlos Acosta, Hung-te Lee, Robert Ball, Pepe Flix, Elena Korolkova.
   * Apologies: Nurcan Ozturk

---++ Operations News
   * Next WLCG Ops Coord meeting will be on 12th January 2017 (exceptionally skipping the 1st Thursday of the month rule!)
   * There will be a WLCG Workshop in 2017 (End June/Beginning July). Sites interested in hosting the workshop, please get in touch with =wlcg-ops-coord-chairpeople@cern.ch=. More details coming soon.

---++ Middleware News
   * Useful Links:
      * [[https://wlcg-mw-readiness.cern.ch/baseline/current/][Baseline Versions]]
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines/News:
      * As broadcast by EGI, sites should start using the *UMD4* repo instead of UMD3
         * see [[https://wiki.egi.eu/wiki/UMD3_UMD4_products]] for a list of products (un)supported in UMD4
      * UMD 4.3.0 has been released: [[http://repository.egi.eu/2016/11/10/release-umd-4-3-0/]] .
      * UMD 4.3.1 also released:
         * !CentOS7: =lcas-lcmaps-gt4-interface 0.3.1, lcmaps 1.6.6, lcas 1.3.19, glexec-wn 1.3.0, lcmaps-plugins 1.7.1 etc.=
         * SL6: =ARGUS 1.7=
         * We suggest sites move to =ARGUS 1.7= as it fixes instabilities (to be added as baseline at some point)
      * New versions of =xrootd-server-atlas-n2n-plugin= and =dcache-xrootd-n2n-plugin= pushed to the WLCG repo.
         * They fix a problem reported at GGUS:124689
      * =glite-yaim-core-5.1.4-1= released to WLCG repo (and under integration in UMD)
         * It fixes GGUS:125108 (affecting !CREAM) and adds EL7 support
      * =glite-yaim-clients-5.2.1-1= released to WLCG repo. It fixes some issues related to WLCG VOBOX on EL7
   * Issues:
      * NTR

Maarten reminded that the EMI3 repository is no longer maintained. Please move to UMD4 a.s.a.p.

   * T0 and T1 services
      * BNL
         * FTS upgraded to v 3.5.7
      * CERN
         * check T0 report
         * FTS upgraded to v 3.5.7; SOAP disabled on the Pilot cluster
      * INFN
         * If ATLAS does not recover the file protocol, they need to increase the number of gridftp servers
      * JINR
         * dCache minor upgrade 2.13.48 -> 2.13.49
      * KIT
         * Update of dCache for CMS to 2.13.48, which fixes issues with dccp pre-staging. Created cmsxrootd-2.gridka.de as a new CMS AAA redirector, though no feedback yet.
         * FAX services will be decommissioned at the next opportunity at KIT, though no date is fixed as of now
      * RAL
         * FTS upgraded to v 3.5.7
      * RRC-KI
         * dCache for ATLAS and LHCb disk storage upgraded to 2.16.18-1
         * Upgrade of dCache for the tape instance to 2.16.18-1 planned

---++ Tier 0 News
   * Some CC7-based capacity has been made available in HTCondor for local submission.
   * An issue with local job submission to HTCondor has been acknowledged by the Condor team; a fix is being prepared. Meanwhile a workaround has been deployed at CERN.
   * We are re-publishing data into APEL from April 2016 onwards. The new accounting summaries originate from the new system, and are hence expected to be more accurate.
   * During November about 5 PB were recorded into CASTOR. For the p-Pb run, CASTOR ALICE is now using a Ceph-based pool in addition to the standard disk resources. Progressively all the (re-)processing traffic has been moved to Ceph, completely replacing a 1.5 PB staging area.
   * Some instabilities in EOS have been observed and handled.
   * FTS has been upgraded to 3.5.7.
   * The network intervention on November 2nd did not cause any major disruption on the Tier-0 services.
   * A block corruption affected a table in COSMIC@LCG database system, which caused a downtime of the LCG database. Within four hours it was fully restored.
   * Backup is being rolled out for applications hosted in the Hadoop service.
   * An issue leading to long delays replicating the CMSONR database from P5 to the computer centre has been worked around by enabling compression.
   * A development instance of Apex 5 has been made available.
   * Progress is being made on the link between the GEANT PoP in Budapest and the Wigner data centre; we can thus expect the 3rd link to be commissioned soon.
   * We received an unexpected request to postpone maintenance work on the Wigner link until after the last week of heavy ion running in case it led to problems with data recording.
     Experiments should not be relying on the availability of this link to take data, as we cannot exclude situations where traffic would not be able to reach above 100 Gbps.
   * (For information - not a Tier-0 topic) Orange have completed optimisation of their network in the Pays de Gex and have good to very good coverage for 2G/3G/4G services. Some residual issues are being worked on.

---++ Tier 1 Feedback
   * NDGF-T1 recovered all but 8 files from the +35TB/1.2 million files from a broken raidset (3 disks lost from a RAID-6). The real cause was actually a SAS cable or the disk backplane, so the controller just lost contact with the 3 drives. Parts replaced, everything working again.

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE
   * HI data taking progressing as planned
   * Recorded data is being calibrated and quality-checked quasi-online
   * High to very high grid activity on average
   * Several incidents with job failures related to the Stratum-1 at ASGC:
      * Experiment package updates may be absent or late, presumably due to network issues
      * Also observed by ATLAS
      * The biggest ALICE sites in Asia have bypassed the Stratum-1 at ASGC to avoid random job failures
      * The CVMFS developers intend to make the client yet more robust in failing over to other <br /> instances when the default Stratum-1 cannot deliver a requested file for whatever reason
         * They also provided test commands that a pilot could run to check the health <br /> of the local CVMFS setup before committing itself to a user payload
      * The Stratum-0 hosts have been put on the OPN for better network connectivity
         * That appears to have improved the situation for ASGC
      * Intermittent staleness has also been observed for other Stratum-1 instances
         * In particular for =alice-ocdb.cern.ch= which is updated many times per day
         * Will be followed up further

Maarten highlighted the issues met with Stratum-1. Stratum-0 hosts are now on the OPN. Performance should be better.

---+++ ATLAS
   * Smooth data taking and processing for Heavy Ion run.
   * Many competing high-priority activities on the grid: MC production, derivation production for Moriond, Upgrade simulation, BPhys_delayed stream processing (ran successfully from tape without pre-staging), Release 21 validation.
   * Tier0 is running beam spot reprocessing.
   * Task force to study derivation production throughput was formed; CPU efficiency, memory usage, job settings, output merging options (slow vs fast) are being studied.
   * Migration of group datasets to tape ongoing. 2.7PB to be put on tape.
   * CVMFS service at TAIWAN was regularly falling behind in updating to the latest version of contents; running fine this week.

Nobody present or connected from ATLAS.

---+++ CMS
   * Heavy Ion Run
      * No major issues on the computing side
      * Recording data close to the DAQ limit in order to collect maximum number of events
   * Monte Carlo production
      * DigiReco jobs that in part also use the premixing technique
      * Moved T2 fair-share target temporarily from 50%:50% to 75%:25% to favor Moriond 2017 MC production over analysis
      * DigiReco part of the Moriond 2017 production uses the pileup premixing technique
      * Requesting sites to adjust cleaning procedure of /store/unmerged to accommodate long lasting workflows
   * !PhEDEx agents mandatory upgrade to version 4.2 is ongoing
   * GlideinWMS: two dedicated hi-performance VMs provided by CERN, being evaluated in Global Pool scalability test
   * CMSWEB: scaling issues being investigated
   * One bad worker node at GRIF produced corrupted files. Quite peculiar problem under deep investigation CMS GGUS:125142
   * DashBoard outage two weeks ago GGUS:125037, GGUS:125089

Andrea M. asked if CMS is able to move to the new FTS3 client. Stephan Lammel confirmed that all CMS sites are asked to move to FTS3 before the end of the year. Christoph said a !PhEDEx upgrade is needed first. Open Action. CMS will open GGUS tickets to sites. Check-point in January.

---+++ LHCb
   * All activities progressing well
   * Data taking continues, transfers from the PIT are ok
   * Heavy Ion processing ramping up
   * Some MC jobs running though waiting on more requests from the collaboration

---++ Ongoing Task Forces and Working Groups

---+++ Accounting TF
   * Task force proposal for changes in the accounting portal and accounting reports has been approved by the MB
   * Changes will be implemented and validated during next week
   * For November two sets of Reports will be sent around, current ones and new ones (generated by the portal).
   * In January we should be able to switch to the new reports

Julia emphasised that Ivan Diaz, developer, is progressing well. The November reports will be sent in December in two versions.

---+++ Information System Evolution
<br />%INCLUDE{ "EGEE.WLCGISEvolution" section="20161201" }%

Maria A. highlighted the point on REBUS.

---+++ IPv6 Validation and Deployment TF
<br />%INCLUDE{ "WlcgIpv6" section="20161201" }%

No report.

---+++ Machine/Job Features TF
   * DB12 benchmark included in mjf-scripts distribution for all batch platforms

---+++ Monitoring

Full report later today.

---+++ MW Readiness WG
<br />%INCLUDE{ "MiddlewareReadinessArchive" section="20161201" }%

---+++ Network and Transfer Metrics WG
<br />%INCLUDE{ "NetworkTransferMetrics" section="01122016" }%

Marian highlighted the 10 Jan 2017 pre-GDB. Please send comments to the WG. LHCb and CMS to announce participants. Invitation was sent to the experiment computing coordinators. CMS participation is highly desired.

---+++ RFC proxies
   * The next major version of ARC will __not__ support _legacy_ proxies!
      * expected to be released early next year
      * see Mattias Wadenstein's [[http://indico.cern.ch/event/394788/contributions/2357322/attachments/1367917/2073109/20161109-LegacyProxies.pdf][presentation]] in the [[http://indico.cern.ch/event/394788/][Nov GDB]]
   * RFC proxy failures for some services and some CAs (GGUS:124650)
      * certificates of affected CAs have the non-repudiation flag set
         * so far only !GridCanada certificates were seen to be affected
         * there exist more such CAs
      * affected services are dCache SRM < 2.14 and !BeStMan/EOS SRM
         * the consensus is that the fault lies in JGlobus, used by both
         * JGlobus is not officially supported by anyone these days
         * the dCache team and OSG are looking into "private" builds with an easy fix
   * !VOMS clients 3.0.7 and YAIM core 5.1.3 are fully available now
      * UMD-3 updated Nov 7
      * UMD-4 updated Nov 10 for SL6
      * EPEL 7 has !VOMS for EL7
      * WLCG repo has YAIM core for EL7
      * RQF:0675511 has been opened to get =lxplus= fixed (not yet done):
         * latest !VOMS clients
         * correct =GT_PROXY_MODE= environment variable
   * Experiments may want to check where legacy proxies are still being used
      * and switch those areas to RFC proxies at their convenience

Maarten emphasised the Nov. presentation by Mattias. The JGlobus issue described in the GGUS ticket is still pending but is being followed up by the dCache developers and OSG. He also mentioned the !VOMS client news and the ticket to get lxplus updated. This TF can be closed now.

---+++ Squid Monitoring and HTTP Proxy Discovery TFs
   * CMS is now using the WLCG WPAD service in production, at about half a dozen opportunistic sites
   * The service was slightly expanded to cover all the organizations that have a squid in a multi-organization site (such as the ATLAS Midwest Tier2), with the closest squid first

Dave not connected.

---+++ Traceability and Isolation WG

---++ Theme: Lightweight sites' survey results by Maarten

Maarten walked through the slides, which are linked from the agenda, with the key dates of the survey and reminders. 51 sites have responded so far. Two sites responded twice with different contents each time. Two T1s also covered the T2s they represent. Some T3s and/or sites with potential participation in WLCG ("tourist" sites) also replied. The questionnaire is still open and will remain so for a few more weeks. There is at least one site that has a very different configuration from the others, e.g. no CE, only based on VMs. Some questions may have been misunderstood by the sites, though each question came with some explanatory context. Kudos to the listed sites for taking the survey! The details of individual answers will not be publicly disclosed. Puppet is not used everywhere, e.g. some sites prefer Ansible.

Both Marias said [[http://wlcg-ops.web.cern.ch/][the WLCG Ops portal]] can be used (comment on slide 7 about 'Improve documentation') to carry *links* to the existing locations (twikis or other pages). The removal of obsolete documentation is on Maarten's virtual to-do list. *NB!* The previous survey led to a request from sites for an entry point to documentation, which is why we developed the wlcg-ops portal. See https://twiki.cern.ch/LCG/WLCGSiteSurvey

---++ Theme: HTCondor Accounting by John Gordon

John walked through the slides, which are linked from the agenda. Julia asked who coordinates the different activities. John said _nobody_. It would be good to form a Task Force to follow this up. Julia and Maria A. suggested this issue should be discussed offline, as it is not only an Accounting issue but a more general HTCondor deployment follow-up. Pepe asked when this work will start so that PIC can adopt the same solution. They will be in touch offline as well.

---++ Theme: Sites data in the new monitoring (MONIT) by Alberto

Alberto walked us through the slides, which are linked from the agenda. What is shown in the Unified Monitoring slides is available already or, for some parts, will be very soon. The variety of tools adopted (kibana, grafana etc.) was needed because they complement each other in functionality. John asked if the FTS data come from sites other than the FTS server at CERN. The answer is yes: the data come from all the FTS servers and cover all the WLCG transfers. Access requires a CERN login. Alberto asked for volunteer site managers to make such dashboards. Alessandra asked for a simple view for sites so that kibana knowledge is not needed. Maarten suggested that ATLAS and CMS, the main dashboard users, ask a couple of their sites to assist Alberto in configuring standard dashboards to spare other users the kibana internals. Stephan asked for some standard views, which would be useful for CMS. Julia said the Site Status Board requires a lot of data aggregation. Alberto said SAM aggregations will be started towards the end of Q1 2017.

---++ Action list

| *Creation date* | *Description* | *Responsible* | *Status* | *Comments* |
| 01.09.2016 | Collect plans from sites to move to EL7 | WLCG Operations | On-going | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF will stay on SL6 for now but they plan to go directly to EL7 early 2017. Other ATLAS sites e.g. Triumf are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. |
| 03.11.2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Searching for the doc location and editor. Is it in the EGI wiki? |
| 03.11.2016 | Discuss internally on how to follow up long term strategy on experiments data management as raised by ATLAS | WLCG Operations | Pending | |
| 03.11.2016 | Check status, action items and reporting channels of the Data Management Working Group | WLCG Operations | Pending | Julia gives an update on behalf of Oliver |

---+++ Specific actions for experiments

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |
| 29.04.2016 | Unify HTCondor CE type name in experiments VOfeeds | all | - | Proposal to use HTCONDOR-CE. In progress for ALICE. Raja will ask the status for LHCb. | | Ongoing |
| 03.11.2016 | Proposal for advance warning of long site downtimes | All | - | [[https://twiki.cern.ch/twiki/pub/LCG/WLCGOpsMinutes161201/LongShutdowns.pptx][Proposal]] from WLCG Ops ready. Please check and give feedback | 5th January 2017 | Proposal DONE. Waiting for feedback from Experiments |
| 01.12.2016 | Open tickets to sites for moving to FTS3 client | | | There are !PhEDEx prerequisites | Year End 2016 | January 2017 |

---+++ Specific actions for sites

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |
| 01.12.2016 | Proposal for advance warning of long site downtimes | All | - | Please give feedback to this [[https://twiki.cern.ch/twiki/pub/LCG/WLCGOpsMinutes161201/LongShutdowns.pptx][proposal]] | 5th January 2017 | In progress |

---++ AOB

-- Main.MariaDimou - 2016-11-17
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.