(2018-02-28, MaartenLitmaath)
---+!! WLCG Operations Coordination Minutes, November 19th 2015

%TOC{depth="4"}%

---++!! Highlights

   * All sites must patch their hosts for the NSS vulnerability as soon as possible, if they have not done so already.

---++ Agenda

   * https://indico.cern.ch/event/393620/

---++ Attendance

   * local: Maria Alandes (chair), Andrea Sciabà (minutes), Maarten Litmaath, Maite Barroso Lopez, Andrea Manzi, Marian Babik, Alessandro Di Girolamo
   * remote: Alessandra Doria, Michael Ernst, Jeremy Coles, Christoph Wissing, David Cameron, Raja Nandakumar, Renaud Vernet, Thomas Hartmann, Dave Mason, Vincenzo Spinoso, Alberto Aimar, Peter Gronbech

---++ Operations News

---++ Middleware News

   * Useful Links:
      * [[WLCGBaselineVersions][Baseline Versions]]
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines:
      * The decommissioning deadline for dCache 2.6.x was the end of September. 11 instances are still running, 6 of them used in WLCG. I have discussed with EGI opening tickets against the sites still running old versions. https://wiki.egi.eu/wiki/Software_Calendars#Decommissioning_Calendar_dCache_2.6.X
   * Issues:
      * A critical vulnerability affecting NSS was broadcast by SVG on Friday, November 6 (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183). All software whose SSL handshake is based on Mozilla Network Security Services (NSS) is affected; this includes RedHat 6 and 7 and their derivatives (for instance, libcurl uses NSS). EGI CSIRT set 2015-11-13 as the deadline for patching the hosts. Sites failing to act and/or failing to respond to requests from the EGI CSIRT team risk site suspension.
         * This problem is not limited to grid services: the CERN Security Team also sent an email this week asking all service admins to patch their hosts
   * T0 and T1 services
      * KIT
         * dCache upgraded to v 2.13.9
      * CERN
         * All EOS deployments upgraded to EOS 0.3.135-aquamarine
      * JINR
         * dCache upgraded to v 2.10.44

---++ Tier 0 News

   * The !LSF 9 upgrade of the WNs is in QA testing. The ATLAS Tier-0 !LSF instance has been upgraded to v9 and the clients are also in QA; ATLAS will decide when to upgrade them.
   * The HTCondor capacity represents some 5% of the total batch capacity at CERN; we plan to move more resources from !LSF to HTCondor fairly quickly, to reach some 20-25%. The two ARC CEs are declared obsolete.
   * The Kilo-1 configuration that resulted from performance optimization work done jointly with the cloud team is now running on some 100 lxbatch hosts, so far with very satisfactory results and no indication of any unwanted effect. It will be extended to all hosts when the Openstack Kilo release is deployed, estimated at the end of November.
   * IPv6 has been enabled in !MyProxy and !VOMS for testing purposes, in dual-stack mode (IPv4 and IPv6). Andrea S. reports that !VOMS does not work over IPv6 and that he got confirmation from the main developer. He will open a GGUS ticket for this problem.

---+++ DB News

---++ Tier 1 Feedback

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * generally normal to high activity
   * preparations for heavy ion reco jobs:
      * important changes in the code and workflow have been implemented to reduce the memory usage
      * they were tested with 2011 heavy ion reference data
      * if all goes well, for this year's heavy ion data the reco jobs will only need ~2.5 GB RAM
      * to be on the safe side, special arrangements were made with the sites that will receive heavy ion raw data
         * CNAF, KISTI, KIT and SARA have set up dedicated high memory queues
         * at CERN the jobs can request 2 cores and hence have twice the memory
         * all setups have been tested with normal jobs
         * we thank the sites for the good support!

Maarten adds that only real data taking will show how often events requiring a lot of memory will appear. It is understood that requesting two cores per slot will heavily affect the job CPU efficiency, but this is the price to pay. It might be that at CERN this will not be required, but it is too early to say.

---+++ ATLAS

   * Activity as usual
      * new record in parallel running slots: 250k, thanks to the impact of opportunistic resources like !Sim@P1 and NERSC_Edison (together they contributed more than 50k slots)
   * Frontier and Squid: during the past few days we observed that some of the jobs we are running now (mc15b campaign) are requesting an excessive amount of conditions data. This is causing trouble for some Squids and Frontier servers. The problem is understood and fixed; no new tasks like this will be launched. The existing ones are almost over, so we will let them finish.
   * Heavy Ion data taking: we are ready for it. Since the processing time of HI is huge, we are ready to use the Tier1s/Tier2s for reconstruction as well.
   * Deletion agents: deletion agents were switched off between Sunday night and Wednesday, to allow time to recover data which was scheduled for deletion but was actually needed by some people. Now the deletion agents have been restarted, but they are struggling to keep up with the high number of deletions.
   * PRODDISK has been decommissioned on all the Tier2s (and on the Tier3s that wanted it).

---+++ CMS

   * Preparations for Heavy Ion running continuing
      * No issues so far from the Computing side
   * Very high load in the system
      * Last week sustained ~120k parallel jobs
      * Multi-billion events MC RECO campaign ahead
      * Situation expected to stay like this for weeks

---+++ LHCb

   * Operations
      * Very high activity on distributed computing resources with user and simulation workflows
      * Some low-level data processing activity ongoing
      * LHCb will participate and take data in lead-ion runs until mid December
   * Issues
      * Several days of failures at SARA when the SRM was overloaded by a local user (GGUS:117413, GGUS:117483)
      * Issues with tape movers at RRCKI (GGUS:117444, GGUS:117267)
      * Security vulnerability reported with the LHCb setup script in CVMFS, which is sourced before every workflow. Under investigation.
   * Development / Outlook
      * Working on interface to HTCondor-CE

Raja and Maarten clarify that ALICE and LHCb are both in the same situation: their HTCondor-CE plugins are basically ready but not yet in production. Alessandro adds that ATLAS has already submitted jobs to the CERN HTCondor-CEs and that they are ready to be put in production.

---++ Ongoing Task Forces and Working Groups

---+++ gLExec Deployment TF

   * NTR

---+++ HTTP Deployment TF

The 5th TF meeting took place on 11th Nov - https://indico.cern.ch/event/459419

Minutes are attached to the agenda. The TF now has a working Nagios probe, endpoint lists from the experiments, regular monitoring of the infrastructure (see links on agenda) and a GGUS support unit. The TF is thus ready to do a "dry run" of its principal activity, helping sites to get their HTTP storage in shape.
In the next couple of weeks we will run with a small group of volunteer sites to test and optimise the process, which will then be used to ticket and support all remaining sites.

---+++ Information System Evolution

<br />%INCLUDE{ "EGEE.WLCGISEvolution" section="20151119" }%

---+++ IPv6 Validation and Deployment TF

<br />%INCLUDE{ "WlcgIpv6" section="20151119" }%

---+++ Middleware Readiness WG

<br />%INCLUDE{ "MiddlewareReadinessArchive" section="20151119" }%

---+++ Multicore Deployment

%INCLUDE{ "MulticoreTFReports" section="19112015" }%

---+++ Network and Transfer Metrics WG

<br />%INCLUDE{ "NetworkTransferMetrics" section="19112015" }%

---+++ RFC proxies

   * NTR

---+++ Squid Monitoring and HTTP Proxy Discovery TFs

   * Nothing to report

---++ Action list

| *Creation date* | *Description* | *Responsible* | *Status* | *Comments* |
| 2015-06-04 | Status of fix for Globus library (=globus-gssapi-gsi-11.16-1=) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets were opened for incorrect SRM and Myproxy certificates, 6 of which are already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well) |
| 2015-10-01 | Follow up on reporting of number of processors with PBS | John Gordon | ONGOING | |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |

Maarten adds that the host certificate issue is under control or already solved for all sites, except that he is still awaiting feedback from France and will ping them again.

---+++ Specific actions for experiments

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |

---+++ Specific actions for sites

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. [[http://go.web.cern.ch/go/C9xr][More information]] and [[https://indico.cern.ch/event/445782/contribution/13/attachments/1180967/1709690/proposal_to_sites.pdf][More details]] | ATLAS | - | - | None | - |

---++ AOB

Andrew !McNab will take over the coordination of the Machine/Job Features task force.

-- Main.AndreaSciaba - 2015-11-17