WLCGOpsMinutes151203 (2018-02-28, MaartenLitmaath)
---+!! WLCG Operations Coordination Minutes, December 3rd 2015

%TOC{depth="4"}%

---++!! Highlights
   * Please register for the WLCG workshop at https://indico.cern.ch/e/WLCG-Workshop-Lisbon-2016
   * RedHat just provided a fix for openldap, to avoid the current crashes affecting Top BDII and ARC-CE. Tests of this fix can now start.
   * The Information Systems (!InfoSys) TF Future Use Cases [[https://espace.cern.ch/WLCG-document-repository/Technical_Documents/WLCGFutureISUseCases_1.6.pdf][Document]] is now ready in the WLCG Document Repository.
   * The French and Spanish sites can now have host certificates compliant with the Globus change in validation; Spain will use the Terena CA, France has introduced the feature in its CA.
   * The MJF and !InfoSys TFs are harmonising their positions on the definitions of number of cores and HS06 values.
   * VOMS is not working with IPv6, probably due to a relatively recent change (GGUS:117987).

---++ Agenda
   * https://indico.cern.ch/event/393621/

---++ Attendance
   * local: Andrea Sciabà (chair), Maria Dimou (minutes), Maarten Litmaath, Andrea Manzi, Marc Slater, Jerôme Belleman, Maria Alandes.
   * remote: Alessandra Doria, Michael Ernst, Christoph Wissing, Dave Mason, Vincenzo Spinoso, Peter Gronbech, Andreas Petzold, Andrew !McNab, Di Qing, Massimo Sgaravatto, Ulf Bobson Severin Tigerstedt, Josep (Pepe) Flix, Julia Andreeva, Andrea Valassi, Gareth Smith, Antonio Yzquierdo, Jeremy Coles (part).
   * apologies: Catherine Biscarat, Alessandro Di Girolamo, David Cameron

---++ Operations News
   * WLCG workshop: Please register if you plan to go (https://indico.cern.ch/event/433164/overview).

---++ Middleware News
   * Useful Links:
      * [[WLCGBaselineVersions][Baseline Versions]]
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines:
      * NTR
   * Issues:
      * Some news from RedHat regarding the openldap crash affecting Top BDII and ARC-CE: today they provided us with a fix to test. We will communicate any news at the next ops or ops coord meetings.
   * T0 and T1 services
      * JINR
         * Minor postgres upgrade to 9.4.5
      * PIC
         * Major dCache upgrade to v 2.13.13

---++ Tier 0 News
   * Working on adding more capacity to the Condor cluster.
   * !LSF 9 software deployed on all worker nodes for both the ATLAS T0 instance and the main one.

---+++ DB News

---++ Tier 1 Feedback
   * CC-IN2P3 (France, C. Biscarat) - Globus host certificate validation change - our CA is now able to deliver compliant certificates, which allow aliases to be declared in the AltName. We are in the process of changing the certificate on the affected hosts (2 SRM servers).
   * PIC (Spain, J. Flix) - Globus host certificate validation change - PIC is now using TCS (Terena) certificates, and we have already solved the issues for all of the hosts with aliases. We are in the process of migrating the remaining machines to TCS certificates, which will happen rather soon.
   * PIC (Spain, J. Casals) - Fixed some issues in the Nagios plugin for SAM3: https://github.com/jcasals/nagios-plugins-lcgsam
   * NDGF-T1 (Nordics, Tigerstedt) will upgrade dCache to 2.14.x on 14.12.2015, full day outage.

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE
   * generally normal to high activity
   * so far the heavy ion run has been smooth from the grid perspective!
      * reco jobs run very successfully
      * their RSS memory consumption has remained at most ~2.5 GB
         * we have to see what happens at the planned higher beam intensities
      * this has allowed the use of normal queues at various T1s
      * at CERN the fraction of two-core jobs is being lowered in steps
   * CERN: submission to HTCondor CE in production since yesterday evening
   * CERN: TEAM ticket GGUS:118062 opened Monday evening. ALICE was severely impacted by an !OpenStack issue:
      * the standard build system could not be used to release analysis updates
      * a local mini build system was put together for the most urgent cases
      * thanks to the !OpenStack team for solving the complex issue as fast as possible!

---+++ ATLAS
   * !HeavyIon data taking: in general everything is OK, nothing to be particularly worried about. Tier-0 performance, in terms of events/second reconstructed by the whole cluster, is quite low (a few tens of Hz); huge I/O wait was observed on the Wigner nodes with spinning disks. Those nodes have now been configured by the ATLAS Tier-0 to run fewer jobs than in their standard batch configuration, and the performance improved.
   * Reprocessing test now ongoing: the plan is to launch a full reprocessing campaign on the 14th of December. The plan is quite tight since there are still problems with the release; please stay tuned to the Tuesday ADC weekly meetings in the coming weeks.

---+++ CMS
   * Heavy Ion Run is ongoing
      * Permission problem in EOS: GGUS:118027
         * Issue with mapping
         * Fixed by EOS team during the weekend - Thanks!
      * Some storage pools disappearing from the network: GGUS:118082, GGUS:118037
         * Investigated by CERN storage and network teams
         * Only seen by CMS?
      * CMS Tier-0 workflows are driving some CERN !OpenStack hardware to its limits: GGUS:118056
   * Staging problem at KIT: GGUS:117910
      * Led to too many queued transfers within the CMS transfer system
      * Quite some overall performance degradation, also affecting transfers where KIT is not involved
      * Situation is improving (at KIT and globally)
   * Had a little ticketing campaign for DPM sites to move to DPM 1.8.10
      * Earlier versions have issues with recent global/regional redirectors

---+++ LHCb
   * Operations
      * Currently processing pp reference run
      * Finished 13TeV pp data processing
      * Will be starting processing of Heavy Ion runs soon
      * Significant MC generation incoming
   * Issues
      * Problems with users accessing files at !IN2P3 were experienced last Tuesday pm. Initially assumed to be due to CA issues, as they went away at the same time, but more likely the fix was just coincidental (GGUS:118077)
      * Problem with RRCKI tape put offline and preventing access to certain files now solved
   * Developments
      * MC simulation workflows have been executed successfully on commercial clouds, on both DBCE (up to 600 simultaneous jobs running) and Azure (up to 1000 simultaneous jobs running, high rate of stalled jobs under investigation).

---++ Ongoing Task Forces and Working Groups

---+++ gLExec Deployment TF
   * NTR

---+++ Machine/Job Features TF
   * Ongoing discussions clarifying key/value pairs: some changes, some expanded definitions
   * Attempting to be consistent with WLCG Information Systems Evolution TF
   * 2nd draft of HSF technical note to record the communication procedure and the key/value pair definitions
   * Deployed at several batch sites, and many VM-based installations (all the ones using Vac/Vcycle)
   * DIRAC is reading time limit information from MJF in LHCb pilot jobs and pilot VMs.
   * Next steps: review experience with implementations and installations, and update in view of the technical note discussions.

---+++ HTTP Deployment TF

---+++ Information System Evolution
<br />%INCLUDE{ "EGEE.WLCGISEvolution" section="20151203" }%

Alessandra Doria (Napoli) expressed the sites' appreciation for the dissemination of the TF's definitions via the lcg-rollout mailing list and for the poll for feedback from the sites.

---+++ IPv6 Validation and Deployment TF
<br />%INCLUDE{ "WlcgIpv6" section="20151203" }%
   * The NAGIOS service set-up is still being tuned.
   * VOMS still doesn't work with !IPv6. There is a ticket to follow this up. No problem with voms-admin.
   * The ARGUS - !IPv6 status was discussed at the ARGUS collaboration meeting on December 2nd. Extract from [[https://indico.cern.ch/event/465818/][the minutes]]: _IPv6 support: not really tested but no problem expected as Java has a good IPv6 support and as ARGUS is binding to all interfaces/addresses. Sharing a lot of network code with VOMS Admin that is IPv6 compliant: the only known issues are with VOMS that is not using the same code._

---+++ Middleware Readiness WG
<br />%INCLUDE{ "MiddlewareReadinessArchive" section="20151203" }%

Full minutes are [[https://twiki.cern.ch/twiki/bin/view/LCG/MWReadinessMeetingNotes20151202][here]] with details per product and site. The new verified top-BDII is now in UMD (done by EGI).

---+++ Multicore Deployment
%INCLUDE{ "MulticoreTFReports" section="03122015" }%

---+++ Network and Transfer Metrics WG
<br />%INCLUDE{ "NetworkTransferMetrics" section="03122015" }%

---+++ RFC proxies
   * CMS have switched test pilot factories to RFC proxies

---+++ Squid Monitoring and HTTP Proxy Discovery TFs
   * Nothing to report

---++ Action list
| *Creation date* | *Description* | *Responsible* | *Status* | *Comments* |
| 2015-06-04 | Status of fix for Globus library (=globus-gssapi-gsi-11.16-1=) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 15 GGUS tickets were opened for incorrect SRM and Myproxy certificates, 6 already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). 5-6 tickets still open at the 2015-12-03 meeting, all for sites which have no technical obstacles to proceeding. |
| 2015-10-01 | Follow up on reporting of number of processors with PBS | John Gordon | CLOSED | Everyone uses the development instance. |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |

---+++ Specific actions for experiments
| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |

---+++ Specific actions for sites
| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. [[http://go.web.cern.ch/go/C9xr][More information]] and [[https://indico.cern.ch/event/445782/contribution/13/attachments/1180967/1709690/proposal_to_sites.pdf][More details]] | ATLAS | - | Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) | None | - |

---++ AOB

-- Main.MariaDimou - 2015-12-01
Topic revision: r37 - 2018-02-28 - MaartenLitmaath
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.