WLCGOpsMinutes160107 (2018-02-28, MaartenLitmaath)
---+!! WLCG Operations Coordination Minutes, January 7th 2016

%TOC{depth="4"}%

---++!! Highlights

   * [[https://indico.cern.ch/event/433164/][WLCG workshop]]: registration closes on 22nd January.
   * The HTTP TF will be able to close when 90% of the sites show a correct configuration without interruption for over a week. The TF opened GGUS tickets to give the sites [[https://twiki.cern.ch/twiki/bin/view/LCG/HTTPTFSAMProbe][all relevant instructions]].
   * The Multicore Deployment TF announced that WLCG users should mainly use the [[http://accounting.egi.eu/tier1.php][Tier1 and Tier2 views]], which now use the same data as the production portal (i.e. they include cores).

---++ Agenda

   * https://indico.cern.ch/event/466816/

---++ Attendance

   * local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Jerome Belleman, Julia Andreeva, Gavin McCance, Helge Meinhard, Oliver Keeble, Marian Babik, Xavier Espinal
   * remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Alessandra Forti, Di Qing, Renaud Vernet, Dave Mason, Daniele Bonacorsi, Antonio Yzquierdo, Josep Flix, Zoltan Mathe (LHCb), Javier Sanchez, Federico Melaccio, Anton Gamel, B. Jashal (T2_IN_TIFR)
   * apologies: Vincenzo Spinoso (EGI)

---++ Operations News

   * Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordination team at CERN, together with Pepe and Alessandra.
   * [[https://indico.cern.ch/event/433164/][WLCG workshop]]: registration closes on 22nd January.
   * Memory limits for batch queues: at the [[https://espace.cern.ch/WLCG-document-repository/Boards/MB/Minutes/MB-Minutes-151027-v3.pdf][MB of 27.10.2015]] it was decided to put an action on WLCG Operations to produce a set of recipes on how to best configure memory limits for batch queues.
     Operations Coordination will open a set of GGUS tickets to a selection of sites (mostly T1s and a few T2s). Please be ready to provide the necessary input. Thanks in advance.

---++ Middleware News

   * Useful links:
      * [[https://wlcg-mw-readiness.cern.ch/baseline/current/][Baseline Versions]]
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines:
      * As reported before the holidays, a problem affected dCache pool versions > 2.12 using Berkeley DB as the metadata backend, which could lead to data loss. For these particular installations the baselines are now dCache 2.12.28, 2.13.6 and 2.14.5; the upgrade details circulated by the dCache developers are available at https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt
   * Issues:
      * EGI advisory regarding a kernel vulnerability for SL6: https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7613. Sites are encouraged to update to the kernel released on Dec 15th.
   * T0 and T1 services:
      * JINR
         * dCache upgraded to v 2.10.48
      * TRIUMF
         * dCache upgraded to v 2.10.44

Maria Alandes asked whether !Redhat has released the openldap fixes that we tested successfully. The answer is 'not yet'.

---++ Tier 0 News

   * Condor: 86 kHS06 → 96 kHS06 out of a total of 784 kHS06 since 10 Dec

---+++ DB News

---++ Tier 1 Feedback

   *

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * *Best wishes for 2016!*
   * Normal to high activity levels during the break
   * *Thanks* to the sites for keeping things in good shape!
   * The first round of the heavy-ion reconstruction has finished!
   * CASTOR issues
      * Dec 18: alarm ticket GGUS:118443 because the transfer manager was stuck
         * Fixed later that afternoon, thanks!
      * Dec 31: team ticket GGUS:118554 because of the same problem
         * OK again since Jan 1 00:00, thanks!
      * Jan 5: team ticket GGUS:118619, ditto
         * Debugged live by the devs
         * The root cause has not been found yet
   * EOS issues
      * Dec 31: team ticket GGUS:118559 for EOS at CERN
         * Partly due to EOS-ALICE being ~full!
         * Some disk servers were unavailable
         * Mitigated by the admins, thanks!
   * KIT
      * Dec 31: tape SE working again, thanks!

---+++ ATLAS

   * Smooth operations over the whole Xmas break, almost steadily between 230k and 250k running parallel slots.
   * RAL had a few issues with their storage: GGUS:118451, GGUS:118573, GGUS:118631
   * The NET2 network interface was saturated because we put too much RAW data there for reprocessing.
   * Reprocessing:
      * The whole reprocessing campaign (around 1.8 PB of RAW input data) was almost completely finished during the Xmas break.
      * This is quite a remarkable result; in the past, comparable reprocessing campaigns took 4-6 weeks.
      * Thanks to the efforts of the sites, which were extremely stable during the Xmas period, and to some experts who made sure that the few issues were quickly understood and solved.
   * FTS3:
      * Another possibly quite dangerous bug.
      * A few lost files (registered in Rucio but not on storage) were noticed on Monday; it took a few days to understand the problem, and today an email was sent to the FTS developers.
   * Minor: some sites noticed that some jobs (very few, event generation using the MadGraph library) were causing trouble on the WNs where they ran.
      * This is because they produce a large amount of output, and the output log tarball contained data files.
      * The problem has been understood. A fix is needed in the ATLAS transformation, which can take a few weeks to be done and put in production, so we also decided to add a "safety" check to the pilot to make sure that this problem is caught before it creates trouble on the WNs. This fix will most probably be released in one week to 10 days from now.

Maria Alandes asked what the reasons were for the reduction in reprocessing time.
The reasons are multiple: many more cores, better network performance and improved software quality.

---+++ CMS

   * Happy New Year to everyone!
   * Rather high production load over the !Xmas break
      * Ran more than 100k jobs in parallel on many days
      * The HLT (High Level Trigger) contributed a few thousand cores
      * No major issues
   * Tier-0 / PromptRECO
      * The backlog of pending jobs was not fully cleared during the break
      * Partly due to lacking resources at CERN
         * Needed help from experts to provision fresh VMs: GGUS:118546
   * Tape operations
      * Had a rather long backlog of unapproved tape migrations at FNAL before the !Xmas break
         * Sorted out via the CMS site contacts
      * Some datasets were not moving at RAL
         * Improved now
         * Details: GGUS:118549

---+++ LHCb

   * Activities:
      * Monte Carlo and user analysis.
      * Pre-staging the data for re-stripping is almost finished.
   * Issue:
      * Problem pre-staging files at RRCKI
      * The Nickname VOMS attribute cannot be retrieved (GGUS:118361)

There was a discussion on why the above ticket has had no activity since Dec 16th and is in status "on hold". It should be followed up by !LHCb offline.

---++ Ongoing Task Forces and Working Groups

---+++ gLExec Deployment TF

   * NTR

---+++ Machine/Job Features TF

   *

---+++ HTTP Deployment TF

   * ETF is up and running in preprod
      * ATLAS: [[https://etf-atlas-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fview_name%3Dservicegroup%26selection%3Dc7bf885f-ed06-4a1c-af0e-d028b8f922fd%26optservice_group%3DHTTP%2520TF%2520Overview%26servicegroup%3DHTTP%2520TF%2520Overview%26mode%3Davailability][results]]
      * LHCb: [[https://etf-lhcb-preprod.cern.ch/etf/check_mk/index.py?start_url=%2Fetf%2Fcheck_mk%2Fview.py%3Fview_name%3Dservicegroup%26selection%3Dc7bf885f-ed06-4a1c-af0e-d028b8f922fd%26optservice_group%3DHTTP%2520TF%2520Overview%26servicegroup%3DHTTP%2520TF%2520Overview%26mode%3Davailability][results]]
   * A first set of tickets, around 20, has been assigned to sites.
      * http://cern.ch/go/h8Kl
   * The next TF meeting has been confirmed for 20th Jan: https://indico.cern.ch/event/473194/
      * The meeting will concentrate on setting up the operational plan for the campaign to get the monitoring green.

---+++ Information System Evolution

%INCLUDE{ "EGEE.WLCGISEvolution" section="20160107" }%

---+++ IPv6 Validation and Deployment TF

%INCLUDE{ "WlcgIpv6" section="20160107" }%

---+++ Middleware Readiness WG

%INCLUDE{ "MiddlewareReadinessArchive" section="20160107" }%

---+++ Multicore Deployment

%INCLUDE{ "MulticoreTFReports" section="07012016" }%

---+++ Network and Transfer Metrics WG

%INCLUDE{ "NetworkTransferMetrics" section="07012016" }%

---+++ RFC proxies

   * NTR

---+++ Squid Monitoring and HTTP Proxy Discovery TFs

   * NTR

---++ Action list

| *Creation date* | *Description* | *Responsible* | *Status* | *Comments* |
| 2015-06-04 | Status of fix for Globus library (=globus-gssapi-gsi-11.16-1=) released in EPEL testing | Maarten | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets were opened for incorrect SRM and !MyProxy certificates; most are already closed. OSG and EGI were contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for *CNAF* (in progress) and GGUS:118371 for *FNAL* (in progress). Maarten will follow up on the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th. |
| 2015-10-01 | Define a procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | CLOSE & Open New | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explained why implementing a Google calendar for future downtimes in the SSB as well is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and to use the current simple links for the OSG T1s. It was then agreed to close this action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale. At the Jan 7th meeting, Maria Alandes reported that she is in touch with GOCDB and more news will hopefully come next week. |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | CREATED | 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: the existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D., Maarten and Alessandra F. Status of Jan 12th: a new twiki BatchSystemsConfig was decided to be a better idea. Tickets opened. |

---+++ Specific actions for experiments

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |

---+++ Specific actions for sites

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |

---++ AOB

-- Main.MariaDimou - 2016-01-05
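As an illustrative footnote to the memory-limits action above (not part of the meeting discussion): a minimal sketch of the kind of per-batch-system recipe the planned twiki could collect, assuming an HTCondor site and a Slurm site. All knob values are placeholders, not agreed WLCG recommendations.

<verbatim>
# HTCondor (condor_config) -- placeholder values, not an agreed recommendation.
# Remove jobs whose resident set size (KiB) exceeds the memory they
# requested (RequestMemory is in MiB, hence the factor 1024):
SYSTEM_PERIODIC_REMOVE = (ResidentSetSize =!= UNDEFINED) && \
                         (ResidentSetSize > 1024 * RequestMemory)

# Slurm -- placeholder values, not an agreed recommendation.
# slurm.conf: default and maximum memory (MB) per allocated core,
# enforced via cgroups:
DefMemPerCPU=2000
MaxMemPerCPU=4000
TaskPlugin=task/cgroup
# cgroup.conf:
ConstrainRAMSpace=yes
</verbatim>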