---+!! WLCG Operations Coordination Minutes, December 17th 2015

%TOC{depth="4"}%

---++!! Highlights

   * A critical bug was found in dCache that may cause significant data loss. It affects versions 2.12.[0,27], 2.13.[0,15] and 2.14.[0,4] if !BerkeleyDB is used as backend. All concerned sites MUST apply the fix described in https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt as soon as possible!
   * All ATLAS sites that have not done so already should act ASAP on the tickets they received to run storage consistency checks.
   * Thanks to everybody for the good work in WLCG operations during 2015!

---++ Agenda

   * https://indico.cern.ch/event/393624/

---++ Attendance

   * local: Maria Dimou (chair), Andrea Sciabà (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Maite Barroso, Jerome Belleman, Julia Andreeva, Alessandro Di Girolamo
   * remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Vincenzo Spinoso, Hung-Te Lee
   * apologies: Andrew !McNab (MJF TF)

---++ Operations News

   * The MB asked Operations Coordination to produce a recommendation on how to configure memory limits for batch queues.
   * Our next meeting will take place on Jan 7th 2016.
   * A workshop for HTCondor and ARC CE users will be held in Barcelona, Spain, from Feb 29 to March 4, 2016. It is aimed at users and admins of HTCondor, HTCondor-CEs and ARC-CEs, and will feature several talks and tutorials as well as meetings with the developers. Proposals for contributions can be sent to hepix-condorworkshop2016-interest (at) cern (dot) ch. More information at https://indico.cern.ch/e/Spring2016HTCondorWorkshop

Maite announces that she will no longer represent the Tier-0. A successor (or a rota of them) will be chosen early next year.
Concerning the recommendations for configuring memory limits: the motivation is to improve on the current situation, where sites have to find out without any guidance how to set up memory limits for jobs, which also causes experiments to observe inconsistent behaviour across sites. It also has implications for purchasing new hardware. After some discussion, it was agreed that the first step will be to collect information from sites on their current setup (e.g. from the Tier-1s and any willing Tier-2s) and put it in a twiki. Tickets will be used only in case of insufficient feedback. Finally, recommendations will be given, depending on the batch system used and any other relevant factors.

Alessandro mentions that the HTCondor/ARC workshop will overlap with the ATLAS software week.

---++ Middleware News

   * Useful Links:
      * [[https://wlcg-mw-readiness.cern.ch/baseline/current/][Baseline Versions]]
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines:
      * gfal2 2.9.3 and gfal2-util 1.2.1 are now baselines. They introduce bug fixes and enhancements:
         * https://dmc.web.cern.ch/release/gfal2-2.9.3
         * https://dmc.web.cern.ch/release/gfal2-util-1.2.1
      * SL5 decommissioning: as also presented at the GDB, EGI set 30 April 2016 as the deadline for the decommissioning of SL5 services. WLCG sites in EGI still running SL5 services are therefore advised to start planning the upgrade to SL6/CentOS 7.
   * Issues:
      * As reported by Ulf, a quite serious bug affects dCache v2.12/2.13/2.14. Today dCache released a patch; we will contact the sites with more details ASAP.
      * Good news regarding the openldap crash affecting the top-level BDII and ARC-CE: a new set of rpms has been provided by !RedHat and tested at CERN, at DESY and by the ARC developers.
The issue finally seems to be solved, so we are now pushing !RedHat to release the new version ASAP.

   * A quite big issue affecting gfal2-2.10.2 (copies to/from SRM failed when using BDII resolution) was discovered only in production. A fix (gfal2-2.10.3) was immediately pushed to EPEL stable.

Maarten adds that, for the dCache bug, most likely not all sites are affected, as it depends on the local configuration. Also, dCache site admins are normally subscribed to the dCache admin forum and would thus already have been informed of all the details. Still, it would be good to send a WLCG broadcast about it (done by Andrea M). Alessandro mentions that NDGF was severely hit because the bug is triggered by disk-to-disk copies, and many disk servers were being decommissioned. A broadcast will be sent just after the meeting.

   * T0 and T1 services:
      * ASGC
         * CASTOR decommissioning planned for the end of the year
      * CNAF
         * plan to upgrade StoRM to 1.11.0 when released and to move from storm-http to the new storm-webdav
      * IN2P3
         * dCache upgraded to v2.13.4
      * NDGF
         * dCache upgraded to v2.14.4

---++ Tier 0 News

Jerome reports that the HTCondor pool now has more than 85 kHS06 of computing power (corresponding to about 10k slots).

---+++ DB News

---++ Tier 1 Feedback

   * NDGF-T1 had a good update to dCache 2.14 on Monday, but then noticed a bug that had been introduced in dCache 2.12.0 in January. It causes files that have been moved around within the storage system to lose the stickiness flag, marking the files available for garbage collection. This is OK (and default behaviour) for tape files, but not for disk files. So far we know of 1428 lost files. ALICE and ATLAS will get a list of files at some point this week. (Ulf writing in, since it's unclear if I can attend the meeting due to travel.) The bug has been fixed in dCache and affects the 2.12, 2.13 and 2.14 releases.

Ulf adds that also PIC and some German sites were hit by this bug.
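To make the affected version ranges from the highlights concrete (2.12.[0,27], 2.13.[0,15], 2.14.[0,4]), here is a minimal sketch of a check a site admin could run against their installed dCache version. This is a hypothetical illustration, not an official dCache tool:

```python
# Check whether a dCache version string falls inside the ranges affected
# by the BerkeleyDB data-loss bug: 2.12.0-2.12.27, 2.13.0-2.13.15,
# 2.14.0-2.14.4 (all bounds inclusive). Illustrative helper only.

AFFECTED = {
    (2, 12): range(0, 28),  # 2.12.0 .. 2.12.27
    (2, 13): range(0, 16),  # 2.13.0 .. 2.13.15
    (2, 14): range(0, 5),   # 2.14.0 .. 2.14.4
}

def is_affected(version: str) -> bool:
    """Return True if the given dCache version is in an affected range."""
    major, minor, patch = (int(x) for x in version.split("."))
    return patch in AFFECTED.get((major, minor), range(0))

print(is_affected("2.13.4"))   # True  (affected release)
print(is_affected("2.13.16"))  # False (first fixed 2.13 release)
```

Sites whose version tests as affected should apply the fix linked in the highlights.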
---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * The heavy-ion data taking has ended successfully!
      * Reconstruction and reprocessing will continue for many more weeks
   * The RSS memory usage has remained up to max ~2.5 GB
      * High-memory arrangements were undone also at KISTI and KIT
         * to allow more job slots to be used again, thanks!
      * The CASTOR team then rearranged the ALICE disk servers into a single pool
         * to allow convenient usage of all available resources, thanks!
   * Grid activity has been high
   * Expectations for the end-of-year break:
      * steady MC production
      * heavy-ion reconstruction
      * low analysis activity
   * *Thanks* to all sites and experts for another successful year!
   * Season's greetings and best wishes for 2016!

---+++ ATLAS

   * During the Christmas break:
      * Computing activities never stop: support will be on a best-effort basis.
      * In particular, as announced a few weeks ago, we started a reprocessing of the 2015 data: 1.8 PB of RAW, already prestaged on disk to minimize the burden on the Tier-1s.
         * Please check the slides we presented yesterday for more details: https://indico.cern.ch/event/460652/contribution/12/attachments/1205811/1757828/xmascoverage2015.pdf
   * FTS: we are still suffering from critical issues. Yesterday's problem is most probably related to the high prestaging activity (to prestage data for the reprocessing). We have asked the FTS developers to clarify the best course of action to minimize the issues over the Christmas break.
      * DDM has implemented an automatic restart in case of FTS issues: this will only mitigate the problem.
   * Storage consistency checks: dear sites, please answer the GGUS tickets. Overview: approx. 130 GGUS tickets submitted, 80 closed/verified, 50 still open, 30 of which without any answer yet!!
   * Merry Christmas and happy new year: many thanks to everybody for making it such a successful year!
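The FTS prestaging problem reported above can be mitigated on the client side by splitting a large staging request into several smaller ones. A minimal sketch of such chunking, where the 1000-file chunk size mirrors the limit introduced in FTS 3.4.0; everything else (file names, the idea of one request per chunk) is illustrative, not actual ATLAS or FTS client code:

```python
# Split a large list of files into prestaging requests of bounded size,
# so that no single FTS request carries an excessive number of files.
# The chunk size of 1000 mirrors the limit introduced in FTS 3.4.0;
# the rest is an illustrative sketch.

CHUNK_SIZE = 1000

def chunked(items, size=CHUNK_SIZE):
    """Yield successive sublists of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: 2500 files become three requests of 1000, 1000 and 500 files.
files = [f"lfn{i}" for i in range(2500)]
requests = list(chunked(files))
print([len(r) for r in requests])  # [1000, 1000, 500]
```

Each sublist would then be submitted as its own prestaging request instead of one monolithic one.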
Alessandro explains that the problem with FTS is that the limit on the number of files in a prestaging request is extremely high, so requests with too many files will slow down and possibly "collapse" FTS or the SE. Andrea M. adds that the latest release (3.4.0), now in the pilot, reduces the limit to 1000, but - as it contains several other changes - it will not be deployed in production until next year.

---+++ CMS

   * Heavy-ion run:
      * Took more data than originally planned
      * Pushed the DAQ, !StorageManager and PromptRECO to their limits
      * High load on CERN EOS
         * A few files were lost because they were deleted from buffer disks before being processed
      * Still a big backlog of unprocessed data
   * Big MC RE-DIGIRECO ongoing
      * Utilizing (large fractions of) CERN, the Tier-1s and most Tier-2s
   * "End of the year" data RE-RECO about to be released
   * Computing will continue with high load during the Christmas break
   * Many thanks for the support in 2015, a nice !Xmas break and already now the best wishes for 2016

---+++ LHCb

   * Pre-staging data for Stripping 24.
   * Aim to run Monte Carlo during the YETS, including on the HLT farm.

---++ Ongoing Task Forces and Working Groups

---+++ gLExec Deployment TF

   * NTR

---+++ Machine/Job Features TF

   * We have produced a 2nd draft of the HSF technical note and hope to be able to move it into the HSF approval process at the start of next year with no major changes. After that we will look at updating the reference implementations to match the note, with the aim of providing values for all the keys listed in the note.
---+++ HTTP Deployment TF

---+++ Information System Evolution

%INCLUDE{ "EGEE.WLCGISEvolution" section="20151217" }%

---+++ IPv6 Validation and Deployment TF

%INCLUDE{ "WlcgIpv6" section="20151217" }%

---+++ Middleware Readiness WG

%INCLUDE{ "MiddlewareReadinessArchive" section="20151217" }%

---+++ Multicore Deployment

%INCLUDE{ "MulticoreTFReports" section="17122015" }%

---+++ Network and Transfer Metrics WG

%INCLUDE{ "NetworkTransferMetrics" section="17122015" }%

---+++ RFC proxies

   * NTR

---+++ Squid Monitoring and HTTP Proxy Discovery TFs

   * Nothing new to report. The existing code for automating the monitoring based on GOCDB/OIM registration had broken, but it has been fixed again.

---++ Action list

| *Creation date* | *Description* | *Responsible* | *Status* | *Comments* |
| 2015-06-04 | Status of fix for Globus library (=globus-gssapi-gsi-11.16-1=) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets were opened for incorrect SRM and !MyProxy certificates, most of them already closed. OSG and EGI were contacted (Maarten alerted the few affected EGI sites as well). On Dec 17 there are 2 tickets open: GGUS:117043 for *CNAF* (in progress) and GGUS:118371 for *FNAL* (in progress). |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available; therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | CREATED | 1) create a twiki, 2) ask Tier-0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system |

Julia explains why implementing a Google calendar in the SSB also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and to use the current simple links for the OSG Tier-1s. It is then agreed to close this action and to open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to do some time ago. Next year they will be contacted to define a reasonable timescale.

---+++ Specific actions for experiments

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |

---+++ Specific actions for sites

| *Creation date* | *Description* | *Affected VO* | *Affected TF* | *Comments* | *Deadline* | *Completion* |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. [[http://go.web.cern.ch/go/C9xr][More information]] and [[https://indico.cern.ch/event/445782/contribution/13/attachments/1180967/1709690/proposal_to_sites.pdf][More details]] | ATLAS | - | Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) | None | CLOSED |

This action is closed, as it is being managed internally by ATLAS operations.

---++ AOB

Maria mentions that Andrea S. will no longer work in WLCG operations coordination from next year. She thanks Maite and Andrea for their contributions to WLCG operations coordination.

-- Main.MariaALANDESPRADILLO - 2015-12-15
Topic revision: r20 - 2018-02-28 - MaartenLitmaath