WLCG Operations Coordination Minutes, December 17th 2015

Highlights

  • A critical bug was found in dCache that may cause significant data loss. It affects versions 2.12.0 through 2.12.27, 2.13.0 through 2.13.15, and 2.14.0 through 2.14.4 if BerkeleyDB is used as the backend. All concerned sites MUST apply the fix described in https://twiki.cern.ch/twiki/pub/LCG/WLCGBaselineVersions/dcache-bug.txt as soon as possible!
  • All ATLAS sites that have not yet done so should act ASAP on the tickets they received asking them to run storage consistency checks
  • Thanks to everybody for the good work in WLCG operations during 2015!

Agenda

Attendance

  • local: Maria Dimou (chair), Andrea Sciabà (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Maite Barroso, Jerome Belleman, Julia Andreeva, Alessandro Di Girolamo
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Vincenzo Spinoso, Hung-Te Lee
  • apologies: Andrew McNab (MJF TF)

Operations News

  • The MB asked Operations Coordination to produce a recommendation on memory limits configuration for batch queues.
  • Our next meeting will take place on Jan. 7th 2016.
  • Workshop for HTCondor and ARC CE users in Barcelona, Spain, from February 29 to March 4, 2016, aimed at users and admins of HTCondor, HTCondor-CEs and ARC-CEs. It will feature several talks and tutorials, as well as meetings with the developers. Proposals for contributions can be sent to hepix-condorworkshop2016-interest (at) cern (dot) ch. More information at https://indico.cern.ch/e/Spring2016HTCondorWorkshop.

Maite announces that she'll no longer represent the Tier-0. A successor (or a rota of them) will be chosen early next year.

Concerning the recommendations for configuring memory limits: the motivation is to improve on the current situation, where sites have to figure out without any guidance how to set up memory limits for jobs, which also causes experiments to observe inconsistent behaviour across sites. It also has implications for purchasing new hardware.

After some discussion, it was agreed that the first step will be to collect information from sites on their current setup (e.g. from the Tier-1s and any willing Tier-2s) and put it in a twiki. Tickets will be used only in case of insufficient feedback. Finally, recommendations will be given, depending on the batch system used and any other relevant factors; a generic illustration of per-job memory enforcement is sketched below.
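
As a generic illustration (not a recommendation for any particular batch system), here is a minimal sketch of per-job memory enforcement: a watchdog that polls a job's resident set size from /proc and terminates the job when it exceeds an assumed limit. The 2 GB limit and the polling interval are illustrative assumptions; real batch systems implement this natively with their own knobs and units.

    import os
    import signal
    import sys
    import time

    RSS_LIMIT_KB = 2 * 1024 * 1024  # assumed per-job limit: 2 GB of resident memory
    POLL_SECONDS = 60               # assumed polling interval


    def rss_kb(pid):
        """Return the resident set size of a process in kB (Linux /proc)."""
        with open('/proc/%d/status' % pid) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    return int(line.split()[1])  # VmRSS is reported in kB
        return 0


    def watch(pid):
        while True:
            try:
                usage = rss_kb(pid)
            except IOError:
                return  # the job has already exited
            if usage > RSS_LIMIT_KB:
                os.kill(pid, signal.SIGTERM)  # over the limit: terminate the job
                return
            time.sleep(POLL_SECONDS)


    if __name__ == '__main__':
        watch(int(sys.argv[1]))  # usage: watchdog.py <pid of the job>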

Alessandro mentions that the HTCondor/ARC workshop will overlap with the ATLAS software week.

Middleware News

  • Issues:
    • As reported by Ulf, a quite serious bug is affecting dCache v2.12/2.13/2.14. Today dCache released a patch; we will contact the sites with more details ASAP.
    • Good news regarding the OpenLDAP crash affecting the top BDII and the ARC-CE: a new set of rpms has been provided by Red Hat and tested at CERN, at DESY and by the ARC devs. The issue seems to be finally solved, so we are now pushing Red Hat to release the new version ASAP.
    • A quite big issue affecting gfal2-2.10.2 (copies to/from SRM failed when using BDII resolution) was discovered only in production. A fix was immediately pushed to EPEL stable (gfal2-2.10.3); a sketch of the affected copy pattern follows this list.
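
For illustration, a minimal sketch of the copy pattern hit by the regression, using the gfal2 Python bindings; the endpoints are placeholders, and the actual fix is simply updating to gfal2-2.10.3.

    import gfal2

    ctx = gfal2.creat_context()  # 'creat_context' is the binding's actual spelling
    params = ctx.transfer_parameters()
    params.timeout = 300  # seconds; illustrative value

    # SRM URLs without explicit endpoint details rely on BDII resolution,
    # the code path that failed in gfal2-2.10.2 (hypothetical endpoints below).
    ctx.filecopy(params,
                 'srm://source.example.org/dpm/example.org/home/vo/file.root',
                 'srm://dest.example.org/dpm/example.org/home/vo/file.root')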

Maarten adds that, for the dCache bug, most likely not all sites are affected, as it depends on the local configuration. Also, dCache site admins are normally subscribed to the dCache admin forum and would thus already have been informed of all the details. Still, it would be good to send a WLCG broadcast about it, and Andrea M. did so just after the meeting. Alessandro mentions that NDGF was severely hit by the bug because it occurs when doing disk-to-disk copies, and many disk servers were being decommissioned.
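
For sites that want a quick first look, here is a rough sketch (not a substitute for the official fix instructions linked above) that scans pool configuration for the BerkeleyDB metadata backend. The property name pool.plugins.meta and the file locations are assumptions that may differ per installation.

    import glob

    # Assumed locations of pool setup files and the main configuration;
    # adjust to the local layout.
    CANDIDATES = glob.glob('/var/lib/dcache/*/setup') + ['/etc/dcache/dcache.conf']

    for path in CANDIDATES:
        try:
            with open(path) as f:
                for line in f:
                    # assumed property name selecting the metadata backend
                    if 'pool.plugins.meta' in line and 'BerkeleyDB' in line:
                        print('%s: BerkeleyDB metadata backend in use -> apply the fix' % path)
        except IOError:
            pass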

  • T0 and T1 services
    • ASGC
      • CASTOR decommissioning planned for the end of the year
    • CNAF
      • plan to upgrade StoRM to 1.11.0 when released and to move from storm-http to the new storm-webdav
    • IN2P3
      • dCache upgraded to v 2.13.4
    • NDGF
      • dCache upgraded to v 2.14.4

Tier 0 News

Jerome reports that now the HTCondor pool has more than 85 kHS06 of computing power (corresponding to about 10K slots).

DB News

Tier 1 Feedback

  • NDGF-T1 had a good update to dCache 2.14 on Monday, but then noticed a bug that had been introduced in dCache 2.12.0 in January. It causes files that have been moved around within the storage system to lose their sticky flag, which marks the files as available for garbage collection. This is OK (and the default behaviour) for tape files, but not for disk files. So far we know of 1428 lost files. ALICE and ATLAS will get a list of the files at some point this week. (Ulf wrote in, as it was unclear whether he could attend the meeting due to travel.) The bug has been fixed in dCache; it affects the 2.12, 2.13 and 2.14 releases.

Ulf adds that PIC and some German sites were also hit by this bug. One way a site might check for affected replicas is sketched below.
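
A hedged sketch (not the official NDGF procedure): query the SRM locality of supposedly disk-resident files via the gfal2 Python bindings and flag anything no longer ONLINE. 'user.status' is the gfal2 extended attribute exposing SRM locality; the SURL is a placeholder.

    import gfal2

    ctx = gfal2.creat_context()

    surls = [
        'srm://srm.example.org/pnfs/example.org/data/vo/disk/file1.root',  # hypothetical
    ]

    for surl in surls:
        try:
            status = ctx.getxattr(surl, 'user.status')  # e.g. ONLINE, NEARLINE
        except gfal2.GError as e:
            print('%s: lookup failed (%s)' % (surl, e))
            continue
        if 'ONLINE' not in status:
            print('%s: locality is %s, candidate lost replica' % (surl, status))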

Tier 2 Feedback

Experiments Reports

ALICE

  • The heavy ion data taking has ended successfully!
    • Reconstruction and reprocessing will continue for many more weeks
    • The RSS memory usage has stayed within ~2.5 GB
    • The high-memory arrangements were also undone at KISTI and KIT
      • to allow more job slots to be used again, thanks!
  • The CASTOR team then rearranged the ALICE disk servers into a single pool:
    • to allow convenient usage of all available resources, thanks!
  • Grid activity has been high
  • Expectations for the end-of-year break:
    • steady MC production
    • heavy ion reconstruction
    • low analysis activity

  • Thanks to all sites and experts for another successful year!
  • Season's greetings and best wishes for 2016!

ATLAS

  • During the Xmas break:
  • FTS: we are still suffering from critical issues. The latest, yesterday, was most probably related to the high prestaging activity (prestaging data for reprocessing). We have asked the FTS devs to clarify the best course of action to minimise the issues over the Xmas break.
    • DDM has implemented an automatic restart in case of issues from FTS: this will only mitigate the problem.
  • Storage consistency checks: dear sites, please answer the GGUS tickets. Overview: approximately 130 GGUS tickets were submitted in total; 80 are closed or verified and 50 are still open, 30 of which have not received any answer yet! The core of such a check is sketched after this list.
  • Merry Xmas and happy new year: super thanks to everybody for making it such a successful year!
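
For illustration, a minimal sketch of the core of a storage consistency check as referenced above: it compares a site storage dump against a catalogue dump and reports dark data (on storage, not in the catalogue) and lost files (in the catalogue, not on storage). The one-path-per-line input format is an assumption; real dumps carry more fields.

    import sys


    def load(path):
        """Read one file path per line into a set."""
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())


    def main(storage_dump, catalogue_dump):
        on_disk = load(storage_dump)
        in_catalogue = load(catalogue_dump)
        for f in sorted(on_disk - in_catalogue):
            print('DARK %s' % f)  # on storage but unknown to the catalogue
        for f in sorted(in_catalogue - on_disk):
            print('LOST %s' % f)  # in the catalogue but missing from storage


    if __name__ == '__main__':
        main(sys.argv[1], sys.argv[2])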

Alessandro explains that the problem with FTS is that the limit on the number of files in a prestaging request is extremely high, and requests with too many files will slow down and possibly "collapse" FTS or the SE. Andrea M. adds that the latest release (3.4.0), now in the pilot, reduces the limit to 1000, but, as it contains several other changes, it will not be deployed in production until next year.
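
As an illustration of the mitigation, a hedged sketch of client-side chunking so that no single bring-online request exceeds the new limit. The 1000-file figure comes from the discussion above; submit_staging is a hypothetical placeholder for the actual FTS submission call.

    MAX_FILES_PER_REQUEST = 1000  # limit enforced by FTS 3.4.0, per the minutes


    def chunked(files, size=MAX_FILES_PER_REQUEST):
        """Yield successive slices of at most `size` files."""
        for i in range(0, len(files), size):
            yield files[i:i + size]


    def prestage(all_files, submit_staging):
        for chunk in chunked(all_files):
            submit_staging(chunk)  # one bring-online request per chunk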

CMS

  • Heavy Ion run
    • Took more data than originally planned
    • Pushed the DAQ, StorageManager and PromptRECO to their limits
    • High load on CERN EOS
    • A few files were lost because they were deleted from the buffer disks before being processed
    • Still a big backlog of unprocessed data
  • Big MC RE-DIGIRECO ongoing
    • Utilizing (large fractions of) CERN, Tier-1s, most Tier-2s
  • "End of the Year" data RE-RECO about to be released
  • Computing will continue with high load during Xmas break

  • Many thanks for the support in 2015, a nice Xmas break, and already now the best wishes for 2016

LHCb

  • Pre-staging data for Stripping 24.
  • Aim to run Monte Carlo during the YETS, including on the HLT farm.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features TF

  • We have produced a 2nd draft of the HSF technical note and hope to be able to move it into the HSF approval process at the start of next year with no major changes. After that we will look at updating the reference implementations to match the note, with the aim of providing values for all the keys listed in the note. A sketch of how a payload could consume these keys is shown below.
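
For illustration, a minimal sketch, under the MJF conventions, of how a payload could read such values: each key is a small file in the directory pointed to by the $MACHINEFEATURES or $JOBFEATURES environment variable. The key names used (hs06, jobslots, wall_limit_secs) are examples; the technical note defines the authoritative list.

    import os


    def feature(scope_env, key):
        """Return the value of one MJF key, or None if unavailable."""
        base = os.environ.get(scope_env)
        if not base:
            return None  # MJF not provided on this resource
        try:
            with open(os.path.join(base, key)) as f:
                return f.read().strip()
        except IOError:
            return None  # key not published here


    print('HS06 per machine :', feature('MACHINEFEATURES', 'hs06'))
    print('Job slots        :', feature('MACHINEFEATURES', 'jobslots'))
    print('Wall limit (s)   :', feature('JOBFEATURES', 'wall_limit_secs'))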

HTTP Deployment TF

Information System Evolution


  • A proposal for a new WLCG IS based on AGIS was presented at the last GDB.
    • Ongoing discussions with experiments to understand their interest in this new IS.
    • The proposal will be presented at the MB next year to see whether it gets approved.
  • In the meantime, the following activities are ongoing within the TF:
    • Ongoing discussion to agree on better definitions of the GLUE 2 attributes for HS06 (GLUE2BenchmarkValue) and logical CPUs (GLUE2ExecutionEnvironmentLogicalCPUs): feedback from sysadmins is being collected on two possible definitions (see the query sketch after this list).
    • A proposal to validate information at its source, so that information known to be wrong is never published, was presented at the last UMD meeting. A technical solution will have to be worked out together with the MW developers.
  • Preparing the IS session at the WLCG workshop in February together with Alessandra Forti, who will chair it and is gathering feedback on what to discuss.
  • Next IS TF meeting scheduled for Friday, January 8th (preliminary agenda).
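
As referenced above, a hedged sketch of how the two GLUE 2 attributes can be inspected in a top BDII, here by driving the standard ldapsearch client from Python. The endpoint lcg-bdii.cern.ch:2170 and the o=glue base are the conventional top-BDII settings; adjust as needed.

    import subprocess

    # Query GLUE2Benchmark objects (which carry GLUE2BenchmarkValue); a similar
    # query on objectClass=GLUE2ExecutionEnvironment would return
    # GLUE2ExecutionEnvironmentLogicalCPUs.
    QUERY = [
        'ldapsearch', '-x', '-LLL',
        '-H', 'ldap://lcg-bdii.cern.ch:2170',  # conventional top-BDII endpoint
        '-b', 'o=glue',                        # GLUE 2 branch of the BDII
        '(objectClass=GLUE2Benchmark)',
        'GLUE2BenchmarkType', 'GLUE2BenchmarkValue',
    ]

    print(subprocess.check_output(QUERY).decode())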

IPv6 Validation and Deployment TF


Middleware Readiness WG


The JIRA dashboard shows, per experiment and per site, the product versions pending Readiness verification. Changes since the Ops Coord. meeting of Dec. 3rd:

  • JIRA:MWREADY-91 CMS: PIC completed the dCache 2.13.12 verification and has now switched to dCache 2.14.5.
  • JIRA:MWREADY-97 ATLAS: BRUNEL and GRIF-IRFU completed the BDII 5.2.23 verification on CentOS 7. EGI is releasing this BDII version in UMD.
  • JIRA:MWREADY-99 ATLAS & CMS: FTS 3.4.0 verification using the FTS CERN pilot is ongoing.
  • JIRA:MWREADY-102 CMS: PIC started the dCache 2.14.4 verification. They reported a problem to dCache, which has already been fixed, and they have now installed 2.14.5.
  • JIRA:MWREADY-103 ATLAS: TRIUMF to start the dCache 2.10.47 verification.

In order to push for the transition from lcg-utils to gfal2/gfal2-util (lcg-utils has been deprecated for 2 years), we have also started to discuss with ATLAS & CMS the usage of gfal2 in production. Still, 99% of the sites are using lcg-utils for stage-in/out of data. We therefore decided to start verifications of gfal2/gfal2-util (a migration sketch follows the list below):

  • JIRA:MWREADY-100 ATLAS: Napoli will verify gfal2 and gfal2-util. They already did some tests demonstrating that everything works fine.
  • JIRA:MWREADY-101 CMS: GRIF is a good candidate to verify gfal2 and gfal2-util.
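
As mentioned above, a sketch of what the migration means for a typical stage-out: an lcg-cp invocation maps onto a single filecopy call in the gfal2 Python bindings (or a gfal-copy invocation with gfal2-util). The paths and SURLs are placeholders.

    import gfal2

    # old: lcg-cp file:///scratch/job/output.root srm://se.example.org/vo/output.root
    ctx = gfal2.creat_context()
    params = ctx.transfer_parameters()
    params.overwrite = True  # optional: allow replacing an existing destination file

    ctx.filecopy(params,
                 'file:///scratch/job/output.root',      # hypothetical local file
                 'srm://se.example.org/vo/output.root')  # hypothetical SURL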

Reminder: next meeting on January 20th 2016 at 4 pm CET. Agenda: http://indico.cern.ch/e/MW-Readiness_15

Multicore Deployment

Network and Transfer Metrics WG


RFC proxies

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing new to report. The existing code for automating monitoring based on GOCDB/OIM registration had broken, but it has been fixed again.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any WLCG service that does not yet work OK with the new algorithm; otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets were opened for incorrect SRM and MyProxy certificates, most of them already closed. OSG and EGI were contacted (Maarten also alerted the few affected EGI sites). On Dec 17 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). |
| 2015-10-01 | Define a procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | ONGOING | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | CREATED | 1) create a twiki, 2) ask T0/T1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system |

Julia explains why implementing a Google calendar for future downtimes in the SSB is far from trivial. A reasonable compromise is to have a solution implemented in GOCDB and to use the current simple links for the OSG Tier-1s. It is then agreed to close this action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to do some time ago. Next year they will be contacted to define a reasonable timescale.

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 2015-11-05 | ATLAS would like to ask sites to provide consistency checks of storage dumps. More information and More details | ATLAS | - | Status not clear at the 2015-12-03 Ops Coord meeting (ATLAS absent) | None | CLOSED |

This action is closed, as it is being managed internally within ATLAS operations.

AOB

Maria mentions that Andrea S. will no longer work in WLCG operations coordination as of next year. She thanks Maite and Andrea for their contributions to WLCG operations coordination.

-- MariaAlandesPradillo - 2015-12-15
