WLCG Operations Coordination Minutes, January 7th 2016

Highlights

  • WLCG workshop: Registration closes on 22nd January.
  • The HTTP TF will be able to close once 90% of the sites have been shown as correctly configured, without interruption, for over a week. The TF opened GGUS tickets to give the sites all the relevant instructions.
  • The Multicore Deployment TF announced that WLCG users should mainly use the Tier1 and Tier2 views, which now use the same data as the production portal (i.e. they include cores).
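
The closing criterion quoted for the HTTP TF can be read as a simple check on a daily time series. The following is only a sketch under stated assumptions: the function name, the daily sampling and the reading of "over a week" as at least 8 consecutive days are illustrative choices, not something defined by the TF.

```python
from typing import Sequence


def tf_can_close(daily_ok_fraction: Sequence[float],
                 threshold: float = 0.90,
                 min_days: int = 8) -> bool:
    """Return True if the most recent `min_days` daily values all reach `threshold`.

    daily_ok_fraction: fraction of sites shown as correctly configured,
                       one value per day, oldest first (hypothetical input).
    threshold:         0.90 corresponds to the 90% quoted in the minutes.
    min_days:          8 is an assumed reading of "over a week".
    """
    if len(daily_ok_fraction) < min_days:
        return False
    # Any day below the threshold inside the window counts as an interruption.
    return all(f >= threshold for f in daily_ok_fraction[-min_days:])


if __name__ == "__main__":
    history = [0.85] + [0.92] * 8  # one bad day followed by eight good days
    print(tf_can_close(history))
```

A dip below the threshold inside the window restarts the count, which is one way to interpret "without interruption".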

Agenda

Attendance

  • local: Maria Alandes (chair), Maria Dimou (minutes), Maarten Litmaath, David Cameron, Andrea Manzi, Jerome Belleman, Julia Andreeva, Gavin McCance, Helge Meinhard, Oliver Keeble, Marian Babik, Xavier Espinal.
  • remote: Michael Ernst, Christoph Wissing, Jeremy Coles, Massimo Sgaravatto, Catherine Biscarat, Alessandra Doria, Gareth Smith, Rob Quick, Ulf Tigerstedt, Alessandra Forti, Di Qinq, Renaud Vernet, Dave Mason, Daniele Bonacorsi, Antonio Yzquierdo, Josep Flix, Zoltan Mathe (LHCb), Javier Sanchez, Federico Melaccio, Anton Gamel, B. Jashal (T2_IN_TIFR).
  • apologies: Vincenzo Spinoso (EGI)

Operations News

  • Andrea Sciaba has stopped working in WLCG Operations. Many thanks for his valuable contribution! Maria Dimou and Maria Alandes will remain part of the Operations Coordination team at CERN, together with Pepe and Alessandra.
  • WLCG workshop: Registration closes on 22nd January.
  • Memory limits for batch queues: At the MB of 27.10.2015, it was decided to put an action on WLCG Operations to produce a set of recipes on how best to configure memory limits for batch queues. Operations Coordination will open a set of GGUS tickets to a selection of sites (mostly T1s and a few T2s). Please be ready to provide the necessary input. Thanks in advance.

Middleware News

Maria Alandes asked whether Redhat released the openldap fixes that we tested successfully. The answer is 'not yet'.

Tier 0 News

  • Condor: 86 kHS06 → 96 kHS06 out of a total of 784 kHS06 since 10 Dec
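
As a small worked example of the figures above, the Condor share of the total can be computed directly. This is only an illustrative sketch; it assumes that the 784 kHS06 quoted is the total against which the Condor capacity is measured, and the helper name is hypothetical.

```python
TOTAL_KHS06 = 784          # total quoted in the minutes
CONDOR_BEFORE_KHS06 = 86   # Condor capacity before the increase
CONDOR_AFTER_KHS06 = 96    # Condor capacity after the increase


def share_percent(part: float, total: float) -> float:
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


if __name__ == "__main__":
    before = share_percent(CONDOR_BEFORE_KHS06, TOTAL_KHS06)
    after = share_percent(CONDOR_AFTER_KHS06, TOTAL_KHS06)
    # Roughly 11% before and 12% after the increase.
    print(f"Condor share: {before:.1f}% -> {after:.1f}%")
```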

DB News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Best wishes for 2016!
  • Normal to high activity levels during the break
    • Thanks to the sites for keeping things in good shape!
    • The first round of the heavy-ion reconstruction finished!
  • CASTOR issues
    • Dec 18: alarm ticket GGUS:118443 because the transfer manager was stuck
      • Fixed later that afternoon, thanks!
    • Dec 31: team ticket GGUS:118554 because of same problem
      • OK again since Jan 1 00:00, thanks!
    • Jan 5: team ticket GGUS:118619 ditto
      • debugged live by the devs
      • root cause was not found yet
  • EOS issues
    • Dec 31: team ticket GGUS:118559 for EOS at CERN
      • Partly due to EOS-ALICE being ~full !
      • Some disk servers were unavailable
      • Mitigated by the admins, thanks!
  • KIT
    • Dec 31: tape SE working again, thanks!

ATLAS

  • Smooth operations over the whole Xmas break, with an almost steady 230-250k parallel running slots.
  • Reprocessing:
    • Almost completely finished the whole reprocessing campaign (around 1.8PB of RAW input data) during the Xmas break.
    • This is quite a remarkable result: in the past, comparable reprocessing campaigns took 4-6 weeks.
    • Thanks to the effort of the sites, which were extremely stable during the Xmas period, and to some experts who made sure that the few issues were quickly understood and solved.
  • FTS3:
    • Another possibly quite dangerous bug.
    • A few lost files (registered in Rucio but not on storage) were noticed on Monday. It took a few days to understand the problem; today an email was sent to the FTS developers.
  • Minor: some sites noticed that a few jobs (event generation using the MadGraph library) were causing trouble on the WNs where they ran.
    • This is because they produce a large amount of output and the output log tarball contained data files.
    • The problem has been understood. It needs a fix in the ATLAS transformation, which can take a few weeks to be done and put in production, so we also decided to add a "safety" check to the pilot, which will catch this problem before it causes trouble on the WNs. This fix will most probably be released within a week to 10 days from now.

Maria Alandes asked what the reasons for the reduction in reprocessing time were. There are several: many more cores, better network performance and improved software quality.

CMS

  • Happy New Year to everyone!
  • Rather high production load over Xmas break
    • Ran more than 100k jobs in parallel on many days
    • HLT (High Level Trigger) contributed a few thousand cores
    • No major issues
  • Tier-0 / PromptRECO
    • Backlog of pending jobs not fully cleared during the break
    • Partly due to a lack of resources at CERN
    • Needed help from experts to provision fresh VMs (GGUS:118546)
  • Tape operations
    • Had a rather long backlog of unapproved tape migrations at FNAL before the Xmas break
      • Sorted out via CMS site contacts
    • Some datasets not moving at RAL

LHCb

  • Activities:
    • Monte Carlo and user analysis.
    • Pre-staging the data for re-stripping is almost finished.

  • Issue:
    • Problem pre-staging files at RRCKI
    • Nickname VOMS attribute cannot be retrieved (GGUS:118361)

There was a discussion about why the above ticket has had no activity since Dec. 16th and has status "On hold". It should be followed up by LHCb offline.

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

Machine/Job Features TF

HTTP Deployment TF

Information System Evolution


  • IS TF meeting scheduled for tomorrow, Friday 8th January (Agenda).
    • Definitions: summary of the proposed definitions and feedback from sys admins.
    • Status of new IS: news on the feedback given so far by experiments.
    • Preparation for the WLCG workshop discussion about the IS.

IPv6 Validation and Deployment TF


Middleware Readiness WG


The JIRA dashboard shows per experiment and per site the product versions pending for Readiness verification. Changes since the Ops Coord. meeting of Dec. 17th are few due to the year end holidays. Details:

Multicore Deployment

  • Accounting:
    • John Gordon update: The default EGI view has not changed, but WLCG users should mainly use the Tier1 and Tier2 views (e.g. http://accounting.egi.eu/tier1.php), which now use the same data as the production portal (i.e. they include cores). The EMI3(WLCG) view also includes cores and would be useful for an integrated view of a country, including its Tier1, Tier2, Tier3 and other sites.
    • On the ATLAS side, work is ongoing to compare accounting records in the dashboard and in APEL, site by site for the T1s and region by region for the T2s.

Network and Transfer Metrics WG


  • Stable operations of the perfSONAR pipeline (collector, datastore, publisher and dashboard); a minor instability in the dashboard was reported yesterday and is being followed up by OSG.
  • Additional monitoring metrics will be added to psomd.grid.iu.edu to capture the collector's efficiency and report on the freshness of the metadata in the OSG Datastore (for each sonar).
  • Proposed re-organization of the WG meetings: split into two areas, perfSONAR operations (throughput calls) and research/pilot projects
    • perfSONAR operations - main scope would be to continue with perfSONAR support, follow up on the existing infrastructure while at the same time start looking into issues already shown by the existing tools and try to fix them at the source. As this scope is well aligned with the existing North American throughput calls, we could alternate the meetings and publish common notes.
    • Research/pilot projects - will have separate on-demand meetings with notes published to WG mailing list
    • F2F meeting once a year, co-located with GDB or other workshop/conference
  • Pilot projects: LHCb DIRAC bridge available online

RFC proxies

  • NTR

Squid Monitoring and HTTP Proxy Discovery TFs

  • NTR

Action list

| Creation date | Description | Responsible | Status | Comments |
| 2015-06-04 | Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Maarten | ONGOING | GGUS:114076 is now closed. However, host certificates need to be fixed for any service in WLCG that does not yet work OK with the new algorithm. Otherwise we will get hit early next year when the change finally comes in Globus 6.1. 16 GGUS tickets were opened for incorrect SRM and Myproxy certificates, most already closed. OSG and EGI contacted (Maarten alerted the few affected EGI sites as well). On Jan 7 there are 2 tickets open: GGUS:117043 for CNAF (in progress) and GGUS:118371 for FNAL (in progress). Maarten will follow up on the progress of these tickets. They will be mentioned at the 3pm Ops call on Jan 11th. |
| 2015-10-01 | Define procedure to avoid concurrent OUTAGE downtimes at Tier-1 sites | SCOD team | CLOSE & Open New | A Google calendar is not yet available, therefore the only way for the moment is to check GOCDB and OIM for downtimes using the links that will be provided for convenience in the minutes of the 3 o'clock operations meeting. Julia explains why implementing in the SSB a Google calendar also for future downtimes is not trivial at all. A reasonable compromise is to have a solution implemented in GOCDB and use the current simple links for the OSG T1s. It is then agreed to close the action and open one for the GOCDB team to implement the feature in GOCDB, as they already agreed to some time ago. Next year they will be contacted to define a reasonable timescale. At the Jan 7th meeting, Maria Alandes reported that she is in touch with GOCDB and more news will hopefully come next week. |
| 2015-12-17 | Recommend site configurations to enforce memory limits on jobs | | CREATED | 1) create a twiki, 2) ask T0/1 sites and possibly others to describe their configurations, 3) derive recommendations for each batch system. Comment from the 2016-01-07 Ops Coord meeting: The existing twiki BSPassingParameters will be enhanced to contain the recommended memory limit values per batch system, as requested by the MB. The details will be discussed offline between Marias A. & D., Maarten and Alessandra F. Status of Jan 12th: A new twiki BatchSystemsConfig was finally decided as a better idea. Tickets opened. |
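
The manual check for concurrent OUTAGE downtimes described in the 2015-10-01 action amounts to an interval-overlap test. The following is a minimal sketch under stated assumptions: the `Downtime` record, the site names and the dates are hypothetical, and the data is supplied by hand rather than read from GOCDB or OIM, whose interfaces are not described in these minutes.

```python
from datetime import datetime
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class Downtime(NamedTuple):
    """Hypothetical record for one OUTAGE downtime at one site."""
    site: str
    start: datetime
    end: datetime


def concurrent_outages(downtimes: Sequence[Downtime]) -> List[Tuple[Downtime, Downtime]]:
    """Return every pair of downtimes at different sites whose intervals overlap."""
    clashes = []
    for a, b in combinations(downtimes, 2):
        # Two half-open intervals overlap if each one starts before the other ends.
        if a.site != b.site and a.start < b.end and b.start < a.end:
            clashes.append((a, b))
    return clashes


if __name__ == "__main__":
    planned = [
        Downtime("T1_A", datetime(2016, 1, 11, 8), datetime(2016, 1, 11, 18)),
        Downtime("T1_B", datetime(2016, 1, 11, 12), datetime(2016, 1, 12, 12)),
        Downtime("T1_C", datetime(2016, 1, 13, 8), datetime(2016, 1, 13, 10)),
    ]
    for first, second in concurrent_outages(planned):
        print(f"Concurrent outage: {first.site} and {second.site}")
```

The pairwise comparison is quadratic in the number of downtimes, which should be acceptable for the small number of Tier-1 sites involved.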

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

AOB

-- MariaDimou - 2016-01-05

Topic revision: r41 - 2018-02-28 - MaartenLitmaath
 