pre-GDB about Batch Systems, CERN, May 2015

Agenda

https://indico.cern.ch/event/319821/

Sites present: KIT, IN2P3, Liverpool, RAL, CERN, CNAF, GRIF, CSCS, Napoli, Poland, CMS, Lucarno, DESY, NDGF, PIC

Site experiences

Liverpool: migrated to ARC + HTCondor

  • HTCondor doesn't work out of the box but works very well for single-core and multicore jobs
    • Borrowed a lot from RAL's experience
  • Experience written up on a web site: no standard config yet
  • Issue 1: draining for multicore. 2 approaches tested: the defrag daemon and DrainBoss (a local development)
  • Issue 2: accounting and fairshare assignment. Facilities are present in HTCondor but require local configuration
  • Issue 3: many security details
  • Issue 4: MW integration (LCMAPS integration, creation of LSF files...). Various work done locally.

CERN

  • Using the Python bindings for pool management and accounting
    • Also includes management of the dynamic provisioning workflow
  • Accounting DB built from the job history and sent to APEL (see the sketch after this list)
    • Allows a normalization per job
  • Several monitoring sensors developed and integrated into the ELK infrastructure
  • Grid pilot open to ATLAS and CMS
  • Issue 1: how to implement VOViews with ARC; ongoing discussion with the developers
  • Issue 2: security configuration "unusual"
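As an illustration of this kind of workflow, a minimal sketch using the HTCondor Python bindings; the MachineAttrScalingFactor0 attribute and the normalization logic are assumptions for illustration, not CERN's actual tool:

    #!/usr/bin/env python
    # Sketch: build per-job accounting records from the schedd history and
    # apply a per-WN normalization factor. 'MachineAttrScalingFactor0' is
    # hypothetical: it assumes the WN advertises a 'ScalingFactor' attribute
    # recorded into the job ad via SYSTEM_JOB_MACHINE_ATTRS = ScalingFactor.
    import htcondor

    schedd = htcondor.Schedd()
    attrs = ['ClusterId', 'ProcId', 'Owner', 'RemoteUserCpu',
             'RemoteWallClockTime', 'MachineAttrScalingFactor0']
    # JobStatus == 4 selects completed jobs; fetch at most 1000 records here.
    for ad in schedd.history('JobStatus == 4', attrs, 1000):
        factor = float(ad.get('MachineAttrScalingFactor0', 1.0))
        record = {
            'job': '%s.%s' % (ad.get('ClusterId'), ad.get('ProcId')),
            'owner': ad.get('Owner'),
            'cpu_norm': float(ad.get('RemoteUserCpu', 0)) * factor,
            'wall_norm': float(ad.get('RemoteWallClockTime', 0)) * factor,
        }
        print(record)  # in a real tool: insert into the accounting DB / APEL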

PIC: still running Torque/MAUI, in the process of migrating to HTCondor

  • Both grid and local users
  • Happy with Torque/MAUI features, including multicore job support with mcfloat, but has concerns about scalability and security updates
  • HTCondor chosen as the alternative: open source, very responsive developers, a lot of experience/expertise in WLCG
  • Migration plan: pilot + migration of grid resources (T1/T2) then pilot + migration of local resources
  • Currently: small test instance, with central daemons in HA mode and accounting groups
    • Everything puppetized
  • Also started to play with ARC CE
  • Work in collaboration with CERN: regular meetings
    • Extend to other sites?

GRIF: 2 of its 6 subsites already migrated to HTCondor, 1 with ARC, the other with CREAM

  • Main concern: BDII publication. An obsolete component that does not properly handle multicore/partitionable slots
  • 1 other subsite migrating
  • No real showstopper but still hardening the configuration

CSCS: migrated from PBSPro to SLURM 1 year ago, initially with CREAM and now with ARC

  • No major issue migrating
  • Investigating cgroups
  • HA configuration
  • 1 SLURM instance on site

Nebraska: mainly HTCondor

  • SLURM used for the non-grid cluster, in particular for HPC hardware with InfiniBand

KIT: migrated a few years ago from PBSPro to GE

  • Very stable since then
  • Main concern: properly scaling the accounting across the different kinds of WN hardware
  • Looking at the ARC CE but would also be interested to hear about the HTCondor CE

CNAF: using LSF, still happy, lots of local tools

  • Also satisfied by CREAM CE, even though it is sensitive to misbehaviour from submitters
  • The version currently used does not support cgroups: an upgrade is needed

DESY: Zeuthen migrated from Torque/MAUI to UGE

  • WLCG T2 + IceCube T1
  • Zeuthen has long experience with GE
  • Smooth operation except a problem with cgroups (kernel bug introduced by a security update)
  • Fair reservation of multicore job slots still an issue

Accounting (J. Gordon)

  • Want to stress importance of including accounting in batch system validation, including new major versions
  • Cannot expect everything to always work off the shelf...
  • ARC issues
  • No use of the BDII for discovering message brokers: one broker hardcoded, no automatic recovery in case of failure, only a script allowing data to be archived for later manual reprocessing. It has already happened that archived data was garbage collected before being published...
  • The ARC CE accounts for everything successfully authenticated, even if it never reaches a batch system: this results in zero-CPU jobs

HTCondor

BDII/GIP

Current usage

  • Job brokering
  • Reporting: mainly installed capacity
  • VOViews: ALICE
  • To be better understood: what exactly is the use case?
  • Don't forget non-LHC VOs: discuss with EGI and OSG
  • A TF in OpsCoord to clarify this? Maybe no need for a TF, the LHC VOs should be able to answer quickly
  • Source of information for job CPU time normalization

GLUE1 and GLUE2: GLUE1 still used by most tools

  • ARC GLUE1 information has many issues: need fixing

HTCondor: the GIP plugin used by the CREAM CE for publishing is unmaintained

  • ARC publishes everything itself: no need for this plugin
  • Difficult to query HTCondor for the maximum wallclock time for jobs
  • Currently a policy expression that is difficult to parse; this may change in the future
  • Patch done by Liverpool for the ARC CE: needs to be contributed back
  • CREAM CE: current status of the plugin uncertain; to be checked with EGI
  • Some patches done by Andrew, happy to share them
  • Start a community effort on GitHub?

Initial configuration and management

CERN: using HEP-Puppet modules (maintained by Oxford/Bristol), including VO configuration

  • Also for ARC CE
  • Some Puppet modules improved: will contribute the improvements back
  • Accounting groups pre-configured: users dynamically mapped to them based on their VO, on the schedd
  • Considering using Python-based ClassAds: the ability to use a Python function as part of a ClassAd expression gives more flexibility on what can be done (see the sketch after this list)
  • Requires 8.3.4
  • Brian: take into consideration that the ClassAd language is designed to not have side effects...
  • Python bindings also used to extract job events and fill the accounting DB and ELK (with Flume rather than Logstash) for monitoring
  • Todd: work in progress to enhance the condor_gangliad daemon so it can dump the collected aggregates into JSON files
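A minimal sketch of what such a Python-backed ClassAd function could look like, assuming the python_invoke mechanism of the 8.3 series; the module name vo_mapping and the mapping table are purely illustrative:

    # vo_mapping.py -- hypothetical module, made visible to the ClassAd
    # evaluator through the config knob
    #   CLASSAD_USER_PYTHON_MODULES = vo_mapping
    # and callable inside an expression as
    #   python_invoke("vo_mapping", "accounting_group", x509UserProxyVOName)
    def accounting_group(vo_name):
        """Map a VO name to an accounting group (illustrative table)."""
        table = {"atlas": "group_atlas", "cms": "group_cms"}
        return table.get(vo_name, "group_other")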

RAL: using Quattor for the configuration of Condor itself + other required services

  • Quattor development coordinated with the community
  • Also using Python bindings to interact with Quattor
  • Complex local policy for mapping users to accounting groups
  • ClassAds about job start/completion sent to ELK, also working on using Google cAdvisor
  • Todd: interested to get feedback on what is useful out-of-the-box for sites

Liverpool: YAIM used mainly to identify gaps in the configuration

Rejecting job submission from a VO: available in dev version 8.3.5

  • SUBMIT_REQUIREMENT: an expression that must be true for a job to be accepted (see the config sketch below)
  • Also the ability to drain (stop matchmaking) but with no VO granularity
  • Dev versions are used in production at several sites before being released: already pretty stable, but new features may still be added or may not be in their final form
  • More like betas/release candidates: CMS tends to always run the latest development release
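A minimal config sketch of such a submit requirement, using the SUBMIT_REQUIREMENT_NAMES mechanism; the rule name BLOCK_SOMEVO and the VO string are placeholders:

    # Schedd configuration: refuse submissions from one VO.
    SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) BLOCK_SOMEVO
    SUBMIT_REQUIREMENT_BLOCK_SOMEVO = (x509UserProxyVOName =!= "somevo")
    SUBMIT_REQUIREMENT_BLOCK_SOMEVO_REASON = "Submissions from somevo are not accepted"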

Brian: how often do sites update their batch systems (not only HTCondor sites)?

  • Liverpool: try to avoid updates, doing them only if really required
  • PIC: used to do it at least once a year
  • NDGF: the expectation is that it is stable enough not to require updates for a long time...
  • CSCS: SLURM updates tended to be a lot of work; should improve in the future
  • Brian: HTCondor offers backward and forward compatibility, allowing rolling updates
  • Generally done without a downtime at most HTCondor CMS sites
  • RAL also did several upgrades without any downtime (just WNs being drained)

Memory limitation

RAL: cgroups used for CPU and memory, as well as PID namespaces and mount-under-scratch

  • Issue with memory cgroups: all jobs are killed by Condor when one job exceeds the physical memory available, not only the bad job

Mattias: EL6 seems not completely appropriate for memory cgroups

  • Todd: a few bugs under EL6 kernels fixed in EL7
  • Brian: still a few bugs in recent kernels causing crashes, even if the rate seems to be decreasing
    • RH 2.6.32 kernels have a lot of things backported from 3.x

Todd: a hybrid approach is to use cgroups only for monitoring memory usage and let HTCondor itself handle jobs over the limit

  • Mattias: but this risks jobs being killed because of their dirty disk pages...

Todd: how much swap do sites have on their WNs?

  • RAL: the same amount as physical memory... only 20% of the swap is usable by the Condor cgroup
  • NDGF: also a rather small swap space. It is possible to limit the swap space per job through cgroups too
  • Todd: a future version of Condor will allow configuring that, by default, a job has no swap space

CERN: currently memory limitation done by HTCondor, looking at cgroups

Multicore jobs and memory limitation: cgroups seem the only viable approach

  • Seems to work fine with HTCondor
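As an illustration of the cgroup-based approach above, a minimal startd config sketch; the 'soft' policy choice and the SYSTEM_PERIODIC_REMOVE expression are illustrative assumptions, not a site's actual configuration:

    # WN (startd) configuration: put each job in a cgroup under the
    # conventional 'htcondor' base.
    BASE_CGROUP = htcondor
    # 'soft' enforces the limit only under memory pressure, 'hard' enforces
    # it strictly, 'none' monitors only.
    CGROUP_MEMORY_LIMIT_POLICY = soft
    # Hybrid variant mentioned by Todd: monitor via cgroups and let an
    # HTCondor policy expression (on the schedd) remove over-limit jobs:
    # SYSTEM_PERIODIC_REMOVE = MemoryUsage > RequestMemory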

Multicore job support

Support is easy with partitionable slots

Draining single-core job slots to obtain multicore slots

  • Stephen Jones:
    • The defrag daemon works, but it needs to be configured with ClassAd expressions, which is not easy
    • DrainBoss: uses the Python API to adapt the draining of slots to the flow of multicore jobs received, with no dependency on job priorities being adjusted
    • Ready to be tried by other sites (disable the defrag daemon), packaged as an RPM
    • Todd: it may be nice for a tool like DrainBoss to be callable through a defrag daemon callout. Possible concern about scalability when using 'condor_q -all': the new 'clustering' option may help reduce the load on large configurations.
  • RAL: using the defrag daemon, giving priority to multicore jobs; works well for us (a config sketch follows this list)
  • Todd: one may also use the START expression to give priority to multicore jobs for a certain period, if there are any, after a multicore slot has been drained on a WN
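A minimal sketch of a defrag daemon configuration of the kind described above; the rates and the 8-core threshold are illustrative assumptions, not a site's actual values:

    # Central manager: run the defrag daemon and drain partitionable slots
    # until whole multicore slots reappear (illustrative rates).
    DAEMON_LIST = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL = 600
    DEFRAG_DRAINING_MACHINES_PER_HOUR = 2.0
    DEFRAG_MAX_CONCURRENT_DRAINING = 10
    DEFRAG_MAX_WHOLE_MACHINES = 20
    # Count a machine as "whole" once 8 cores are free (illustrative).
    DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 8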

Backfilling: not much progress

  • A script is available that allows jobs to pass parameters, with backends for most batch systems
  • But experiments don't pass the parameters, and this is a little bit orthogonal to the "job masonry" done inside the multicore pilot by VOs (in particular CMS and LHCb)
    • Long multicore jobs are desirable as they help to maintain a constant load of multicore jobs
  • Most important: feed back site observations to the VOs so that they can improve their strategy

Fairshare

Fairshare between accounting groups is instantaneous: historical fairshares can be implemented through target share adjustments (a config sketch follows this list)

  • CERN working on a tool to automate it: still work in progress
  • RAL and PIC have seen no problem with instantaneous fairshares
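A minimal sketch of such an instantaneous fairshare setup via dynamic group quotas on the negotiator; the group names and fractions are illustrative:

    # Negotiator configuration: dynamic quotas express target shares
    # between accounting groups (fractions of the pool, illustrative).
    GROUP_NAMES = group_atlas, group_cms, group_other
    GROUP_QUOTA_DYNAMIC_group_atlas = 0.5
    GROUP_QUOTA_DYNAMIC_group_cms = 0.4
    GROUP_QUOTA_DYNAMIC_group_other = 0.1
    # Let busy groups use capacity left idle by others.
    GROUP_ACCEPT_SURPLUS = True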

Andreas (KIT): with UGE, using historical fairshare for only 50% of the decision, to avoid the drawbacks of purely historical fairshares (e.g., a VO prevented from running for several days).

CE: Alternatives to CREAM

ARC CE

ARC experience

  • RAL: smooth operation; the only problem was the way the ARC CE was getting information from the HTCondor history, which caused a scaling issue
    • Now able to use a per-job information file, much faster; part of the new ARC CE release
    • A few minor tricks required for LHCb and ALICE
    • Also support non LHC VOs through DIRAC or WMS (proxy renewal problem on WMS)
    • Accounting: the WN ClassAd carries the scaling factor; the HTCondor accounting information is "fixed" based on it at the end of the job
  • Liverpool: main concern is the way the accounting is published directly with JURA
    • Had an issue where, for a small number of jobs, the accounting info was never published: the problem was thought to be identified (files moved too early to an archive directory), but it recently reappeared...
    • Supporting many non LHC VOs
  • GRIF: would like to work on publishing accounting through APEL box
    • Ability to republish easily
    • Local accounting dashboard
    • Mattias: a contribution would certainly be accepted; in fact NDGF is not publishing directly but through an SGAS DB
    • Andrew: has seen somebody reporting on the ARC CE mailing list already doing it
  • A Glasgow issue with garbage collection may have been caused by a local script
  • Mattias: aware that most complaints about the ARC CE are related to JURA
  • CERN: configuration using puppet module from HEP-Puppet
    • Also a specific tool to process the accounting information per job and do the normalization based on the exact performance of the WN used by the job. Probably shareable.

ARC/ARGUS integration: a concern was raised during the ARGUS meeting in December that the way ARC uses ARGUS causes scalability problems

  • Not seen by "EGI" sites where LCMAPS is used to talk to ARGUS

Mattias: proposal of a workshop for WLCG sites migrating to ARC, in particular on the pilot workflow used in WLCG

  • October? Co-located with a European Condor workshop?

ARC development: a small core team (5 people, 2.5 FTEs) + contributions from the community that are welcome

  • An open-source project
  • Good level of manpower for the current support and the incremental changes
    • A major development would require an additional effort

HTCondor CE

HTCondor CE: designed by OSG as part of the next generation CE effort

  • HTCondor is already used for many things in addition to the batch system: pilot infrastructure, monitoring... The idea is to reuse it also for the CE, reducing the number of different components
  • Idea: use plain Condor with a specific configuration rather than develop a specific product
    • The main features required were already present but had never been put together, including GSI support, the ability to submit to non-HTCondor batch systems...
  • A gateway to the site services but no specific advanced services offered
  • Accounting separate from HTCondor itself: currently integration with GRATIA
  • No BDII support: none planned
  • Scalability tested as pretty good: the limit was due to the batch system behind it
  • No queue name

CERN: had a look at the HTCondor CE at the beginning of this year, but it is currently seen as not matching the European concept of a site CE closely enough

  • May look at it again when this has evolved
  • Mainly a packaging issue...

KIT/PIC: were interested in looking at it, but the lack of support for BDII and APEL was a showstopper; no effort available to work on it

  • Maria: a few HTCondor CEs are seen in the BDII, to be checked

Wrap-Up

Plan a future workshop around the end of the year: European HTCondor workshop and ARC CE

Create a mailing list (e-group) as a channel of communication for everybody interested in these discussions

  • Start with one list for all batch systems and topics
  • This page will be updated when the list is created

Agree on a code hosting place where we could share developments done by sites around HTCondor and ARC CE
