pre-GDB about Batch Systems, CERN, May 2015
Agenda
https://indico.cern.ch/event/319821/
Sites present: KIT, IN2P3, Liverpool, RAL, CERN, CNAF, GRIF, CSCS, Napoli, Poland, CMS, Lucarno, DESY, NDGF, PIC
Site experiences
Liverpool: migrated to ARC + HTCondor
- HTCondor doesn't work out of the box but works very well for single and multicore jobs
- Borrowed a lot from RAL experience
- Experience written on a web site: no standard config yet
- Issue1: draining for multicore. 2 approaches tested: defrag daemon and DrainBoss (local development)
- Issue2: accounting and fairshare assignment. Facilities present in HTCondor but require local configuration
- Issue3: many security details
- Issue4: MW integration (LCMAPS integration, creation of LSF files...). Various work done locally.
CERN
- Using Python bindings for pool management and accounting
- Also includes management of the dynamic provisioning workflow
- Accounting db built from job history and sent to APEL
- Allows a normalization per job (see the sketch below)
- Several monitoring sensors developed and integrated into the ELK infra
- Grid pilot open to ATLAS and CMS
- Issue1: how to implement VOViews with ARC, ongoing discussion with developers
- Issue2: security configuration "unusual"
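A minimal sketch of the per-job normalization idea with the HTCondor Python bindings (not CERN's actual tool; the MachineAttrHEPSPEC0 attribute is an assumption and requires SYSTEM_JOB_MACHINE_ATTRS to record it):

    import htcondor

    schedd = htcondor.Schedd()
    # Scan completed jobs (JobStatus == 4) in the schedd history.
    for ad in schedd.history('JobStatus == 4',
                             ['GlobalJobId', 'RemoteUserCpu',
                              'MachineAttrHEPSPEC0'], 100):
        cpu = float(ad.get('RemoteUserCpu', 0.0))
        # Per-WN scaling factor recorded in the job ad (assumed name).
        factor = float(ad.get('MachineAttrHEPSPEC0', 1.0))
        print(ad['GlobalJobId'], cpu * factor)  # normalized CPU time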
PIC: still running Torque/MAUI, in progress of migrating to HTCondor
- Both grid and local users
- Happy with Torque/MAUI features, including multicore job support with mcfloat, but concerns about scalability and security updates
- HTCondor chosen as the alternative: open source, very responsive developers, a lot of experience/expertise in WLCG
- Migration plan: pilot + migration of grid resources (T1/T2) then pilot + migration of local resources
- Currently: small test instance, with central daemons in HA mode and accounting groups
- Also started to play with ARC CE
- Work in collaboration with CERN: regular meetings
GRIF: 2 subsites out of 6 already migrated to HTCondor, 1 with ARC, the other with CREAM
- Main concern: BDII publication. An obsolete component that does not properly handle multicore/partitionable slots
- 1 other subsite migrating
- No real showstopper but still hardening the configuration
CSCS: migrated from PBSPro to SLURM 1 year ago, initially with CREAM and now with ARC
- No major issue migrating
- Investigating cgroups
- HA configuration
- 1 SLURM instance on site
Nebraska: mainly HTCondor
- SLURM used for the non-grid cluster, in particular for HPC HW with InfiniBand
KIT: migrated a few years ago from PBSPro to GE
- Very stable since then
- Main concern: scaling the accounting properly with different kinds of HW for WNs
- Looking at ARC CE but would also be interested to hear about HTCondor CE
CNAF: using LSF, still happy, lots of local tools
- Also satisfied by CREAM CE, even though it is sensitive to misbehaviour from submitters
- Version currently used does not support cgroups: upgrade needed
DESY: Zeuthen migrated from Torque/MAUI to UGE
- WLCG T2 + IceCube T1
- Zeuthen has long experience with GE
- Smooth operation except a problem with cgroups (kernel bug introduced by a security update)
- Fair reservation of multicore job slots still an issue
Accounting (J. Gordon)
- Want to stress importance of including accounting in batch system validation, including new major versions
- Cannot expect everything to always work off the shelf...
- ARC issues
- No use of the BDII for discovering message brokers: one is hardcoded, with no automatic recovery in case of failure; only a script allows archiving data for later manual reprocessing. It has already happened that archived data was garbage collected before being published...
- The ARC CE accounts for everything successfully authenticated, even if it never reaches a batch system: this results in zero-CPU jobs
HTCondor
BDII/GIP
Current usage
- Job brokering
- Reporting: mainly installed capacity
- VOViews: ALICE
- To be better understood: what exactly is the use case?
- Don't forget non LHC VOs: discuss with EGI and OSG
- A TF in OpsCoord to clarify this? Maybe no need for a TF, the LHC VOs should be able to answer quickly
- Source of information for job CPU time normalization
GLUE1 and GLUE2: GLUE1 still used by most tools
- ARC GLUE1 information has many issues: need fixing
HTCondor: the GIP plugin used by CREAM CE for publishing is unmaintained
- ARC publishes everything itself, no need for this plugin
- Difficult to query HTCondor for maximum WC time for jobs
- Currently a policy expression that is difficult to parse, may change in the future
- Patch done by Liverpool for ARC CE: need to contribute it back
- CREAM CE: current status of the plugin uncertain. Has to be checked with EGI.
- Some patches done by Andrew, happy to share them
- Start a community effort on GitHub?
Initial configuration and management
CERN: using HEP-Puppet modules (maintained by Oxford/Bristol), including VO configuration
- Also for ARC CE
- Some Puppet modules improved: will contribute the improvements back
- Accounting groups pre-configured: users dynamically mapped based on their VO, on the schedd
- Considering using Python-based classads: the ability to use a Python function as part of a classad, giving more flexibility on what can be done
- Requires 8.3.4
- Brian: take into consideration that the ClassAd language is designed to be free of side effects...
- Python bindings also used to extract job events and fill the accounting db and EK (with Flume rather than Logstash) for monitoring (see the sketch below)
- Todd: work in progress to enhance the condor_gangliad daemon so it can dump the collected aggregates to JSON files
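A minimal sketch of the job-event extraction idea (not CERN's tool): dump job ClassAds as JSON documents that a shipper such as Flume could index:

    import json
    import htcondor

    schedd = htcondor.Schedd()
    attrs = ['GlobalJobId', 'Owner', 'JobStatus', 'RemoteUserCpu']
    # Emit one JSON document per running job (JobStatus == 2).
    for ad in schedd.query('JobStatus == 2', attrs):
        print(json.dumps({a: ad.get(a) for a in attrs}))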
RAL: using Quattor for the configuration of Condor itself + other required services
- Quattor development coordinated with the community
- Also using the Python bindings to interact with Condor
- Complex local policy for mapping users to accounting groups
- Classads about job start/completion sent to ELK, also working on using Google cAdvisor
- Todd: interested to get feedback on what is useful out-of-the-box for sites
Liverpool: YAIM used mainly to identify the gaps in the configuration
Rejecting job submission from a VO: available in dev version 8.3.5
- SUBMIT_REQUIREMENT: an expression that must be true for a job to be accepted (see the sketch below)
- Also the ability to drain (stop matchmaking) but with no VO granularity
- Dev versions are used in prod at several sites before being released: already pretty stable, but new features may still be added or not yet be in final form
- More like beta/release candidates: CMS tends to always run the latest development release
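A minimal sketch of the kind of expression such a requirement could hold (the knobs are SUBMIT_REQUIREMENT_NAMES and SUBMIT_REQUIREMENT_<Name>), evaluated here with the classad Python module against a fake job ad; the "badvo" value is made up:

    import classad

    job = classad.ClassAd()
    job['x509UserProxyVOName'] = 'atlas'
    # Expression a SUBMIT_REQUIREMENT could use to ban a VO.
    job['AcceptJob'] = classad.ExprTree('x509UserProxyVOName =!= "badvo"')
    print(job.eval('AcceptJob'))  # True: this job would be accepted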
Brian: how often do sites update their batch systems (not only HTCondor sites)?
- Liverpool: try to avoid updates, doing them only if really required
- PIC: used to do it at least once a year
- NDGF: expectation is that it is stable enough for not requiring updates during a long time...
- CSCS: SLURM updates tended to be a lot of work, should improve in the future
- Brian: HTCondor offers backward and forward compatibility, allowing for rolling updates
- Generally done without a downtime at most HTCondor CMS sites
- RAL also did several upgrades without any downtime (just WN being drained)
Memory limitation
RAL: cgroups used for CPU and memory, plus PID namespaces and mounts under scratch
- Issue with memory cgroups: when a job exceeds the available physical memory, Condor kills all the jobs on the node, not only the bad one
Mattias: EL6 seems not completely appropriate for memory cgroups
- Todd: a few bugs under EL6 kernels fixed in EL7
- Brian: still a few bugs in recent kernels causing crashes, even if the rate seems to reduce
- RH 2.6.32 kernels have a lot of things backported from 3.x
Todd: a hybrid approach is to use cgroups for monitoring memory usage and let HTCondor itself handle jobs over the limit (see the sketch below)
- Mattias: but there is a risk of jobs being killed because of their dirty disk pages...
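A minimal sketch of the hybrid idea, assuming cgroup v1 and a made-up cgroup path (the layout depends on site configuration): read the usage and leave the kill/hold decision to HTCondor:

    # Hypothetical cgroup path for one job slot.
    cg = '/sys/fs/cgroup/memory/htcondor/slot1'
    with open(cg + '/memory.usage_in_bytes') as f:
        usage = int(f.read())
    limit = 2048 * 1024 ** 2  # hypothetical 2 GB per-job limit
    if usage > limit:
        print('over limit: let HTCondor put the job on hold')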
Todd: how much swap on WNs at sites
- RAL: same amount as phys memory... only 20% of the swap usable by the Condor cgroup
- NDGF: also a rather small swap space. Possible to limit the swap space per job through cgroups too
- Todd: in a future version, Condor will allow configuring that by default a job has no swap space
CERN: currently memory limitation done by HTCondor, looking at cgroups
Multicore jobs and memory limitation: cgroups seem the only viable approach
- Seems to work fine with HTCondor
Multicore job support
Support is easy with partitionable slots
Draining single-job slots to obtain multicore slots
- Stephen Jones:
- the defrag daemon works, but configuring it with ClassAds is not easy
- DrainBoss: uses the Python API to match the draining effort to the multicore jobs actually received, with no dependency on job priorities being adjusted (see the sketch after this list)
- Ready to be tried by other sites (disable the defrag daemon first), packaged as an RPM
- Todd: may be nice for a tool like DrainBoss to be callable through a defrag daemon callout. Possible concern about scalability when using 'condor_q -all': the new 'clustering' option may help reduce the load on large configurations.
- RAL: using defrag daemon, giving the priority to the multicore jobs, works well for us
- Todd: may also use the START expression to give priority to multicore jobs for a certain period if there are any, after a multicore slot has been drained on a WN
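A minimal sketch of a DrainBoss-like decision with the HTCondor Python bindings (not the actual tool; the node picked for draining is hypothetical):

    import subprocess
    import htcondor

    schedd = htcondor.Schedd()
    coll = htcondor.Collector()
    # Idle jobs (JobStatus == 1) asking for more than one CPU.
    idle_mc = len(schedd.query('JobStatus == 1 && RequestCpus > 1',
                               ['ClusterId']))
    # Slot ads that are already draining.
    draining = coll.query(htcondor.AdTypes.Startd,
                          'Draining =?= true', ['Machine'])
    # Drain one more node only if demand exceeds what is being freed.
    if idle_mc > len(draining):
        subprocess.run(['condor_drain', '-graceful', 'wn001.example.org'])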
Backfilling: not much progress
- A script is available that allows jobs to pass parameters to the backend for most batch systems
- But experiments don't pass the parameters, and this is a little bit orthogonal to the "job masonry" done inside the multicore pilot by VOs (in particular CMS and LHCb)
- Long multicore jobs are desirable as they help to maintain a constant load of multicore jobs
- Most important: feed back site observations to the VOs so that they can improve their strategy
Fairshare
Fairshare between accounting groups is instantaneous: historical fairshares implementable through target share adjustments
- CERN working on a tool to automate it: still work in progress (see the sketch below)
- RAL and PIC have seen no problem with instantaneous fairshares
Andreas (KIT): with UGE, using historical fairshare for only 50% of the decision, to avoid the drawbacks of historical fairshares (e.g. a VO forbidden to run during several days).
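A minimal sketch of one possible target-share adjustment rule (made-up shares and usage; the result could feed GROUP_QUOTA_DYNAMIC_<group>): groups behind their nominal share get boosted:

    nominal = {'atlas': 0.5, 'cms': 0.3, 'lhcb': 0.2}  # made-up shares
    used = {'atlas': 0.60, 'cms': 0.25, 'lhcb': 0.15}  # usage so far
    # Weight each group by nominal/used: underused groups get more.
    raw = {g: nominal[g] * nominal[g] / max(used[g], 1e-6) for g in nominal}
    total = sum(raw.values())
    print({g: round(w / total, 3) for g, w in raw.items()})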
CE: Alternative to CREAM
ARC CE
ARC experience
- RAL: smooth operation, the only issue was the way the ARC CE was getting information from the HTCondor history, causing a scaling issue
- Now able to use a per-job information file, much faster, part of the new ARC CE release
- A few minor tricks required for LHCb and ALICE
- Also support non LHC VOs through DIRAC or WMS (proxy renewal problem on WMS)
- Accounting: the WN classad carries the scaling factor, and the HTCondor accounting information is "fixed" based on it at the end of the job
- Liverpool: main concern is the way the accounting is published directly with JURA
- Had an issue where the accounting info was never published for a small number of jobs: the problem was thought to be identified (files moved too early to an archive directory) but it recently occurred again...
- Supporting many non LHC VOs
- GRIF: would like to work on publishing accounting through an APEL box
- Ability to republish easily
- Local accounting dashboard
- Mattias: contribution would certainly be accepted, in fact NDGF is not publishing directly but through a SGAS db
- Andrew: has seen somebody reporting on the ARC CE mailing list already doing it
- Glasgow issue with garbage collection may have been caused by a local script
- Mattias: aware that most complaints about the ARC CE are related to JURA
- CERN: configuration using puppet module from HEP-Puppet
- Also the specific tool to process the accounting information per job and do the normalization based on the exact performance of the WN used by the job. Probably shareable.
ARC/ARGUS integration: a concern was raised during the ARGUS meeting in December that the way ARC uses ARGUS was causing scalability problems
- Not seen by "EGI" sites where LCMAPS is used to talk to ARGUS
Mattias: proposal of a workshop for WLCG sites migrating to ARC, in particular for the pilot workflow used in WLCG
- October? Co-located with a European Condor workshop?
ARC development: a small core team (5 people, 2.5 FTEs) + contributions from the community, which are welcome
- An open-source project
- Good level of manpower for the current support and the incremental changes
- A major development would require an additional effort
HTCondor CE
HTCondor CE: designed by OSG as part of the next generation CE effort
- HTCondor is already used for many things in addition to the batch system: pilot infrastructure, monitoring... The idea is to reuse it also for the CE, reducing the number of different components
- Idea: use plain Condor with a specific configuration rather than develop a specific product
- The main features required were already present but had never been put together, including GSI support, the ability to submit to non-HTCondor batch systems...
- A gateway to the site services but no specific advanced services offered
- Accounting separate from HTCondor itself: currently integration with GRATIA
- No BDII support: none planned
- Scalability tested as pretty good: the limit was due to the batch system behind it
- No queue name
CERN: had a look at the HTCondor CE at the beginning of this year, but it is currently seen as not matching the European concept of a site CE well enough
- May look at it again when it has evolved
- Mainly a packaging issue...
KIT/PIC: were interested in looking at it but the lack of support for BDII and APEL was a showstopper; no effort available to look at it
- Maria: a few HTCondor CEs seen in the BDII, to be checked
Wrap-Up
Plan a future workshop around the end of the year: European HTCondor workshop and ARC CE
Create a mailing list (egroup) as a channel of communication between everybody interested in these discussions
- Start with one list for all batch systems and topics
- This page will be updated when the list is created
Agree on a code hosting place where we could share developments done by sites around HTCondor and ARC CE