pre-GDB about Batch Systems, CERN, May 2015
Agenda
https://indico.cern.ch/event/319821/
Sites present: KIT, IN2P3, Liverpool, RAL, CERN, CNAF, GRIF, CSCS, Napoli, Poland, CMS, Lucarno, DESY, NDGF, PIC
Site experiences
Liverpool: migrated to ARC + HTCondor
- HTCondor doesn't work out of the box but works very well for single and multicore jobs
- Borrowed a lot from RAL experience
- Experience written on a web site: no standard config yet
- Issue1: draining for multicore. 2 approaches tested: defrag daemon and DrainBoss (local development)
- Issue2: accounting and fairshare assignment. Facilities present in HTCondor but require local configuration
- Issue3: many security details
- Issue4: MW integration (LCMAPS integration, creation of LSF files...). Various work done locally.
CERN
- Using Python bindings for pool management and accounting
- Also includes management of the dynamic provisioning workflow
- Accounting db built from job history and sent to APEL
- Allows a normalization per job (see the sketch below)
- Several monitoring sensors developed and integrated into the ELK infra
- Grid pilot open to ATLAS and CMS
- Issue1: how to implement VOViews with ARC, ongoing discussion with developers
- Issue2: security configuration "unusual"
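A minimal sketch of the per-job normalization idea with the HTCondor Python bindings (not CERN's actual tool; the MachineAttrHEPSPEC0 attribute is an assumption and requires SYSTEM_JOB_MACHINE_ATTRS to record it):

    import htcondor

    schedd = htcondor.Schedd()
    # Scan completed jobs (JobStatus == 4) in the schedd history.
    for ad in schedd.history('JobStatus == 4',
                             ['GlobalJobId', 'RemoteUserCpu',
                              'MachineAttrHEPSPEC0'], 100):
        cpu = float(ad.get('RemoteUserCpu', 0.0))
        # Per-WN scaling factor recorded in the job ad (assumed name).
        factor = float(ad.get('MachineAttrHEPSPEC0', 1.0))
        print(ad['GlobalJobId'], cpu * factor)  # normalized CPU time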
PIC: still running Torque/MAUI, in progress of migrating to HTCondor
- Both grid and local users
- Happy with Torque/MAUI features, including multicore job support with mcfloat, but concerns about scalability and security updates
- HTCondor chosen as the alternative: open source, very responsive developers, a lot of experience/expertise in WLCG
- Migration plan: pilot + migration of grid resources (T1/T2) then pilot + migration of local resources
- Currently: small test instance, with central daemons in HA mode and accounting groups
- Also started to play with ARC CE
- Work in collaboration with CERN: regular meetings
GRIF: 2 subsites out of 6 already migrated to HTCondor, 1 with ARC, the other with CREAM
- Main concern: BDII publication. An obsolete component that does not properly handle multicore/partitionable slots
- 1 other subsite migrating
- No real showstopper but still hardening the configuration
CSCS: migrated from PBSPro to SLURM 1 year ago, initially with CREAM and now with ARC
- No major issue migrating
- Investigating cgroups
- HA configuration
- 1 SLURM instance on site
Nebraska: mainly HTCondor
- SLURM used for the non-grid cluster, in particular for HPC HW with InfiniBand
KIT: migrated a few years ago from PBSPro to GE
- Very stable since then
- Main concern: scaling the accounting properly with different kinds of HW for WNs
- Looking at ARC CE but would also be interested to hear about HTCondor CE
CNAF: using LSF, still happy, lots of local tools
- Also satisfied by CREAM CE, even though it is sensitive to misbehaviour from submitters
- Version currently used does not support cgroups: upgrade needed
DESY: Zeuthen migrated from Torque/MAUI to UGE
- WLCG T2 + IceCube T1
- Zeuthen has long experience with GE
- Smooth operation except a problem with cgroups (kernel bug introduced by a security update)
- Fair reservation of multicore job slots still an issue
Accounting (J. Gordon)
- Want to stress importance of including accounting in batch system validation, including new major versions
- Cannot expect everything to always work off the shelf...
- ARC issues
- No use of the BDII for discovering message brokers: one is hardcoded, with no automatic recovery in case of failure; only a script allows archiving data for later manual reprocessing. It has already happened that archived data was garbage collected before being published...
- The ARC CE accounts for everything successfully authenticated, even if it never reaches a batch system: this results in zero-CPU jobs
HTCondor
BDII/GIP
Current usage
- Job brokering
- Reporting: mainly installed capacity
- VOViews: ALICE
- To be better understood: what exactly is the use case?
- Don't forget non LHC VOs: discuss with EGI and OSG
- A TF in OpsCoord to clarify this? Maybe no need for a TF, the LHC VOs should be able to answer quickly
- Source of information for job CPU time normalization
GLUE1 and GLUE2: GLUE1 still used by most tools
- ARC GLUE1 information has many issues: need fixing
HTCondor: the GIP plugin used by CREAM CE for publishing is unmaintained
- ARC publishes everything itself, no need for this plugin
- Difficult to query HTCondor for maximum WC time for jobs
- Currently a policy expression that is difficult to parse, may change in the future
- Patch done by Liverpool for ARC CE: need to contribute it back
- CREAM CE: current status of the plugin uncertain. Has to be checked with EGI.
- Some patches done by Andrew, happy to share them
- Start a community effort on GitHub?
Initial configuration and management
CERN: using HEP-Puppet modules (maintained by Oxford/Bristol), including VO configuration
- Also for ARC CE
- Some Puppet modules improved: will contribute the improvements back
- Accounting groups pre-configured: users dynamically mapped based on their VO, on the schedd
- Considering using Python-based classads: the ability to use a Python function as part of a classad, giving more flexibility on what can be done
- Requires 8.3.4
- Brian: take into consideration that the ClassAd language is designed to be free of side effects...
- Python bindings also used to extract job events and fill the accounting db and EK (with Flume rather than Logstash) for monitoring (see the sketch below)
- Todd: work in progress to enhance the condor_gangliad daemon so it can dump the collected aggregates to JSON files
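A minimal sketch of the job-event extraction idea (not CERN's tool): dump job ClassAds as JSON documents that a shipper such as Flume could index:

    import json
    import htcondor

    schedd = htcondor.Schedd()
    attrs = ['GlobalJobId', 'Owner', 'JobStatus', 'RemoteUserCpu']
    # Emit one JSON document per running job (JobStatus == 2).
    for ad in schedd.query('JobStatus == 2', attrs):
        print(json.dumps({a: ad.get(a) for a in attrs}))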
RAL: using Quattor for the configuration of Condor itself + other required services
- Quattor development coordinated with the community
- Also using the Python bindings to interact with Condor
- Complex local policy for mapping users to accounting groups
- Classads about job start/completion sent to ELK, also working on using Google cAdvisor
- Todd: interested to get feedback on what is useful out-of-the-box for sites
Liverpool: YAIM used mainly to identify the gaps in the configuration
Rejecting job submission from a VO: available in dev version 8.3.5
- SUBMIT_REQUIREMENT: an expression that must be true for a job to be accepted (see the sketch below)
- Also the ability to drain (stop matchmaking) but with no VO granularity
- Dev versions are used in prod at several sites before being released: already pretty stable, but new features may still be added or not yet be in final form
- More like beta/release candidates: CMS tends to always run the latest development release
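A minimal sketch of the kind of expression such a requirement could hold (the knobs are SUBMIT_REQUIREMENT_NAMES and SUBMIT_REQUIREMENT_<Name>), evaluated here with the classad Python module against a fake job ad; the "badvo" value is made up:

    import classad

    job = classad.ClassAd()
    job['x509UserProxyVOName'] = 'atlas'
    # Expression a SUBMIT_REQUIREMENT could use to ban a VO.
    job['AcceptJob'] = classad.ExprTree('x509UserProxyVOName =!= "badvo"')
    print(job.eval('AcceptJob'))  # True: this job would be accepted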
Brian: how often do sites update their batch systems (not only HTCondor sites)?
- Liverpool: try to avoid updates, doing them only if really required
- PIC: used to do it at least once a year
- NDGF: expectation is that it is stable enough for not requiring updates during a long time...
- CSCS: SLURM updates tended to be a lot of work, should improve in the future
- Brian: HTCondor offers backward and forward compatibility, allowing for rolling updates
- Generally done without a downtime at most HTCondor CMS sites
- RAL also did several upgrades without any downtime (just WN being drained)
Memory limitation
RAL: cgroups used for CPU and memory, plus PID namespaces and mounts under scratch
- Issue with memory cgroups: when a job exceeds the available physical memory, Condor kills all the jobs on the node, not only the bad one
Mattias: EL6 seems not completely appropriate for memory cgroups
- Todd: a few bugs under EL6 kernels fixed in EL7
- Brian: still a few bugs in recent kernels causing crashes, even if the rate seems to reduce
- RH 2.6.32 kernels have a lot of things backported from 3.x
Todd: a hybrid approach is to use cgroups for monitoring memory usage and let HTCondor itself handle jobs over the limit (see the sketch below)
- Mattias: but there is a risk of jobs being killed because of their dirty disk pages...
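A minimal sketch of the hybrid idea, assuming cgroup v1 and a made-up cgroup path (the layout depends on site configuration): read the usage and leave the kill/hold decision to HTCondor:

    # Hypothetical cgroup path for one job slot.
    cg = '/sys/fs/cgroup/memory/htcondor/slot1'
    with open(cg + '/memory.usage_in_bytes') as f:
        usage = int(f.read())
    limit = 2048 * 1024 ** 2  # hypothetical 2 GB per-job limit
    if usage > limit:
        print('over limit: let HTCondor put the job on hold')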
Todd: how much swap on WNs at sites
- RAL: same amount as phys memory... only 20% of the swap usable by the Condor cgroup
- NDGF: also a rather small swap space. Possible to limit the swap space per job through cgroups too
- Todd: in a future version, Condor will allow configuring that by default a job has no swap space
CERN: currently memory limitation done by HTCondor, looking at cgroups
Multicore jobs and memory limitation: cgroups seem the only viable approach
- Seems to work fine with HTCondor
Multicore job support
Support is easy with partitionable slots
Draining single-job slots to obtain multicore slots
- Stephen Jones:
- the defrag daemon works, but configuring it with ClassAds is not easy
- DrainBoss: uses the Python API to match the draining effort to the multicore jobs actually received, with no dependency on job priorities being adjusted (see the sketch after this list)
- Ready to be tried by other sites (disable the defrag daemon first), packaged as an RPM
- Todd: may be nice for a tool like DrainBoss to be callable through a defrag daemon callout. Possible concern about scalability when using 'condor_q -all': the new 'clustering' option may help reduce the load on large configurations.
- RAL: using defrag daemon, giving the priority to the multicore jobs, works well for us
- Todd: may also use the START expression to give priority to multicore jobs for a certain period if there are any, after a multicore slot has been drained on a WN
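A minimal sketch of a DrainBoss-like decision with the HTCondor Python bindings (not the actual tool; the node picked for draining is hypothetical):

    import subprocess
    import htcondor

    schedd = htcondor.Schedd()
    coll = htcondor.Collector()
    # Idle jobs (JobStatus == 1) asking for more than one CPU.
    idle_mc = len(schedd.query('JobStatus == 1 && RequestCpus > 1',
                               ['ClusterId']))
    # Slot ads that are already draining.
    draining = coll.query(htcondor.AdTypes.Startd,
                          'Draining =?= true', ['Machine'])
    # Drain one more node only if demand exceeds what is being freed.
    if idle_mc > len(draining):
        subprocess.run(['condor_drain', '-graceful', 'wn001.example.org'])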
Backfilling: not much progress
- A script is available that allows jobs to pass parameters to the backend for most batch systems
- But experiments don't pass the parameters, and this is a little bit orthogonal to the "job masonry" done inside the multicore pilot by VOs (in particular CMS and LHCb)
- Long multicore jobs are desirable as they help to maintain a constant load of multicore jobs
- Most important: feed back site observations to the VOs so that they can improve their strategy
Fairshare
Fairshare between accounting groups is instantaneous: historical fairshares implementable through target share adjustments
- CERN working on a tool to automate it: still work in progress (see the sketch below)
- RAL and PIC have seen no problem with instantaneous fairshares
Andreas (KIT): with UGE, using historical fairshare for only 50% of the decision, to avoid the drawbacks of historical fairshares (e.g. a VO forbidden to run during several days).
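A minimal sketch of one possible target-share adjustment rule (made-up shares and usage; the result could feed GROUP_QUOTA_DYNAMIC_<group>): groups behind their nominal share get boosted:

    nominal = {'atlas': 0.5, 'cms': 0.3, 'lhcb': 0.2}  # made-up shares
    used = {'atlas': 0.60, 'cms': 0.25, 'lhcb': 0.15}  # usage so far
    # Weight each group by nominal/used: underused groups get more.
    raw = {g: nominal[g] * nominal[g] / max(used[g], 1e-6) for g in nominal}
    total = sum(raw.values())
    print({g: round(w / total, 3) for g, w in raw.items()})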
CE: Alternative to CREAM
ARC CE
ARC experience
- RAL: smooth operation, the only issue was the way the ARC CE was getting information from the HTCondor history, causing a scaling issue
- Now able to use a per-job information file, much faster, part of the new ARC CE release
- A few minor tricks required for LHCb and ALICE
- Also support non LHC VOs through DIRAC or WMS (proxy renewal problem on WMS)
- Accounting: the WN classad carries the scaling factor, and the HTCondor accounting information is "fixed" based on it at the end of the job
- Liverpool: main concern is the way the accounting is published directly with JURA
- Had an issue where the accounting info was never published for a small number of jobs: the problem was thought to be identified (files moved too early to an archive directory) but it recently occurred again...
- Supporting many non LHC VOs
- GRIF: would like to work on publishing accounting through an APEL box
- Ability to republish easily
- Local accounting dashboard
- Mattias: contribution would certainly be accepted, in fact NDGF is not publishing directly but through a SGAS db
- Andrew: has seen somebody reporting on the ARC CE mailing list already doing it
- Glasgow issue with garbage collection may have been caused by a local script
- Mattias: aware that most complaints about the ARC CE are related to JURA
- CERN: configuration using puppet module from HEP-Puppet
- Also the specific tool to process the accounting information per job and do the normalization based on the exact performance of the WN used by the job. Probably shareable.
ARC/ARGUS integration: a concern was raised during the ARGUS meeting in December that the way ARC uses ARGUS was causing scalability problems
- Not seen by "EGI" sites where LCMAPS is used to talk to ARGUS
Mattias: proposal of a workshop for WLCG sites migrating to ARC, in particular for the pilot workflow used in WLCG
- October? Co-located with a European Condor workshop?
ARC development: a small core team (5 people, 2.5 FTEs) + contributions from the community, which are welcome
- An open-source project
- Good level of manpower for the current support and the incremental changes
- A major development would require an additional effort
HTCondor CE
HTCondor CE: designed by OSG as part of the next generation CE effort
- HTCondor is already used for many things in addition to the batch system: pilot infrastructure, monitoring... The idea is to reuse it also for the CE, reducing the number of different components
- Idea: use plain Condor with a specific configuration rather than develop a specific product
- The main features required were already present but had never been put together, including GSI support, the ability to submit to non-HTCondor batch systems...
- A gateway to the site services but no specific advanced services offered
- Accounting separate from HTCondor itself: currently integration with GRATIA
- No BDII support: none planned
- Scalability tested as pretty good: the limit was due to the batch system behind it
- No queue name
CERN: had a look at the HTCondor CE at the beginning of this year, but it is currently seen as not matching the European concept of a site CE well enough
- May look at it again when it has evolved
- Mainly a packaging issue...
KIT/PIC: were interested in looking at it but the lack of support for BDII and APEL was a showstopper; no effort available to look at it
- Maria: a few HTCondor CEs seen in the BDII, to be checked
Wrap-Up
Plan a future workshop around the end of the year: European HTCondor workshop and ARC CE
Create a mailing list (egroup) as a channel of communication between everybody interested in these discussions
- Start with one list for all batch systems and topics
- This page will be updated when the list is created
Agree on a code hosting place where we could share developments done by sites around HTCondor and ARC CE