Cream CE Pilot Service: Description and Status.


  • Start Date: 10 Jun 2008
  • End Date: 17 Mar 2009
  • Description: Pilot Service of Cream CE
  • Coordinator: Antonio Retico
  • Contact e-mail: egee-pps-pilot-cream@cern.ch
  • Status : Closed
  • Related meetings

Links Criteria for transition from lcg-CE to CREAM

Description

JRA1, SA3 and SA1 are running a pilot service focused on the new Cream CE in order to collect feedback from the experiments and to accelerate the testing and deployment in production of the new service.

The pilot will be organised in two phases:

  • 1st phase: Some of the PPS sites will be gradually requested to replace their lcg-CE with CREAM. We will start with one site, published in the PPS BDII, and then extend the testbed as needed. The aim of this phase is to fine-tune the installation tools (YAIM and release notes) and to verify that the new services interact correctly with the monitoring tools. In addition, one WMS in PPS will need to be adapted to submit to CREAM CEs. Initially, up to two PPS sites will therefore be needed to support this scenario, growing later to more sites (ideally one per batch system).

  • 2nd phase: to start as soon as the installation is stable and the service has been demonstrated to work and interact correctly with the other components. Some production sites will be asked to add/replace one or more CREAM CEs, to be published with GlueServiceStatus = 'production'. The LHC experiments will be involved in this phase to start a controlled submission of production jobs to the new service.

It is important to point out that this activity is by no means meant to replace the standard certification of the service. The certification will be carried out in parallel in the usual way and in close synergy with the pilot, so that ideally both environments will profit from each other's findings. The site administrators involved will have:

  • to react promptly to possible issues found
  • to keep in touch with JRA1 people
  • to apply the fixes they provide
  • to communicate and keep track of them

An additional 3rd phase has been planned on 1st April.

  • 3rd phase: it has two activities: a) WMS-ICE validation b) performance/acceptance tests for CREAM.

Overall Planning

Phase1 (TASK:7143)

  • Initial plan for phase 1:
    creampilotph1.gif

  • Initial roadmap for the test in PPS:
    1. Set-up of Cream CE on torque at PPS-CNAF (eventually replacing cert-ce-03.cnaf.infn.it) and FZK-PPS (timeline: 1 week)
    2. Enabling ICE at SCAI-PPS (or FZK-PPS as back-up) (timeline: 1 week, in parallel with 1)
    3. Verification/fixing of SAM monitoring chain in PPS (the SAM client at PPS-RAL should switch to use the Cream-enabled WMS) (timeline: 1.5 weeks, starting from first successful job submission)
    4. Extension of the tests to other supported batch systems/platforms. I think that IN2P3-CC-PPS, PIC for LSF and possibly some other sites could get involved here, to be seen.
    5. Getting ready for phase 2 (PIC?, CNAF?, IN2P3?)
      The named CE machines will be exempted from applying the standard PPS updates for the whole duration of the pilot. The WMS, instead, will need special care because the non-standard extra configuration will have to be maintained throughout possible future service upgrades.

Phase2 (TASK:7981)

  • Phase2 - Initial Planning:
    Cream-phase2-initial-planning-081002.gif

Phase2 started on the 1st of October 2008. It is focused on the performance of the ICE WMS.

The objective of Phase2 is to enable CMS users to submit continuously at a rate of 10k jobs/day over 5 weeks.
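As a rough sizing aid, the target above can be converted into a per-minute rate and a total job count. The sketch below only restates the figures given for Phase2 (10k jobs/day, 5 weeks); the function name is illustrative and is not part of any pilot tooling.

```python
def submission_targets(jobs_per_day: int, weeks: int) -> dict:
    """Convert a daily job target into a per-minute rate and a total job count."""
    minutes_per_day = 24 * 60
    days = weeks * 7
    return {
        "jobs_per_minute": jobs_per_day / minutes_per_day,
        "total_jobs": jobs_per_day * days,
    }


targets = submission_targets(jobs_per_day=10_000, weeks=5)
# Roughly 7 jobs/min sustained, 350000 jobs over the whole period.
print(f"{targets['jobs_per_minute']:.2f} jobs/min, {targets['total_jobs']} jobs in total")
```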

The proposed layout for the pilot is

  • 2 WMSs: CNAF and (FZK or SCAI)
  • 1 UI CNAF: cert-ui-01.cnaf.infn.it
  • 1 BDII CNAF: including services in pilot + LCG production
  • 1 VOBOX(if needed): FZK
  • CREAM CEs: FZK (Angela Poschlad), Padova (14 CEs: 7 PBS, 7 LSF; resp. Sara Bertocco), Bari (Giacinto Donvito), CNAF (~10 CEs; resp. Daniele Cesini), SCAI (Klare Cassirer?)
    The CREAM CEs will access production batch systems

Requirements for the software to be deployed in the pilot:

  1. CVS tag defined and yum repository (at CNAF) available
  2. SW documented in patches in state >= "With provider"
  3. SW accompanied by a "certification of deployability" delivered by SA3 Italy (Alessio Gianelle)

Methodology:

  • updates of the SW at sites, if needed, will be applied by SA1 personnel on demand of the developers
  • patches "approved" by the pilot will be set to "ready for certification"

Description of the patches to be expected (provided off-line by Massimo):

  1. new version of ICE to install over a WMS with PATCH:1841
  2. new version of yaim-wms for possible modifications in configuration
  3. a patch of CREAM/CEMon/BLAH (rpms glite-ce* to be installed on CREAM CE)
  4. a patch of yaim-cream-ce
  5. a patch of CREAM-CLI

Patches 3, 4 and 5 could be released earlier to the standard certification track to fix possible issues in the CREAM version now in production.

  • Initial roadmap:
    1. Prepare formal tasks for SA1 participants (timeline: ~3days)
    2. Integrate three new sites in the pilot framework (timeline: ~4days in parallel with 1)
    3. Prepare patches (timeline: ~4 days in parallel with 1)
    4. Upgrade of the sites in the pilot (timeline:~1 week)
    5. Open to users

Phase 2 closed on 31st of April 2009 after the indication of a set of patches to be deployed ASAP in the production system as a pre-condition for phase 3 Activity a) (see PPIslandFollowUp2009x04x01). This issue is followed up in GGUS:47489.

Phase3 (TASK:9504)

Phase three, launched during the meeting of the 1st of April, is composed of two activities:
  • Activity a) WMS-ICE validation; this is the main activity.
  • Activity b) performance/acceptance tests for CREAM.

Activity a)

Activity b)

Considering the points mentioned in the transition plan, and the outcomes of the pilot meeting of the 1st of April, SA3 will focus on a subset of points, initially using the direct submission to the CREAM CE. The points to verify are:

J "At least 5000 simultaneous jobs per CE node". "5000 simultaneous jobs" means that the CE should be able to handle 5000 jobs submitted simultaneously to the CE. The CE should be able to handle them all the way (sending them to the batch system, polling for the status, etc.). But it doesn't mean that all the 5000 jobs will be run at the same time. They probably won't (it depends on how busy the batch system/worker nodes are). So, as long as the 5000 simultaneously submitted jobs are properly treated by the CE, the test is passed.
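The pass criterion for point J can be sketched as a simple check over the job states reported by the CE. This is only an illustration: the state names treated as "not properly handled" and the function name are assumptions, and the actual criterion is the one agreed by the testers.

```python
# Assumption: states that would indicate the CE did not handle a job properly.
BAD_STATES = frozenset({"ABORTED", "UNKNOWN"})


def point_j_passed(submitted_ids, ce_status, required=5000, bad_states=BAD_STATES):
    """Return True if at least `required` jobs were submitted and the CE
    still tracks every one of them in an acceptable state.

    submitted_ids: job IDs returned at submission time.
    ce_status: mapping of job ID -> state reported by the CE.
    Jobs do not need to be running at the same time; queued jobs also count.
    """
    if len(submitted_ids) < required:
        return False
    for job_id in submitted_ids:
        state = ce_status.get(job_id)
        if state is None or state in bad_states:
            return False
    return True


# Small illustrative run with a reduced job count.
ids = [f"job-{i}" for i in range(10)]
states = {job_id: "IDLE" for job_id in ids}
print(point_j_passed(ids, states, required=10))
```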

K "Unlimited number of user/role/submission node combinations from many VO's (at least 50), up to the limit of the number of jobs supported on a CE node". This test will probably be done in a small and controlled setup (non production) using a fake CA as is usually done by the certification team.

M "Job failures due to restart of CE services or reboot < 0.1%".
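The threshold in point M is a plain ratio. A minimal sketch of the check, using integer arithmetic to avoid rounding at the boundary, could look as follows; the function name and the sample figures are illustrative only.

```python
def point_m_passed(total_jobs: int, failed_due_to_restart: int) -> bool:
    """Return True if the restart-related failure rate is strictly below 0.1%.

    failed / total < 1/1000 is evaluated as failed * 1000 < total
    so that the comparison stays exact.
    """
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    if not 0 <= failed_due_to_restart <= total_jobs:
        raise ValueError("failed_due_to_restart must be between 0 and total_jobs")
    return failed_due_to_restart * 1000 < total_jobs


# With 5000 jobs, 4 failures (0.08%) pass and 5 failures (0.1%) do not.
print(point_m_passed(5000, 4), point_m_passed(5000, 5))
```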

O "Graceful failure or self-limiting behavior when the CE load reaches its maximum (e.g. if a CE node can support only 5000 jobs it must not crash or become unresponsive with more than that)".
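The "self-limiting behavior" asked for in point O can be illustrated with a small admission-control sketch: once the configured limit is reached, new submissions are refused instead of being accepted until the service degrades. This is not how CREAM implements it; the class and method names are hypothetical.

```python
class SubmissionLimiter:
    """Refuse new jobs once the number of active jobs reaches max_jobs."""

    def __init__(self, max_jobs: int):
        if max_jobs <= 0:
            raise ValueError("max_jobs must be positive")
        self.max_jobs = max_jobs
        self.active = 0

    def try_accept(self) -> bool:
        """Accept one job if there is capacity; otherwise refuse it cleanly."""
        if self.active >= self.max_jobs:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Record that one active job has left the system."""
        if self.active > 0:
            self.active -= 1


limiter = SubmissionLimiter(max_jobs=5000)
accepted = sum(limiter.try_accept() for _ in range(5100))
print(accepted)  # the 100 submissions beyond the limit are refused
```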

At a later stage, using a WMS from the pilot, these points can be evaluated:

D.i "The ICE / CREAM job submission chain should be able to meet all performance criteria and otherwise perform at least as well as the WMS / LCG CE submission chain". This point depends on BUG:47911

D.ii "The ICE-WMS must deal gracefully with large peaks in the rate of jobs submitted to it.", although an agreement must be reached on what is considered as 'gracefully'.

Testing point J ("At least 5000 simultaneous jobs per CE node")

This section describes the plan for executing the tests aimed at verifying point J of the transition plan.
The following table shows the fixed parameters at each target site and link to the page used to collect results.

| Site | CE | LRMS | #job slots | submission rate* | MySQL host* | Results |
| GRNET | - | PBS | - | - | yes | report |
| CERN | ce110.cern.ch:8443/cream-lsf-grid_2nh_dteam | LSF | - | - | no | - |
| FZK | pps-cream-fzk.gridka.de:8443/cream-pbs-pps | PBS | 500 | - | no | - |
| PIC | - | Condor | - | - | no | - |

* The submission rate parameter depends on other variables; it still needs to be clarified which parameter could be kept fixed.
* MySQL host (yes/no) means whether the site has a separate MySQL host or not.

Technical documentation

Installation of Cream CE

To install a CREAM CE please follow the instructions reported at:

http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:devel:install-cream31-devel

Some remarks:

  • In the instructions, in the " YUM repository setup " section, three possible options are listed. Please refer to the third one ( For the CREAM PPS pilot related activities)
  • Please check (and in case remove) other CREAM related repo files in /etc/yum.repos.d (e.g. coming from older installations)
  • In the instructions, you will note that sometimes different actions must be taken if you are considering RPMs v. 1.8.x (patch #1755) or if you are instead considering RPMs v. greater than 1.8.x. We are in the latter scenario (i.e. rpms v. greater than 1.8.x)
  • Please publish TestbedB as the value of the GlueCeStateStatus attribute. This is done, as explained in Appendix A, via the variable CREAM_CE_STATE, to be specified in <your-siteinfo.def-dir>/services/glite-creamce
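Before running the configuration, the CREAM_CE_STATE setting can be sanity-checked with a few lines of Python. The sketch below assumes the services file uses plain shell-style KEY=value assignments; the parser is deliberately simple and the helper name is hypothetical.

```python
def read_yaim_variable(config_text: str, name: str):
    """Return the value assigned to `name` in shell-style KEY=value text, or None."""
    value = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, rest = line.partition("=")
        if key.strip() == name:
            # Strip optional quotes; the last assignment wins, as in a sourced file.
            value = rest.strip().strip('"').strip("'")
    return value


# Illustrative excerpt of <your-siteinfo.def-dir>/services/glite-creamce
sample = """
# CREAM CE specific settings
CREAM_CE_STATE=TestbedB
"""
print(read_yaim_variable(sample, "CREAM_CE_STATE"))
```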

Please then publish your CREAM CEs in a dedicated CREAM site-bdii (you can host this CREAM site-bdii on one of the CREAM CEs). Please choose a new (i.e. non-existing) basedn for this CREAM site-bdii.

Installation of ICE WMS (Phase2)

To install an ICE-enabled WMS, please follow the usual instructions reported here, referring to the installation of "glite-WMS".

However there are some differences to consider with respect to the official guide:

  1. The middleware repository (see "The middleware repositories" section): besides the production one, please also consider this file to be copied in /etc/yum.repos.d

yum Repositories for Phase2

Here are the instructions to use them:

SLC4-native-compiled i386 glite-CREAM_ce 3.1:

It is a yum repository:

  • create the file /etc/yum.repos.d/glite-CREAM_ce_i386.repo containing the following lines:
[glite-CREAM_ce_i386]
name= glite-CREAM_ce i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-CREAM_CE/sl4/i386/
enabled=1
  • run: yum update

SLC4-native-compiled i386 ICE enabled glite-WMS_ICE 3.1:

It is a yum repository:
  • create the file /etc/yum.repos.d/glite-WMS_ICE_i386.repo containing the following lines:
[glite-WMS_ICE_i386]
name= glite-WMS_ICE i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-WMS_ICE/sl4/i386/
enabled=1
  • run: yum update

SLC4-native-compiled i386 ICE enabled glite-LB 3.1:

It is a yum repository:

  • create the file /etc/yum.repos.d/glite-LB_i386.repo containing the following lines:
[glite-LB_i386]
name= glite-LB i386
baseurl=http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE/glite-LB/sl4/i386/
enabled=1
  • run: yum update
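The three repository files above share the same layout, so they can be generated from one template. The following is a minimal sketch using Python's standard configparser; the repository IDs and base URLs are copied from the instructions above, while the helper name is an assumption. It only renders the text and leaves writing to /etc/yum.repos.d and running yum update to the administrator.

```python
import configparser
import io

# Base URL taken from the repository definitions above.
PILOT_BASE = "http://grid-it.cnaf.infn.it/apt/glite/pps/pilot/CREAM_CE"


def render_repo(repo_id: str, name: str, baseurl: str) -> str:
    """Render the content of a yum .repo file with a single enabled repository."""
    parser = configparser.ConfigParser(interpolation=None)
    parser[repo_id] = {"name": name, "baseurl": baseurl, "enabled": "1"}
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


repos = {
    "glite-CREAM_ce_i386": ("glite-CREAM_ce i386", f"{PILOT_BASE}/glite-CREAM_CE/sl4/i386/"),
    "glite-WMS_ICE_i386": ("glite-WMS_ICE i386", f"{PILOT_BASE}/glite-WMS_ICE/sl4/i386/"),
    "glite-LB_i386": ("glite-LB i386", f"{PILOT_BASE}/glite-LB/sl4/i386/"),
}

for repo_id, (name, baseurl) in repos.items():
    print(render_repo(repo_id, name, baseurl))
```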

Yum repositories for Phase 3

Pilot Layout

Phase1 (closed in September)

  • Cream CEs:
-CNAF: cert-ce-03.cnaf.infn.it + 4 virtual WNs using pbs, 7 queues (alice, atlas, cms, lhcb, ops, dteam, pps)
-FZK: pps-cream-fzk.gridka.de

  • ICE WMS:
-FZK: pps-rb-fzk.gridka.de
-SCAI: glite-wms2.scai.fraunhofer.de

  • Available CLIs:
-CNAF: cert-ui-01.cnaf.infn.it
-FZK: pps-vobox-fzk.gridka.de (alice prod setup)

Phase2

UI

WMS + ICE

  • cert-rb-01.cnaf.infn.it (CMS-CRAB test); since 11-Mar-09 this WMS points to the production BDII
  • glite-wms2.scai.fraunhofer.de

Cream CEs

LSF Cream

INFN-PADOVA

  • cream-10 SL4, batch master LSF
  • cream-21 SL4, CE with LSF
  • cream-22 SL4, CE with LSF
  • cream-23 SL4, CE with LSF
  • cream-24 SL4, CE with LSF
  • cream-25 SL4, CE with LSF
  • cream-26 SL4, CE with LSF
  • cream-27 SL4, CE with LSF - site BDII
  • prod-wn-001 SL4, WN with LSF
  • prod-wn-002 SL4, WN with LSF
  • prod-wn-003 SL4, WN with LSF
  • prod-wn-004 SL4, WN with LSF
  • prod-wn-005 SL4, WN with LSF

PBS Cream

INFN-PADOVA

  • cream-28 SL4, CE with pbs - batch master
  • cream-29 SL4, CE with pbs
  • cream-30 SL4, CE with pbs
  • cream-31 SL4, CE with pbs
  • cream-32 SL4, CE with pbs
  • cream-33 SL4, CE with pbs
  • cream-34 SL4, CE with pbs - site BDII
  • prod-wn-006 SL4, WN with PBS
  • prod-wn-007 SL4, WN with PBS
  • prod-wn-008 SL4, WN with PBS
  • prod-wn-009 SL4, WN with PBS
  • prod-wn-010 SL4, WN with PBS

CNAF LSF CREAM CE CLUSTER

  • cert-04.cnaf.infn.it SL4 LSF CE
  • cert-05.cnaf.infn.it SL4 LSF CE
  • cert-06.cnaf.infn.it SL4 LSF CE
  • cert-07.cnaf.infn.it SL4 LSF CE
  • cert-08.cnaf.infn.it SL4 LSF CE
  • cert-09.cnaf.infn.it SL4 LSF CE
  • cert-13.cnaf.infn.it SL4 LSF CE

CNAF PBS CREAM CE CLUSTER

  • cert-ce-03.cnaf.infn.it SL4 PBS CE (CMS-CRAB test)

Site BDII:

  • ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid
  • ldap://cream-34.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST-PBS,o=grid
  • INFN-BARI-CREAMTEST ldap://cream-ce-1.ba.infn.it:2170/mds-vo-name=INFN-BARI-CREAMTEST,o=grid
  • INFN-CNAF-CREAM ldap://cert-07.cnaf.infn.it:2170/mds-vo-name=INFN-CNAF-CREAM,o=grid
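Each site BDII entry above is an LDAP URL made of a host, a port and a base DN. The sketch below splits such a URL with the standard library and builds the argument list for an ldapsearch query (simple bind, -H for the server URI, -b for the search base). The helper name is hypothetical, and the command is only constructed, not executed.

```python
from urllib.parse import urlsplit


def bdii_query_command(bdii_url: str) -> list:
    """Build an ldapsearch argument list for a site BDII URL of the form
    ldap://host:port/base-dn."""
    parts = urlsplit(bdii_url)
    base_dn = parts.path.lstrip("/")
    endpoint = f"{parts.scheme}://{parts.hostname}:{parts.port}"
    return ["ldapsearch", "-x", "-H", endpoint, "-b", base_dn]


url = "ldap://cream-27.pd.infn.it:2170/mds-vo-name=INFN-PADOVA-CREAMTEST,o=grid"
print(" ".join(bdii_query_command(url)))
```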

Phase3

Activity a)

Activity b)

This list has to be verified:

  • FZK: 1 CREAM with PBS (12 cores available through the queue 'pps')
  • CNAF: WMS + LSF CEs
  • PIC ?
  • CERN : 1 CREAM pointing to the production queues (LSF) and WNs.
  • GRNET: 1 CREAM with PBS (on a separate node) and 5 WNs. CREAM MySQL DB on a separate host.

Results

Phase1

Feedback from the experiments

Alice discussed the progress of the pilot in several task force meetings

General comments on installation procedure

Special configuration done in FZK to support Alice

SAM

SAM tests for CREAM are now available in the PPS SAM instance

Nagios

No progress with Nagios was recorded during phase1. The testing activity is transferred to Phase2

Phase2

Feedback from the experiments

The whole of phase2 was characterised by a reduced interaction with the experiments with respect to phase1. There are three reasons for that:

  1. the concurrent deployment of the first version of CREAM in production brought Alice to work almost exclusively with production sites
  2. the bulky installations at PADOVA and CNAF were often reserved for stress testing
  3. delays in other sites to set-up comparable alternative testbeds.

Alice confirmed that the set of patches defined in GGUS:47489 is compatible with the use they intend to make of CREAM in production

General comments on installation procedure

The installation and configuration of the products with YUM + YAIM is fully supported and of production quality.

SAM

Results of direct and WMS-based submission tests against the CREAM CEs in the pilot are now available on the SAM PPS portal (https://pps-sam.cern.ch:8443/sam/sam.py).

Nagios

As the SAM mechanism is now available and instances of CREAM are available in production, the responsibility for the corresponding Nagios development was fully moved to OAT

CREAM and ICE Development

Several performance and scalability issues affecting the ICE -> CREAM submission chain were identified and followed up. Most of them were solved by the set of patches identified by GGUS:47489. The performance issue reported with BUG:47911, identified as a showstopper for the full replacement of the lcg-CE, is understood and the solution is under testing

Phase3

Activity a)

Activity b)

  • (10 June 2009) Testing point K of transition plan: a test with 64 distinct users (4 VOs, 4 roles, 4 attributes) has been successfully completed. 100 jobs (simple hostname) per user have been submitted and completed successfully. The test has been performed using the certification CA, with a machine where CREAM (current production version) and 4 WNs are installed. The scripts used for this test are here. The only issue found was the one related to bug #48083.
The next step will be to submit larger numbers of jobs, up to 5k simultaneously present in the CE.
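The size of the point K test follows from the combinations used: 4 VOs x 4 roles x 4 attributes give 64 distinct users, and 100 jobs per user give 6400 jobs. The sketch below enumerates such a matrix; the VO, role and attribute names are placeholders, not the ones used in the actual test scripts.

```python
from itertools import product


def build_user_matrix(n_vos: int = 4, n_roles: int = 4, n_attributes: int = 4) -> list:
    """Enumerate every (VO, role, attribute) combination as one distinct test user."""
    vos = [f"vo{i}" for i in range(1, n_vos + 1)]                  # placeholder names
    roles = [f"role{i}" for i in range(1, n_roles + 1)]            # placeholder names
    attributes = [f"attr{i}" for i in range(1, n_attributes + 1)]  # placeholder names
    return list(product(vos, roles, attributes))


JOBS_PER_USER = 100
users = build_user_matrix()
print(len(users), "users,", len(users) * JOBS_PER_USER, "jobs")
```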

Credits

This pilot has been conceived taking into account the plans of JRA1/SA3 developed within the cluster of competence and the SA3 guidelines for certification

CREAM and ICE precertification
================================

Foreword
-------
The new certification model in the EGEE-III project foresees a
pre-certification phase done by the so-called "cluster of competence".
A cluster of competence in charge of the pre-certification of a certain
software component is composed by the JRA1 developers of that component,
and by the SA3 people close (i.e. local) to them. The collaboration
of SA1 is also foreseen.
The pre-certification phase is then followed by a formal certification process
usually done by a partner different than the one in charge of the development
and of the pre-certification of that component.
This formal certification phase is supposed to be very quick, since most (all)
of the problems should have been found and addressed during the
pre-certification step.

So the Italian cluster of competence is in charge of the pre-certification
of CREAM and ICE


Testbeds
--------
For the pre-certification of CREAM and ICE, we envisage the use of
2 testbeds:

- Testbed a: small testbed, supposed to be used by the cluster of competence
  people, that is by the CREAM and ICE developers and by the SA3 Italian
  people. This testbed is supposed to be used mainly for functionality tests
  and for some limited stress tests.


- Testbed b: larger testbed, supposed to be used mainly by experiment people,
  and in particular for scalability tests.
  This testbed basically corresponds to the "experimental services" considered
  so far for the WMS


Hardware requirements for pre-certification testbeds
----------------------------------------------------
For testbed a:
    1 UI node
    1 WMS node
    1 LB node
    1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs
           are the production ones)
    4 CREAM based CEs, possibly distributed in different sites

For testbed b:
    2 WMS nodes
    1 LB node
    1 BDII comprising both CREAM and LCG based CEs (the LCG based CEs are
           the production ones): this can be the same BDII used in testbed a
    At least 20 CREAM based CEs, possibly distributed in different sites

As for the WNs, it is not necessary to have dedicated machines
for such testbeds: the same WNs used in production can be used.
It is just a matter of reserving a certain number of slots (e.g. 50) for the
queues dedicated to the CREAM pre-certification tests.


Configuration of the testbeds
-----------------------------
On both testbeds it is suggested to devote 2 queues per CREAM CE (so 2 CEIDs
per CREAM CE machine) to these tests.

As for the VOs to enable, on testbed b the "production" VOs should
be authorized. On testbed a also the 21 "fake" VOs should be enabled, so
that testers can perform tests submitting jobs on behalf of multiple users
belonging to different VOs.

Updates on the testbeds
-----------------------
Both testbeds are supposed to be updated (WMSes and/or CREAM CEs) whenever
a new blocking issue is found and/or whenever a certain number of new
fixes for non-blocking problems should be tested.
It is supposed that testbed a will be updated much more often than
testbed b.
Testbed b should be updated only after having updated testbed
a, and after having tested (by the cluster of competence people) on
the testbed a the updated version.

Software deployed on the certification testbeds must be tagged
(tags will be done on the proper "pre-certification" CVS branches)


-------- Original Message --------
Subject: CREAM testing plan
Date: Fri, 23 May 2008 15:12:20 +0200
From: Oliver Keeble <oliver.keeble@cern.ch>
To: Markus Schulz <Markus.Schulz@cern.ch>,  Francesco Giacomini <francesco.giacomini@cnaf.infn.it>, Di Qing <Di.Qing@cern.ch>


My summary of the plan;

Plan and criteria for CREAM certification

Two broad criteria

   * scalability at the level previously defined in CE acceptance criteria
   * functionality verified based on the CLI spec, direct verification of
the web service interface, and a set of CREAM tests functionally
equivalent to those currently run against the lcg-CE

Andreas to organise with Massimo a functional test plan, and to find
resources for writing the tests (should not take more than a week).

CERN will add extra resources to the test infrastructure currently used
to validate the lcg-CE, including a CREAM CE and a production-level
(non-ICE) WMS. CERN will run the tests, including the 5 day soak.

Comment - we are *not* certifying ICE, and would like to avoid
difficulties in interpreting test results if possible. Di will make a
judgement as to whether we can simply loop over the CREAM CLI tests we
will have in order to do the appropriate scalability validation. If so,
this is the approach we will take. If not, we will take the ICE rpms and
upgrade the testbed WMS. There is a question over testing
proxy-renewal if we don't use the WMS.

When CREAM is released to production, it will be advertised as being
made available for larger sites to install in parallel with their
existing lcg-CEs, not as a replacement for them. In this way we will
soon have a pool of CREAM CEs exposed to production work patterns and
loading, without endangering availability of resources.

After initial release, responsibility for CREAM scalability/stress
testing will pass to INFN and CERN certification will not invoke such tests.

History

25-Jun-08: SAM - duplication of SAM sensor slowed down by SAM unavailability

2-Jul-08: The decision is made to extend phase1 until the 22nd of July (see PPIslandFollowUp2008x07x01)

22-Jul-08: The decision is made to extend phase1 until the 26th of August (see PPIslandFollowUp2008x07x22)

2-Sep-08: The decision is made to extend phase1 until the 30th of September (see PPIslandFollowUp2008x09x02)

1-Oct-08: The decision is made to close phase1 and start phase2 (see PPIslandFollowUp2008x10x01)

28-Nov-08: New version of CREAM available for the installation on the CREAM PPS pilot

09-Dec-08: Alice is starting a new stream of activity on the pilot at CNAF

09-Dec-08: Pilot end-date moved to end of January.

11-Dec-08: A request was sent to PPS site admins and the EGEE regional managers to join for an extension of the pilot

12-Dec-08: There is a new yaim-cream-ce (v. 4.0.7-2) in the YUM repo for the CREAM PPS pilot (PATCH:2667).

13-Jan-09: A new version of CREAM was released to the pilot. This version fixes BUG:45437 and BUG:45736.

13-Jan-09: within the SA1 coordination meeting the SA1 ROCs were invited to use the pilot version of CREAM for their regional installation

13-Jan-09: Stress test of the ICE+CREAM submission chain: a submission rate of 40 jobs/min was sustained, but a failure rate higher than expected was observed. The issue is currently under analysis

13-Jan-09: Pilot end-date moved to mid-March.

20-Jan-09: Alice tested successfully the CLI using the CE at FZK.

03-Feb-09: PIC joined the pilot, setting up multiple CREAM CEs accessing the production queues

03-Feb-09: The high failure rate observed in the ICE+CREAM submission chain was analysed and the causes fixed. The system now correctly sustains a submission rate of 40 jobs/min. Stress tests with long-lasting jobs seem to show performance issues when the number of active jobs in the system increases.

03-Feb-09: Alice requests CERN to propose a timeline for the deployment of CREAM in production

25-Feb-09: Results of direct and WMS-based submission tests available on the SAM PPS portal (https://pps-sam.cern.ch:8443/sam/sam.py)

11-Mar-09: ICE-WMS cert-rb-01.cnaf.infn.it reconfigured in order to allow CMS to use the production instances of CREAM

12-Mar-09: CMS testing in progress against ice-WMS at CNAF

30-Mar-09: GGUS:47489 opened to track the release in production of a set of patches validated by the pilot

10-Jun-09: Testing point K of transition plan: a test with 64 distinct users (4 VOs, 4 roles, 4 attributes) has been successfully completed.

Topic attachments

  • Cream-phase2-initial-planning-081002.gif (r1, 11.0 K, 2008-10-03 11:37, AntonioRetico): Phase2 - Initial Planning
  • creampilotph1.gif (r1, 6.4 K, 2008-06-26 01:04, AntonioRetico): Initial plan for phase 1
Topic revision: r55 - 2009-10-01 - unknown