PPS Pilot Follow-up Meeting Minutes Wed 1 Apr 2009

  • Date: Wed 01 Apr 2009
  • Agenda: 54688
  • Description: Pilot of Cream CE: check-point
  • Chair: Antonio Retico

Attendance

  • PPS: Antonio Retico
  • CERN: Sophie Lemaitre
  • FZK: Angela Poschlad

  • CMS: Absent
  • Alice: Patricia Mendez

  • JRA1/Cream/WMS: Massimo Sgaravatto
  • SA3: Gianni Pucciani, Alessio Gianelle
  • SA1: Nick Thackray

Review of action items (tasks)

Not covered.

Status and results of the pilot service (by VOs and sites)

Not covered.

Status and results of the development (by developers)

Updates given by Massimo. This is the combined view:

  • The next foreseen CREAM/CEMon/BLAH patch is supposed to be PATCH:2666. We are likely to change its number (by cloning it) before releasing it, to avoid confusion, since its number is lower than that of the latest released patch (PATCH:2748).
  • If some urgent fixes are needed, we can release a new patch containing just those fixes before releasing #2666.
  • Concerning ICE, we are testing the fix for BUG:47911.

Gianni: no news from GRNET, which is certifying the three patches. The request to perform a "quick" certification has been delivered.

Massimo: PATCH:2748 has to be used for verifying the acceptance criteria.

Antonio: the release of the new patches had to be postponed to the 14th of April, since the certification estimate was too optimistic.

Open Issues (by VOs, sites, deployment teams)

See below.

Recommendations for release and deployment

See below.

Decision about termination/extension of the pilot

Antonio: As things currently stand, there is in my opinion no need for the CREAM pilot to keep running. On the other hand, the need to verify the criteria suggested for replacing the lcg-CE requires a new round of testing, and consequently some resources to be set up and organised for this purpose. I propose to re-use the pilot working group to coordinate all this by opening a Phase 3 of the pilot, with two main streams of activity (objectives):

  • a) Validation of ICE WMS (main activity)
  • b) Support to the testing of the CREAM performance criteria (accessory activity)

For a), initially one instance of ICE WMS running a version will be inserted in the production framework of CMS. This instance will use instances of CREAM deployed in production, instead of pointing to CREAM resources in the pilot as it does now. Additional instances of WMS will be installed on demand. The resources in Padova and CNAF are released by the pilot and return to the full control of the cluster of competence. Alice's direct involvement in the pilot with direct submission tests can be stopped, as there will be no accessible CREAM resources deployed.

For b), the sites already committed to the pilot (PIC, FZK and CNAF, and maybe CERN could join too) will be requested to deploy a small test infrastructure (e.g. one CREAM CE per site with a limited number of WNs) based on PATCH:2748. It will be used as a target by the testing teams in SA3 for the duration of the performance tests, which SA3 will run on behalf of SA1.

Gianni: In SA3 two partners, PNPI and GRNET, are interested in doing performance tests on CREAM. It would be good to coordinate this effort and target the points of the transition plan. It would also be good if both an lcg-CE and a CREAM CE were deployed at each site participating in this activity.

Nick: some of the tests have absolute metrics; the first point of the plan is not very precise and is more focused on the user's perception than on absolute numbers.

Gianni: so, do you think it is not mandatory to have two CEs at each site to compare the performance?

Nick: if it does not slow things down we can compare, but only if we have the resources and time.

Antonio: having two CEs would be good to eliminate a possible dependency on WMS problems.

Nick: but the results could be biased by inefficiency in the WMS-ICE chain; with the WMS you don't exercise the same path with CEs and CREAM. I think using direct submission is enough. I see the added value of having both CREAM and lcg-CE, but only if it does not slow things down. For the first point of the plan we could look at the production usage.

Antonio: regarding the LRMS variable, FZK is based on PBS and PIC should have LSF.

Massimo: Condor is supported, but stress tests need to be done by PIC.

Nick: testing PBS and LSF is good enough for now, since 90% of the users use them.

Massimo: BQS is still a work in progress, with at least one more month of work to go.

Gianni: a person from SA3 is testing SGE.

Antonio: ideally all of them should be tested, but we can start with PBS and LSF.

Antonio: about the deployment, GRNET tested the installation of the CREAM DB on a separate host. This could be interesting to test. I would ask Angela to install a CREAM CE in the deployment suggested by SA3. The CREAM CE with PBS can be provided by FZK, the one with LSF by PIC. If PIC is not available, would CERN provide that?

Sophie: I have to check, I will let you know.

Antonio: by next week Gianni will propose a plan stating the tests that will be done. I will ask PIC what kind of hardware they have; they should have Condor and LSF.

Massimo: AFAIK, they don't have LSF at PIC.

Antonio: OK, in that case we have to find an LSF CREAM CE; if not at PIC, hopefully at CERN.

Nick: one of the criteria states 5K simultaneous jobs, so we need to have enough job slots.

Antonio: what does "5K simultaneous jobs" mean?

Nick: I assume running, but I will check.

Sophie: I will ask Ulrich.

Gianni: another point in the transition plan was >= 50 user, role and VO combinations.

Antonio: this can be tested in a controlled environment (with the SA3 test CA).

Nick: what should the baseline version be to start the tests? It should be a production release, while for independent tests (e.g. the 50 users) we can use non-production patches.

Antonio: FZK and PIC are running PATCH:2748, so we can start testing direct submission on it. A WMS is present at CNAF and available for the first activity. The goal will be to move the CE status to production; at that point the pilot will be successful.

Points to verify: 5K simultaneous jobs, 50 users, graceful failure.

Nick: the job failure due to a service reboot should also be verified.

Antonio: the production version will be available by the 14th of April, so tests will start after that. Angela could prepare her setup from the 14th of April. Tests could start around the 23rd. We could close the pilot around the middle of May, hopefully earlier. One month of unattended usage is a metric that will be taken from production.

Open Questions

What minimum version of WMS do we need?

Is CERN providing resources?

What exactly does "5K simultaneous jobs" mean?

Resources from PIC.

AOB


Topic revision: r5 - 2009-04-02 - GianniPucciani
 