LCG Web>LCGGridDeployment>GLitePreProductionServices>EGEE_PPS_Coordination>StagedRolloutSA1WorkingVersion (2009-07-02, AntonioRetico)

EditAttachPDF

Authors: Antonio Retico(SA1)
First Published: Never (internal copy)
Last Update: 2009-07-02 by AntonioRetico
Last Content Review: 1-Jul-2009
Expiration: 1-Oct-2009 (after this date please contact the author before using the content of this page)

Staged roll-out of grid middleware: Implementation details and roadmap for SA1

Scope and goals of this document
Staged roll-out and Operations: Implementation Strategy
Infrastructure for staged roll-out
Transition Plan
- General pre-condition to start the implementation
- Implementation plan of a staged transition
Production sites willing to join

Scope and goals of this document

This document is addressed to site and regional managers in SA1. It is meant to :

describe the implementation details of the the process to roll-out middleware updates to the production service in a controlled way, limited to the aspects relevant for Operations (SA1/EGI-OU). Specifically it aims to provide answers to the following points
1. Registration: How to you identify the sites involved in the early adoption
  - .i.e. How to know which site to contact for feedback on a given product/component update?
2. How to communicate with sites
3. How to report issues
4. Metrics
draft an initial plan for the transition from the current roll-out model to the new one, to be completed before the end of EGEE-III.

This document does not cover implementation nor transition details related to the middleware distribution (packaging, SW management, documentation, incident management) which are dealt with in a separate document by the UMD/EGI-MU working group.

For an overview of the staged roll-out process please refer to the document describing the general lines [1] .

Staged roll-out and Operations: Implementation Strategy

The staged roll-out is supposed to replace completely in EGI the processes of PPS Deployment Test and Release Test as they are defined in EGEE.

In order to prototype conveniently the model during the 2nd year of EGEEIII without causing service disruption and without unsustainable requirements for the regions in terms of participating resources, a feasible implementation strategy could be articulated in two main phases.

Phase 1: re-conversion of existing PPS resources to support the new process.
- the new set of procedures, documentation and tools (e.g SW repositories, release notes, registration, communication, reports) for the staged roll-out in production will be tried out by the PPS and PROD sites currently supporting the deployment testing and the release testing
- although the exposure of the software to production usage is not attained in this phase the new process is mimicked and refined without impacting on any production sites (apart from those ones already concerned by the release testing). At the same time the key function of deployment testing will be preserved (actually it will be enhanced, because based on the "production version" of the middleware meta-packages)
- during this phase the SA1 coordination (PPS coordination) still acts centrally to manage the destination of PPS resources (which sites are doing what). The option is left to the site and regional management to start a re-distribution of their resources in order either to transform their PPS sites into new production ones or to re-absorb the PPS resources in the production sites. This transformation can be handled locally and within locally defined timelines, but the commitment of the services and service managers to the staged-roll out must be preserved.
Phase 2: Once the procedures, documentation and tools are consolidated a campaign starts in the regions to invite more production sites to join and ramp-up to the service targets.
- the management of the resources (which sites are doing what) is fully transferred to the ROCs
- the role of the SA1 Coordination (OCC) is to monitor the situation, identify critical areas (e.g. important products/components not covered) and follow them up with the regions.

Infrastructure for staged roll-out

Registration of sites and services

Sites and services dedicated to the staged roll-out need to be listed in order e.g. to properly address the communication.

Service end-points participating to the staged roll-out are flagged in the GOC DB. The flag is managed by the site and regional managers as any other site attribute.

Pro: GOCDB is a well established tool --> no need for sites and ROCs to learn new things

Pro: the participation or not to staged-roll-out has actually to be supported by the site administrator, so the GOCDB sounds like a semantically appropriated place where to store this information.

Con: a new requirement is put on the GOCDB which will take some time to be implemented though (order of months)

Alternative solution 1 (possibly to be used as temporary implementation workaround waiting for the GOCDB): Sites and services taking part in the staged roll-out are registered in the PPS Activity Management Registry , a custom db integrated with Savannah currently used in PPS.

Pro: registration of the site to an activity fully supported and easy to do

Pro: web interface available supporting authentication and authorisation for site and regional managers

Con: registration of single services (nodes) not supported --> may need extension

Con: not recognised by OAT among the standard "Tools for Operations" --> not clear how to support it after the end of EGEE

Con: procedure to register new sites in the system frankly appalling (not real-time, requires the creation of a squad in Savannah)

Alternative solution 2: Sites and services participating to the staged roll-out are flagged in the information system.

Pro: No new tools or changes needed

Con: appropriate attributes (e.g. Service.QualityLevel) available only in GLUE2 --> not available before ??? (may take longer than the implementation on the GOCDB and there are no compatible workarounds possible)

Con: requires configuration at the level of the middleware (configuration of services used for the staged roll-out technically different from thse used for production)

Con: risk that service end-points flagged as running "unstable" versions end up to be filtered out by default by the end-users, so reducing the value of the staged-roll-out

Communication

The following communication streams have to be put in place:

OU → EA sites : Notification that a new release is available and request to upgrade
EA site → OU : Notification that the upgrade was performed (+ installation and configuration outcome)
EA site → OU : Report of issue found during operations
OU → EA sites : request to roll-back to old version

OU → EA sites: Notification that a new release is available and request to upgrade

This stream is implemented via a formal "Upgrade task" object with the following attributes issued by the OU:

Submitter: it is the OU for sites which have formally committed to do the staged roll-out, but the same template can be used by the NGI for internal testing activity. If used information from the "internal" tasks could be shared as well

assigned_to: the EA site

due_date : it defines the deadline for the EA site to upgrade

status : Assigned|In Progress|Waiting|Done|Rejected|Canceled

deployment_outcome: OK | KO | WARN (to be filled by the EA site as a feedback on the installation and configuration procedure)

[Optional] report : free text pre-formatted field to be filled by the EA site (e.g. to detail installation issues or to explain a WARN)

The task body will contain of course all the relevant links to documentation and repositories to be used (this information is not supposed to change frequently though and should be permanently available on web pages as well)

The tasks can be specialised per product/component in terms of due date (e.g. different deadlines can be set for upgrade of different products/components)

A proposed value for the "standard" due date is

due date = submission date + 3 working days

NOTE:
An implementation based on the Savannah tool is currently available and used within the PPS pilot/experimental services. It requires the site administrators to be registered in Savannah (process currently not automated).

A very useful feature (currently not supported by the infrastructure) would be the possibility to query from the information system the (meta-package) versions of the services run at the sites. With this feature available, utilities to monitor the process could be easily built up without strict need to use explicit tasks

As an alternative solution we could use mailing lists and track the replies in GGUS (as it is done e.g. for the current gLite release management process). This approach though is a but clumsy to handle and it generates a number of GGUS tickets not related to user support. Actually the use of GGUS tickets to track the release is being abandoned by the gLite release team. It would be possible to use GGUS if it offered an interface to automate the bulk submission of tickets addressed to special groups of sites (currently not in the plan and out of the scope of the application).

EA site → OU : Notification that the upgrade was performed

upgrade_task.status is set to "Done" and the upgrade_task.outcome is populated.

EA site → OU : Report of issues found during operations

This is done via the standard tools (GGUS tickets and/or Savannah bug reports).

For the classification of the Savannah bugs a new "Bug Detection Area"(e.g. = "advanced production") is identified

OU → EA sites : request to roll-back to old version

This request is implemented via e-mails sent to appropriate e-mail lists

If the request is sent while some of the upgrade_tasks are still active their status is set to "Canceled"

Reports and metrics

No explicit reports are requested to the EA sites regarding the operations of a new version of the middleware. The decision whether to move a release forward to the full production or not is done based on the reported issues and on a "quarantine" period defined per product/component.

The default "quarantine" period is set to two weeks (this value can be overruled on a product/component base)

If the proposed method of distribution of formal tasks is put in place objective metrics to monitor the process can be identified and statistics calculated e.g.

# of upgrade_tasks finished on-time/late
Time needed to the EA sites to upgrade (closure of the ticket)

With communication based on e-mail these statistic cannot be automatically collected.

Visualisation

Being able to query and visualise the list of sites and services involved in the process is important for the several reasons e.g.:

the OU and MU can analyse the distribution of and identify possible critical areas (e.g. components not covered)
Specialised views of the monitoring displays (e.g. gridmap, gridview and SAM) can be selected in order to verify that the complex of sites that have upgraded are behaving correctly. This is important for the OU to confirm the decision to move the update forward
If needed and required corrective weights to the site availability numbers could be applied for the concerned sites

The envisaged standard method to manage information for particular groups of sites is the use of the topology database. So an interface between the registry and the topology database to export this information will have to be provided.

Service targets

Number of sites/services needed to take part to the staged roll-out process at a steady state.

This number is not defined "a priori" and will be probably different per service type.

If the staged roll-out process turns out to be useful and successful this number will settle down naturally at the site and at the regional level.

Tentatively and just as a first working approximation we could set the following service target:

The number of service instances in production committed to this process should range from 1 unit up to the 5% of the total per deployment scenario

Transition Plan

General pre-condition to start the implementation

Before any transition activity involving the sites can start we need the "delta" repository and the preview release notes to be available at least for one product/component, according to what defined in the process general lines [1] .

We indicate the date when the pre-condition is met as Day0 . Day0 is used as reference start date for the implementation plan described in the following sub-sections.

A few transition tasks that can start before Day0 are:

Ts0-1 : definition and submission of requirements for GOCDB
Ts0-2 :change the configuration information for the PPS sites in the PPS registry (to be used as a temporary registry waiting for the GOCDB to be ready)

Implementation plan of a staged transition

Day0 is used as reference start date for the implementation plan described in the following sub-sections. The elapsed time (Working days) estimated for the transition from one stage to the other is calculated assuming that

1 FTE works full time on the implementation
one single task is performed by one single person

Ts = Elapsed time needed by 1 FTE to perform all the listed tasks in series TP = Elapsed time needed by 1 FTE to perform all the listed tasks in parallel whenever possible

elapsed time = (Ts+Tp)/2 ± (Ts-Tp)/2

Stage0 (Starting condition): Day0

The pre-production infrastructure [2] as it is today (pilot/experimental services excluded)

deployment testing:
- 6 PPS sites covering the test of 26 services
- sites and nodes registered in GOCDB (TYPE=PPS)
- commitments registered in PPS registry
- centrally coordinated (task distribution and reporting)
- communication via e-mail (tracked via a "PPS release" GGUS ticket )
release testing:
- 5 production sites covering the test of 6 services
- sites registered in GOCDB (TYPE=Production)
- commitments registered in PPS registry
- upgrade of the sites not mandatory within the release process
- positive or negative acknowledge required from the sites for update requests
- positive or negative feedback from the sites that applied the upgrade required
- communication done via e-mail (tracked via a "production release" GGUS ticket )
~ 3-4 active PPS sites not involved in any of the above
- sites and nodes registered in GOCDB (TYPE=PPS)
- completely managed by the ROCs
- no acknowledge required on the upgrades
- notified of new releases via EGEE broadcasts
SAM monitoring infrastructure
gridmap display
list of PPS sites and services taken from the GOCDB available

Stage1 (PPS sites acting as EA, communication via e-mail): Day0 + 8 ± 2 = Day1

New set of procedures, documentation and tools in pace and used by PPS and PROD sites currently supporting the deployment testing and the release testing

~10 PPS and PROD sites mimicking the EA of all the available products/components (those released in independent repositories)
sites and nodes registered in GOCDB (TYPE=PPS an PROD)
commitments registered in PPS registry
upgrade of the sites mandatory
feedback on the installation from the sites that applied the upgrade required
installation feedback used as replacement of the "deployment test reports"
central EA coordinator responsible for:
- task distribution
- organisation and publication of installation feedback
EA management procedure available (for operations managers)
EA operational procedure available (for sites)
communication via e-mail (not tracked)
SAM results available
comprehensive SAM display NOT available
comprehensive gridmap display NOT available
comprehensive list of EA sites available (from PPS registry)

Tasks for the transition Stage0 → Stage1

Ts0s1-1 : 2pd: prepare information page with links to "delta" repositories and release notes
Ts0s1-2 : 2pd: re-group the sites in the PPS registry in order to make a single group of "EA" sites out of the "deployment testing" and "release testing" sites. Update the PPS documentation accordingly.
Ts0s1-3 : 2pd: create new templates for notifications to the EA sites (communication via e-mail).
Ts0s1-4 : 2pd: change the "PPS deployment test reports" page into a new display showing the feedback on the installation by EA
Ts0s1-5 : 1pd: Update (remove) the gLite release procedure for PPS (this is part of a bigger change to be done at the level of the general release procedure in order to reflect the new deployment model).
Ts0s1-6 : 1pd: Contact the sites (meeting) to make sure that the changes in the operational procedure are understood and accepted by everybody
- Actually the most relevant operational change is for the "release testing" sites. With the new process a deadline is set
Ts1s1-7 : 1pd: Interface PPS registry --> Web pages: automated comprehensive list of EA sites

Responsibility: PPS coordination

Estimated overall elapsed: 8 ± 2 days

Stage2 [optional] (PPS sites acting as EA, communication via tasks): Day1 + 7.5 ± 3.5 = Day2

PPS and PROD acting as EAs . Communication based on tasks

~10 PPS and PROD sites mimicking the EA of all the available products/components (those released in independent repositories)
sites and nodes registered in GOCDB (TYPE=PPS an PROD)
commitments registered in PPS registry
upgrade of the sites mandatory
feedback on the installation from the sites that applied the upgrade required
installation feedback used as replacement of the "deployment test reports"
installation feedback fetched and published automatically from the tasks
central EA coordinator responsible for:
- task distribution
EA management procedure available (for operations managers)
EA operational procedure available (for sites)
communication via tasks (tracked)
SAM results available
comprehensive SAM display NOT available
comprehensive gridmap display NOT available
comprehensive list of EA sites available (from PPS registry)

Tasks for the transition Stage1 → Stage2 [optional]

Prerequisite: decision to use Savannah tasks to track the process

Ts1s2-1 : 1pd: create new templates for notifications to the EA sites (communication via tasks)
Ts1s2-2 : 4pd: Re-write the "deployment test report" page in a way that the information is fetched and aggregated from the completed upgrade_tasks .
Ts1s2-3 : 1pd: Update the EA management procedure .
Ts1s2-4 : 1pd: Update the EA operational procedure .
Ts1s2-5 : 1pd: Contact the sites (meeting) to make sure that the changes in the operational procedure are understood and accepted by everybody
- the most relevant operational change for sites is the use of Savannah tasks instead of e-mails
Ts1s2-6 : 3pd: Set-up of displays for task-tracking and performance metrics (e.g. time to close the tasks)

Responsibility: PPS coordination

Estimated overall elapsed: 7.5 ± 3.5 days

Stage3 [conditioned to availability of GOC DB and Topology database] (PPS sites acting as EA, commitments in GOCDB): MAX(Day2,DayG,DayT) + 8.5 ± 1.5 = Day3