Authors: Antonio Retico(SA1)
First Published: 3-Jul-2009
Last Update: 2009-11-23 by Main.unknown
Last Content Review: 2-Jul-2009
Expiration: 1-Mar-2009 (after this date please contact the author before using the content of this page)

Staged roll-out of grid middleware: Implementation details and roadmap for SA1

Scope and goals of this document

This document is addressed to site and regional managers in SA1. It is meant to:

  • describe the implementation details of the process to roll out middleware updates to the production service in a controlled way, limited to the aspects relevant for Operations (SA1/EGI-OU). Specifically it aims to provide answers to the following points:
    1. Registration: How to identify the sites involved in the early adoption
      • i.e. how to know which site to contact for feedback on a given product/component update?
    2. What the participating sites are required to do
    3. How to communicate with sites
    4. How to report issues
    5. Metrics
  • draft an initial plan for the transition from the current roll-out model to the new one, to be completed before the end of EGEE-III.

This document does not cover implementation or transition details related to the middleware distribution (packaging, SW management, documentation, incident management), which are dealt with in a separate document by the UMD/EGI-MU working group.

For an overview of the staged roll-out process please refer to the document describing the general lines [1].

Staged roll-out and Operations: Implementation Strategy

In EGI, the staged roll-out is intended to completely replace the processes of PPS Deployment Test and Release Test as they are defined in EGEE.

In order to conveniently prototype the model during the second year of EGEE-III, without causing service disruption and without placing unsustainable demands on the regions in terms of participating resources, a feasible implementation strategy can be articulated in two main phases.

  1. Phase 1: re-conversion of existing PPS resources to support the new process.
    • the new set of procedures, documentation and tools (e.g. SW repositories, release notes, registration, communication, reports) for the staged roll-out in production will be tried out by the PPS and PROD sites currently supporting the deployment testing and the release testing
    • although the software is not exposed to production usage in this phase, the new process is mimicked and refined without impacting any production sites (apart from those already involved in the release testing). At the same time the key function of deployment testing is preserved (and in fact enhanced, because it is based on the "production version" of the middleware meta-packages)
    • during this phase the SA1 coordination (PPS coordination) still acts centrally to manage the allocation of PPS resources (i.e. which PPS sites cover which services). Site and regional managers are free to start redistributing their resources, either transforming their PPS sites into new production sites or re-absorbing the PPS resources into existing production sites. This transformation can be handled locally and on locally defined timelines, but the commitment of the services and service managers to the staged roll-out must be preserved.
  2. Phase 2: Once the procedures, documentation and tools are consolidated, a campaign starts in the regions to invite more production sites to join and ramp up to the service targets.
    • the management of the resources is fully transferred to the ROCs: it will be the responsibility of the ROCs to recruit the best sites to act as early adopters and coordinate the commitments in the region.
    • the role of the SA1 Coordination (OCC) is to monitor the situation, identify critical areas (e.g. important products/components not covered) and follow them up with the regions.

Infrastructure for staged roll-out

Registration of sites and services

Sites and services dedicated to the staged roll-out need to be registered, e.g. so that communications can be properly addressed.

Service end-points participating in the staged roll-out are flagged in the GOC DB. The flag is managed by the site and regional managers like any other site attribute.

  • Pro: GOCDB is a well-established tool --> no need for sites and ROCs to learn new things
  • Pro: participation in the staged roll-out has to be endorsed by the site administrators, so the GOCDB is a semantically appropriate place to store this information
  • Con: this puts a new requirement on the GOCDB, which will take some time (of the order of months) to be implemented
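Purely as an illustration of how such a registry could be consumed once the flag is in place, the sketch below filters Early Adopter end-points out of a GOCDB-style XML export; the EARLY_ADOPTER element, the file layout and the function name are assumptions of this sketch, not the real GOCDB schema or interface.

# Hypothetical sketch: list service end-points flagged as Early Adopter from a
# GOCDB-style XML export. The EARLY_ADOPTER element and the overall layout are
# assumptions; the real flag still has to be implemented in the GOCDB.
import xml.etree.ElementTree as ET

def early_adopter_endpoints(xml_path):
    """Return (site, hostname, service_type) tuples for flagged end-points."""
    endpoints = []
    for ep in ET.parse(xml_path).getroot().iter("SERVICE_ENDPOINT"):
        if ep.findtext("EARLY_ADOPTER", default="N").upper() == "Y":
            endpoints.append((ep.findtext("SITENAME", default=""),
                              ep.findtext("HOSTNAME", default=""),
                              ep.findtext("SERVICE_TYPE", default="")))
    return endpoints

if __name__ == "__main__":
    for site, host, service in early_adopter_endpoints("gocdb_endpoints.xml"):
        print(site, host, service)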

Communication

The following communication streams have to be put in place:

  1. OU → EA sites : Notification that a new release is available and request to upgrade
  2. EA site → OU : Notification that the upgrade was performed (+ installation and configuration outcome)
  3. EA site → OU : Report of issue found during operations
  4. OU → EA sites : request to roll-back to old version

OU → EA sites: Notification that a new release is available and request to upgrade

This stream is implemented via a formal "upgrade_task" object issued by the OU, with the following attributes:

  • Submitter: the OU for sites that have formally committed to the staged roll-out; the same template can be used by an NGI for internal testing activity. If used, information from the "internal" tasks could be shared as well
  • assigned_to: the EA site
  • due_date : the deadline for the EA site to perform the upgrade
  • status : Assigned|In Progress|Waiting|Done|Rejected|Canceled
  • deployment_outcome: OK | FAILED | WARN (to be filled in by the EA site as feedback on the installation and configuration procedure)
  • [Optional] report : pre-formatted free-text field to be filled in by the EA site (e.g. to detail installation issues or to explain a WARN)

The task body will of course contain all the relevant links to the documentation and repositories to be used (this information is not expected to change frequently and should also be permanently available on web pages).

The tasks can be specialised per product/component in terms of due date (e.g. different deadlines can be set for the upgrade of different products/components).

A proposed value for the "standard" due date is

due date = submission date + 3 working days
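A minimal, non-normative sketch of how an upgrade_task could be represented, with the attributes listed above and the proposed default due date, is given below; the Python representation and the add_working_days helper are assumptions of this sketch, not part of any existing tracker.

# Sketch of an upgrade_task record and the proposed default due date
# (submission date + 3 working days). Field names follow the attribute list
# above; this is an illustration, not the schema of an existing tool.
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

def add_working_days(start: date, working_days: int) -> date:
    """Advance a date by the given number of working days (Monday to Friday)."""
    current = start
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 = Monday-Friday
            remaining -= 1
    return current

@dataclass
class UpgradeTask:
    submitter: str                             # the OU, or an NGI for internal testing
    assigned_to: str                           # the EA site
    submission_date: date
    due_date: Optional[date] = None            # defaults to submission date + 3 working days
    status: str = "Assigned"                   # Assigned|In Progress|Waiting|Done|Rejected|Canceled
    deployment_outcome: Optional[str] = None   # OK|FAILED|WARN, filled in by the EA site
    report: str = ""                           # optional free-text installation report

    def __post_init__(self):
        if self.due_date is None:
            self.due_date = add_working_days(self.submission_date, 3)

A product/component-specific deadline can then be implemented simply by setting due_date explicitly when the task is created, as described above.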

EA site → OU : Notification that the upgrade was performed

upgrade_task.status is set to "Done" and upgrade_task.deployment_outcome is set to OK|FAILED|WARN.

EA site → OU : Report of issues found during operations

This is done via the standard tools (GGUS tickets and/or Savannah bug reports). References to a (single) GGUS ticket or to (multiple) bug reports can be linked to the task.

For the classification of the Savannah bugs, a new "Bug Detection Area" value (e.g. "Beta Service") is identified.

OU → EA sites : request to roll-back to old version

This request is implemented via e-mails sent to the appropriate e-mail lists.

If the request is sent while some of the upgrade_tasks are still active, their status is set to "Canceled".
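Continuing the earlier upgrade_task sketch (illustrative only), the cancellation rule above could be expressed as follows.

# Sketch: on a roll-back request, cancel every upgrade_task that is still active.
# "Active" is taken here to mean any status other than Done, Rejected or Canceled.
def cancel_active_tasks(tasks):
    for task in tasks:
        if task.status not in ("Done", "Rejected", "Canceled"):
            task.status = "Canceled"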

Reports and metrics

No explicit reports are requested from the EA sites regarding the operation of a new version of the middleware. The decision whether or not to move a release forward to full production is made based on the reported issues and on a "quarantine" period defined per product/component.

The default "quarantine" period is set to two weeks (this value can be overruled on a product/component base)

Objective metrics to monitor the quality of the process and of the releases can be identified and statistics calculated, e.g.:

  • # of upgrade_tasks finished on-time/late
  • Time needed by the EA sites to upgrade (closure of the ticket)
  • # of critical problems found at this stage per component
  • # of non-critical problems found at this stage per component

With communication based on e-mail these statistics cannot be automatically collected.
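If, instead, the upgrade_tasks are tracked in a queryable tool (as foreseen with the Savannah tasks), the first two statistics above could be derived along the following lines; the record layout reuses the earlier UpgradeTask sketch and is an assumption, not a specification of the tracking tool (per-component problem counts would come from the GGUS/Savannah references linked to the tasks).

# Sketch of the on-time/late counts and of the time needed to upgrade, computed
# over completed upgrade_tasks. 'completed' is a list of (UpgradeTask, closure
# date) pairs exported from the (assumed) tracking tool.
def upgrade_metrics(completed):
    on_time = late = 0
    days_to_upgrade = []
    for task, closed in completed:
        days_to_upgrade.append((closed - task.submission_date).days)
        if closed <= task.due_date:
            on_time += 1
        else:
            late += 1
    average = sum(days_to_upgrade) / len(days_to_upgrade) if days_to_upgrade else None
    return {"tasks_on_time": on_time, "tasks_late": late, "avg_days_to_upgrade": average}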

Visualisation

The possibility to query and visualise the list of sites and services involved in the process is important for several reasons, e.g.:

  • the OU and MU can analyse the distribution of the EA commitments and identify possible critical areas (e.g. components not covered)
  • specialised views of the monitoring displays (e.g. gridmap, gridview and SAM) can be selected in order to verify that the set of sites that have upgraded is behaving correctly. This is important for the OU when confirming the decision to move the update forward
  • if needed and required, corrective weights could be applied to the site availability figures of the sites concerned

The envisaged standard method for managing information about particular groups of sites is the topology database, so an interface exporting this information from the registry to the topology database will have to be provided.

Service targets

The number of sites/services needed to take part in the staged roll-out process at steady state.

This number is not defined "a priori" and will probably differ per service type.

If the staged roll-out process turns out to be useful and successful this number will settle down naturally at the site and at the regional level.

Tentatively, and just as a first working approximation, we could set the following service target:

The number of production service instances committed to this process should range from 1 up to 5% of the total, per deployment scenario.
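Purely for illustration, with hypothetical instance counts, the proposed target range per deployment scenario (from one instance up to 5% of the production total) can be computed as follows.

# Illustration of the proposed service target: between 1 instance and 5% of the
# total production instances per deployment scenario. The instance count used in
# the example is hypothetical.
import math

def ea_target_range(total_instances: int):
    """Return the (minimum, maximum) number of EA instances for a scenario."""
    return 1, max(1, math.ceil(0.05 * total_instances))

print(ea_target_range(300))  # e.g. 300 production instances -> (1, 15)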

Transition Plan

General pre-condition to start the implementation

Before any transition activity involving the sites can start, we need the "delta" repository and the preview release notes to be available for at least one product/component, according to what is defined in the process general lines [1].

We denote the date when this pre-condition is met as Day0. Day0 is used as the reference start date for the implementation plan described in the following sub-sections.

A few transition tasks that can start before Day0 are:

  • Ts0-1 : definition and submission of requirements for GOCDB
  • Ts0-2 : change the configuration information for the PPS sites in the PPS registry (to be used as a temporary registry while waiting for the GOCDB to be ready)

Implementation plan of a staged transition

The elapsed time (working days) estimated for the transition from one stage to the next is calculated assuming that:

  1. 1 FTE works full time on the implementation
  2. a single task is performed by a single person

Ts = elapsed time needed by 1 FTE to perform all the listed tasks in series
Tp = elapsed time needed by 1 FTE to perform all the listed tasks in parallel whenever possible

elapsed time = (Ts+Tp)/2 ± (Ts-Tp)/2
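As a worked example of this rule (Ts is the sum of the five person-day estimates of the Stage1 → Stage2 tasks listed later, i.e. 10 pd; the Tp value of 7 pd is an assumption consistent with the quoted estimate): elapsed time = (10+7)/2 ± (10-7)/2 = 8.5 ± 1.5 working days, which is the figure given for that transition.

# Worked example of the elapsed-time rule above. Ts = 10 pd is the sum of the
# Stage1 -> Stage2 task estimates listed later; Tp = 7 pd is assumed.
def elapsed_estimate(ts: float, tp: float):
    """Return (central value, half-width) of the elapsed-time estimate in working days."""
    return (ts + tp) / 2, (ts - tp) / 2

print(elapsed_estimate(10, 7))  # (8.5, 1.5) -> 8.5 ± 1.5 working days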

Stage0 (Starting condition): Day0

The pre-production infrastructure [2] as it is today (pilot/experimental services excluded):

  • deployment testing:
    • 6 PPS sites covering the test of 26 services
    • sites and nodes registered in GOCDB (TYPE=PPS)
    • commitments registered in PPS registry
    • centrally coordinated (task distribution and reporting)
    • communication via e-mail (tracked via a "PPS release" GGUS ticket )
  • release testing:
    • 5 production sites covering the test of 6 services
    • sites registered in GOCDB (TYPE=Production)
    • commitments registered in PPS registry
    • upgrade of the sites not mandatory within the release process
    • positive or negative acknowledgement required from the sites for update requests
    • positive or negative feedback required from the sites that applied the upgrade
    • communication done via e-mail (tracked via a "production release" GGUS ticket)
  • ~ 3-4 active PPS sites not involved in any of the above
    • sites and nodes registered in GOCDB (TYPE=PPS)
    • completely managed by the ROCs
    • no acknowledgement required for the upgrades
    • notified of new releases via EGEE broadcasts
  • SAM monitoring infrastructure
  • gridmap display
  • list of PPS sites and services (taken from the GOCDB) available

Stage1 (PPS sites acting as EA): Day0 + 16 ± 10 = Day1

New set of procedures, documentation and tools in place and used by PPS and PROD sites currently supporting the deployment testing and the release testing

  • ~10 PPS and PROD sites mimicking the EA of all the available products/components (those released in independent repositories)
  • sites and nodes registered in GOCDB (TYPE=PPS and PROD)
  • commitments registered in PPS registry
  • upgrade of the sites mandatory
  • feedback on the installation required from the sites that applied the upgrade
  • installation feedback used as replacement of the "deployment test reports"
  • installation feedback fetched and published automatically from the tasks
  • central EA coordinator responsible for:
    • task distribution
  • EA management procedure available (for operations managers)
  • EA operational procedure available (for sites)
  • communication via Savannah tasks (tracked)
  • SAM results available
  • SAM display including EA sites available
  • gridmap display including EA sites available
  • comprehensive list of EA sites available (from PPS registry)

Tasks for the transition Stage0 → Stage1

  • Ts0s1-1 : 2pd: prepare information page with links to "delta" repositories and release notes
  • Ts0s1-2 : 2pd: re-group the sites in the PPS registry in order to make a single group of "EA" sites out of the "deployment testing" and "release testing" sites. Update the PPS documentation accordingly.
  • Ts0s1-3 : 2pd: create new templates for notifications to the EA sites (communication via e-mail).
  • Ts0s1-4 : 2pd: change the "PPS deployment test reports" page into a new display showing the installation feedback provided by the EA sites
  • Ts0s1-5 : 1pd: Update (remove) the gLite release procedure for PPS (this is part of a bigger change to be done at the level of the general release procedure in order to reflect the new deployment model).
  • Ts0s1-6 : 1pd: Contact the sites (meeting) to make sure that the changes in the operational procedure are understood and accepted by everybody
    • The most relevant operational change is for the "release testing" sites: with the new process a deadline is set for the upgrade
  • Ts0s1-7 : 1pd: Interface PPS registry --> Web pages: automated comprehensive list of EA sites
  • Ts0s1-8 : 1pd: create new templates for notifications to the EA sites (communication via tasks)
  • Ts0s1-9 : 4pd: Re-write the "deployment test report" page in such a way that the information is fetched and aggregated from the completed upgrade_tasks.
  • Ts0s2-10 : 4pd: develop merged SAM display
  • Ts0s2-12 : 4pd: develop merged gridmap display
  • Ts0s1-13 : 1pd: Update the EA management procedure.
  • Ts0s1-14 : 1pd: Update the EA operational procedure.
  • Ts0s1-15 : 1pd: Contact the sites (meeting) to make sure that the changes in the operational procedure are understood and accepted by everybody
    • the most relevant operational change for sites is the use of Savannah tasks instead of e-mails
  • Ts1s1-16 : 3pd: Set-up of displays for task-tracking and performance metrics (e.g. time to close the tasks)

Responsibility: PPS coordination, (SAM)

Estimated overall elapsed: 10 ± 10 working days

Stage2 [conditional on availability of the GOC DB and the topology database] (PPS sites acting as EA, commitments in GOCDB): MAX(Day1, DayG, DayT) + 8.5 ± 1.5 = Day2

PPS and PROD sites acting as EAs. Communication based on tasks.

  • ~10 PPS and PROD sites mimicking the EA of all the available products/components (those released in independent repositories)
  • sites and nodes and commitments registered in GOCDB (TYPE=PPS and PROD)
  • upgrade of the sites mandatory
  • feedback on the installation required from the sites that applied the upgrade
  • installation feedback used as replacement of the "deployment test reports"
  • installation feedback fetched and published automatically from the tasks
  • central EA coordinator (OU) responsible for:
    • task distribution
    • overall monitoring of operational issues related to the release
  • EA management procedure available (for operations managers)
  • EA operational procedure available (for sites)
  • communication via tasks (tracked)
  • SAM results available
  • EA sites grouped in topology database
  • comprehensive SAM display available
  • comprehensive gridmap display available
  • comprehensive list of EA sites available (from topology DB)

Tasks for the transition Stage1 → Stage2

Prerequisites:

  1. flag "Early Adopter" available in the GOCDB (DayG)
  2. topology database in production (DayT)

  • Ts2s3-1 : 2pd: map the commitments from the PPS registry into the GOCDB
  • Ts2s3-2 : 2pd: Interface GOCDB --> topology DB: export of EA sites.
  • Ts2s3-3 : 2pd: Interface topology DB --> SAM: creation of SAM display for EA sites.
  • Ts2s3-4 : 3pd: Gridmap: creation of Gridmap display for EA sites.
  • Ts2s3-5 : 1pd: Interface topology DB --> Web pages: automated comprehensive list of EA sites

Responsibility: PPS coordination, (SAM), (OAT)

Estimated overall elapsed: 8.5 ± 1.5 days

Stage3 [start of Phase 2] (extension to production sites, responsibility transferred to the regions): Day2 → End of EGEE-III (final stage)

EA process fully in place

  • production sites doing EA of all the available products/components (those released in independent repositories)
  • commitments of production sites negotiated by regions/NGIs
  • sites, nodes and commitments registered in GOCDB (TYPE=PROD)
  • upgrade of the sites mandatory
  • feedback on the installation required from the sites that applied the upgrade
  • installation feedback used as replacement of the "deployment test reports"
  • installation feedback fetched and published automatically from the tasks
  • central EA coordinator (OU) responsible for:
    • task distribution
    • overall monitoring of operational issues related to the release
    • identification of areas of improvement
  • local EA coordinators (NGIs) co-responsible for:
    • creation and distribution of local EA tasks
    • monitoring and escalation of local operational issues related to the release
  • EA management procedure available (for operations managers)
  • EA operational procedure available (for sites)
  • communication via tasks (tracked)
  • SAM results available
  • EA sites grouped in topology database
  • comprehensive SAM display available
  • comprehensive gridmap display available
  • comprehensive list of EA sites available (from topology DB)

Tasks for the transition Stage2 → Stage3

  • Ts3s4-1 : n.a.: recruiting. Information campaign to recruit more production sites in order to meet the service targets
  • Ts3s4-2 : n.a.: training. Personnel from the regions/NGIs are expected to be an active part of the EA programme in order to:
    • support the process by identifying and recruiting appropriate sites
    • extend the process by issuing local EA programs (and sharing the results by using common tools)

Responsibility: SA1 coordination, ROC/NGI managers

Estimated overall elapsed: n.a.

Production sites willing to join

Production sites willing to be part of the EA group are allowed to step in at any of the stages described above.

Related Documents

  1. (SA1/SA3) Staged roll-out of grid middleware: General Lines - Antonio Retico et al.
  2. EGEE Pre-Production Service Description - Antonio Retico
