Staged roll-out brainstorming series: release management

  • Date: Fri 05 Jun 2009
  • Agenda: na
  • Description: Staged roll-out brainstorming series: release management
  • Chair: none

Attendance

Oliver, Maarten, Antonio, Maite, Steve

Purpose of the meeting

General goal: to agree between SA3 and SA1 on a strategy for a phased roll-out of the middleware to a number of "early adopter" sites. The process has to be compliant with the future middleware distribution model from EGI_DS. We focus our analysis on two real use cases: WMS 3.2.1-4 (now in PPS) and BDII 5 (currently in certification).

Material:

Relevant points of discussion

short summary of the last meeting

Clarification. The repositories we are talking about are "operations" repositories managed by EGI.eu MU on behalf of the OU. The whole staged roll-out process is transparent to the product teams. From the perspective of the product team the release "to production" is completed when they release the mw into the BETA repository.

release management

Several points were left open from the last meeting, all of which could be reduced to the same key question:

How should the release to the early adopter sites be treated: is it a full release?

  1. as a PPS-like release
    • announced separately to friendly sites
    • documented by a separate release page
  2. as a standard production release
    • documented in the mw default release pages
    • announced to everybody with a broadcast

Oliver re-phrased the two options as opt-in and opt-out models for the sites respectively.

The pros and cons of the opt-in and opt-out approaches to the release were discussed in no particular order. In the end the opt-in approach was selected as preferable and is therefore described in more detail.

opt-in
DESCRIPTION

  • The default Update pages stay unchanged showing only version N
  • The possibility of using a newer version (N+1) than the one in the production repository (N) is advertised on a mw Welcome page that describes how to register for the process
  • Sites willing to be early adopters of a given service have to explicitly "sign in" (e.g. by subscribing to a mailing list) for that service
  • The announcement of update N+1 is made through these restricted channels. The announcement points to the "NEXT" mw Update pages
  • "NEXT" mw Update pages exist, showing the history of the updates for a given service including N+1
  • After the quarantine, the NEXT release pages are bulk-copied into the CURRENT release pages (or deleted upon failure)
  • Broadcasts to all production sites are sent only after the quarantine period has successfully passed
PROS
  • production system well protected (hard for a newbie site to step on an unstable release by mistake)
  • Structure of the announcements unchanged
  • easier incident management (controlled lists of early adopters)
CONS
  • more difficult for sites to realise that the early adoption process exists (needs good communication at the central level)
  • more expensive for MU to manage and for SA3 to transition to (double release pages)
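
The opt-in flow described above can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names (ServiceRelease, sign_in, announce_next, end_quarantine) and the site name are hypothetical and do not correspond to any existing tool. The meeting did not prescribe an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRelease:
    """Hypothetical model of the CURRENT and NEXT release pages of one service."""
    service: str
    current: list = field(default_factory=list)       # versions on the default (CURRENT) pages
    next_pages: list = field(default_factory=list)    # versions shown only on the NEXT pages
    early_adopters: set = field(default_factory=set)  # sites that explicitly signed in

    def sign_in(self, site):
        # Opt-in: a site must register explicitly for this service.
        self.early_adopters.add(site)

    def announce_next(self, version):
        # Update N+1 appears on the NEXT pages only and is announced
        # only to the registered early adopters (returned sorted).
        self.next_pages.append(version)
        return sorted(self.early_adopters)

    def end_quarantine(self, success):
        # On success the NEXT pages are bulk-copied into CURRENT and a
        # broadcast goes to all production sites; on failure they are deleted.
        if success:
            self.current.extend(self.next_pages)
        self.next_pages.clear()
        return "broadcast" if success else "no broadcast"


wms = ServiceRelease("WMS", current=["N"])
wms.sign_in("site-a")                 # hypothetical early adopter site
wms.announce_next("N+1")              # only site-a is notified
wms.end_quarantine(success=True)      # N+1 now appears on the CURRENT pages
```

The point of the sketch is that the default pages never show N+1 until the quarantine has ended successfully, which is what protects sites that did not sign in.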

opt-out
DESCRIPTION

  • announcements of update N+1 are made to everybody, with disclaimers ("please update only if ...")
  • not further detailed because the model was rejected
PROS
  • transparent from the documentation perspective (version N+1 is the top version of the mw release pages).
  • Less work for MU to manage and for SA3 to transition to
  • easier access for sites wanting to try out the latest and greatest version
CONS
  • the majority of sites (including the inexperienced ones) have to bear in mind that the stable version is not N+1 but the last but one (N). This is an important change in the distribution model (eventually to be hammered in with a very loud campaign)
  • difficult failure management (if the middleware fails there is no direct control over who has upgraded what)

Serialisation of releases

The model proposed above still implies that the roll-out of Updates N, N+1, and N+2 for a given service is serialised. In other words, the staged roll-out of N+2 cannot start until the quarantine period of N+1 is finished.
However, this should cause less pain in the future, because the different products will be separated and released with all their components managed independently.
Leap-frogging of versions towards production will be allowed for exceptional reasons (e.g. security updates), possibly by acting on the release number of the meta-package (see previous meeting for details).
In general, if the roll-out of version N+2 needs to be brought forward, this will be done by replacing version N+1 outright in the process (by obsoletion).
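
A minimal sketch of the serialisation rule, assuming a single staged roll-out slot per service. The names used here (RolloutSlot, start, finish_quarantine, replace) are hypothetical illustrations and not part of any agreed tooling.

```python
class RolloutSlot:
    """Hypothetical single staged roll-out slot for one service."""

    def __init__(self):
        self.in_quarantine = None   # version currently in staged roll-out, if any
        self.obsoleted = []         # versions replaced before reaching production

    def start(self, version):
        # Serialisation: a new roll-out cannot start while another
        # version is still in its quarantine period.
        if self.in_quarantine is not None:
            raise RuntimeError(
                f"{self.in_quarantine} is still in quarantine; cannot start {version}"
            )
        self.in_quarantine = version

    def finish_quarantine(self):
        # Frees the slot and returns the version that completed quarantine.
        version, self.in_quarantine = self.in_quarantine, None
        return version

    def replace(self, version):
        # Exceptional case (e.g. a security update): the newer version
        # obsoletes the one in quarantine and takes over the slot.
        if self.in_quarantine is not None:
            self.obsoleted.append(self.in_quarantine)
        self.in_quarantine = version
```

With this model, calling start("N+2") while N+1 is in quarantine raises an error, whereas replace("N+2") records N+1 as obsoleted and puts N+2 in its place.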

How to motivate sites to be EA

Sites adopting young releases could see their performance reports (availability and reliability) suffer as a result. They should be rewarded somehow for taking the risk, otherwise partners would step back. Currently the trade-off we are offering is "time vs. stability": sites may get updates they are directly or indirectly interested in earlier, but they have to pay a price for that (putting their stability at risk and committing to give feedback).

Managing incidents in the PRODUCTION repository

Maarten commented on the use of the EPOCH number, as proposed by Steve, as a mechanism for a fast roll-back. He pointed out that the mechanism was analysed several years ago and then discarded because it was not correctly supported by rpm, apt, yum and quattor. Steve confirmed that quattor now supports the EPOCH. It was generally agreed that the approach is cleaner from the perspective of the rpm version numbers displayed. Apparently, though, there is no process gain, in the sense that the rpms still need to be re-built in order to change the epoch.
So, as we did previously for the "downgrading by upgrading" practice, the use of this mechanism is not set as an operational requirement but is left as a possible implementation of the roll-back process, the application of which may be decided case by case by the integrators.
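
To illustrate why an epoch allows a roll-back to be delivered as an upgrade: in RPM the epoch is compared before the version and release, so a package rebuilt with a higher epoch is considered newer even if its version is lower. The sketch below is a deliberately simplified comparison that handles dotted numeric versions only; it is not the actual rpm algorithm, and the function names and version numbers are made up for illustration.

```python
def parse_evr(evr):
    """Split 'epoch:version-release' into a comparable tuple.

    Simplification: version and release are assumed to be dotted integers;
    real rpm version comparison also handles letters and other separators.
    A missing epoch is treated as 0 in this sketch.
    """
    epoch, _, rest = evr.rpartition(":")
    version, _, release = rest.partition("-")
    to_ints = lambda s: tuple(int(p) for p in s.split(".")) if s else ()
    return (int(epoch) if epoch else 0, to_ints(version), to_ints(release))


def is_newer(candidate, installed):
    # The epoch is the first element of the tuple, so it dominates.
    return parse_evr(candidate) > parse_evr(installed)


print(is_newer("3.1.0-2", "3.2.0-1"))    # False: a plain downgrade
print(is_newer("1:3.1.0-2", "3.2.0-1"))  # True: older code rebuilt with epoch 1
```

The second call also shows the limitation noted in the discussion: the older package has to be rebuilt with the new epoch before it can be offered as an upgrade.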

Roll out of BDII5.0 and WMS3.2

We decided not to use these packages to pilot the staged roll-out process in this phase, because the process has to be consolidated first and a premature, bad or incomplete implementation may result in losing consensus.

Plans for the immediate future

  • Antonio will draft a document outlining the process we discussed, as well as some slides to be presented to the ROC managers at the next face-to-face SA1 coordination meeting
  • Antonio will start a document describing in more detail the implications for SA1 (changes and implementation plan for the upcoming year), with reference to the analogous document that Oliver and Andreas are writing for SA3

Summary of decisions

  • use of a "delta" Next repository
  • no PPS naming convention for meta-packages
  • no gLite update number
  • sign-in policy for sites
  • NEXT version of release pages

Thanks to all participants for their contribution


Topic revision: r1 - 2009-06-05 - AntonioRetico
 