EGEE emergency procedure for the CERN site in case of an Important but not Major Incident.
An important incident that is not a major incident is typically one that
concerns an outage of a single service or of several closely related
services.
Examples are: failure of a number of AFS machines, important network routers
malfunctioning, important external Internet links unavailable, severe
database problems, etc.
During extended working hours 8:00-18:00 Mon-Fri
Either the CERN ROC or the Manager On Duty (MOD) will take care of declaring the downtime in the GOCDB or
sending an EGEE broadcast, as appropriate. If a broadcast is sent, the IT manager on duty
(mod@cernNOSPAMPLEASE.ch) and the CERN ROC
(egee-roc-cern@cernNOSPAMPLEASE.ch) should be copied as well, if not already included as recipients.
Details on how to declare a downtime in the GOCDB can be found at
http://cic.gridops.org/index.php?section=home&page=SDprocedure
The link below should bring you directly to the form to declare a downtime for CERN-PROD.
To access the GOCDB, a certificate needs to be loaded in the browser.
https://goc.gridops.org/downtime/add?site_path=1.1.2
Alternatively, the CERN-PROD site page has a link at the bottom to declare a downtime:
https://goc.gridops.org/site/list?id=160
Outside of extended working hours including weekends and holidays
There is nothing foreseen at the moment.
EGEE emergency procedure for the CERN site in case of a Major Incident.
By Major Incident we mean an incident where, for instance, the network is down, there is an
electrical problem in the Computing Centre (CC), or a large part of the CERN infrastructure is unavailable.
Other examples are loss of ventilation, fire, flooding, a serious electrical problem, a major network outage, etc., and any incident that
involves serious personal injuries.
CERN is referred to in the EGEE jargon as the site "CERN-PROD".
During extended working hours 8:00-18:00 Mon-Fri
The MOD (Manager On Duty) will send a broadcast using the CIC portal and declare the downtime in the GOCDB (details provided above).
To send an EGEE broadcast, one should have a CERN CA certificate loaded in the browser. Please refer to
the [[https://ca.cern.ch/ca/][CERN CA website]] if you need assistance in requesting a certificate or loading it into your browser.
The EGEE broadcast page is at:
https://cic.gridops.org/index.php?section=rc&page=broadcast
The broadcast targets should be:
- WLCG Tier-1 contacts
- CIC on duty
- All ROC managers
- All production site admins
- All VO managers
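
As a small aid when preparing a broadcast, the target list above can be kept as a checklist. The sketch below is only an illustration: it does not talk to the CIC portal, the function name `missing_targets` is hypothetical, and the target names are simply the strings used in this procedure, which may not match the portal's labels exactly.

```python
# Broadcast targets listed in this procedure for a major incident.
REQUIRED_TARGETS = [
    "WLCG Tier-1 contacts",
    "CIC on duty",
    "All ROC managers",
    "All production site admins",
    "All VO managers",
]


def missing_targets(selected):
    """Return the required targets absent from `selected`, in procedure order.

    Comparison ignores case and surrounding whitespace.
    """
    chosen = {name.strip().lower() for name in selected}
    return [t for t in REQUIRED_TARGETS if t.lower() not in chosen]


if __name__ == "__main__":
    # Example: two targets ticked so far; print what is still missing.
    for target in missing_targets(["CIC on duty", "All VO managers"]):
        print("Still to select:", target)
```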
The options used for the last broadcast, sent for the latest CERN-wide power cut, can be retrieved at:
https://cic.gridops.org/index.php?section=roc&page=broadcastretrieval&step=2&typeb=C&idbroadcast=33023
- Screen shot of a typical broadcast in case of a power cut:
Outside of extended working hours including weekends and holidays
The CC Ops will deal with the incident.
Their job will be to notify by phone a list of contacts of sites that have 24h support and
can send a broadcast on behalf of CERN.
A list of contacts can be found
here. It is extracted from the site contacts provided in the
GOC database (GOCDB).
Provisionally this list will be maintained by the CERN ROC. In the future this list will be moved to either the CIC portal or the
GOCDB itself and automatically updated there.
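
The four cases described on this page (important or major incident, inside or outside extended working hours) can be summarised as a small lookup. This is a minimal sketch, not part of any EGEE tool: the function names are hypothetical, times are assumed to be CERN local time, 18:00 itself is assumed to count as outside extended hours, and holidays are passed in by the caller because no holiday calendar is defined here.

```python
from datetime import datetime, time

# Extended working hours as stated in this procedure: 8:00-18:00, Mon-Fri.
EXTENDED_START = time(8, 0)
EXTENDED_END = time(18, 0)


def in_extended_hours(when, is_holiday=False):
    """Return True if `when` (assumed local time) is in extended working hours."""
    if is_holiday or when.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    # Assumption: the end boundary (18:00) is treated as outside hours.
    return EXTENDED_START <= when.time() < EXTENDED_END


def responder(major, when, is_holiday=False):
    """Map (severity, time) to the action named in this procedure."""
    if in_extended_hours(when, is_holiday):
        if major:
            return "MOD: send broadcast via CIC portal and declare downtime in GOCDB"
        return "CERN ROC or MOD: declare downtime in GOCDB or send EGEE broadcast"
    if major:
        return "CC Ops: phone contacts at sites with 24h support"
    return "Nothing foreseen at the moment"


if __name__ == "__main__":
    # A major incident on a Saturday morning goes to CC Ops.
    print(responder(True, datetime(2008, 6, 7, 9, 0)))
```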
Still to be clarified
- A list of services for which a notification should be sent during working hours if they are down or affected, and to what degree
(e.g. Remedy down for >30 mins).
- Any additions or modifications to the list of services provided to the operators for case 1 above.
--
DianaBosio - 06 Jun 2008