Sample operational model use cases [Ongoing work]

Some common use cases, based on fictional events, are briefly documented below.

Incident management

L3: Power outage at DE-KIT leading to routers down

Router operators at DE-KIT will open an LHCOPN TT:

  • Link IDs impacted are all those linked to DE-KIT (including DE-KIT-I-II-LHCOPN-001).
  • Sites impacted are all those linked to DE-KIT.
  • Ticket category is "Incident L3"
  • Ticket is self-assigned to DE-KIT
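
The "all those linked to" rule in the list above can be sketched as a lookup over a link table. This is only an illustration: the topology, the link IDs (`LINK-1` etc.) and the site names `SITE-A`/`SITE-B` are placeholders rather than real LHCOPN data, and counting the failing site itself among the impacted sites is an assumption.

```python
from typing import Dict, Set, Tuple

# Hypothetical topology: link ID -> the two sites it connects.
# LINK-1..3 and SITE-A/SITE-B are placeholders, not real LHCOPN data.
TOPOLOGY: Dict[str, Tuple[str, str]] = {
    "LINK-1": ("DE-KIT", "SITE-A"),
    "LINK-2": ("DE-KIT", "SITE-B"),
    "LINK-3": ("SITE-A", "SITE-B"),
}


def impact_of_site_outage(
    topology: Dict[str, Tuple[str, str]], site: str
) -> Tuple[Set[str], Set[str]]:
    """Return (impacted link IDs, impacted sites) when all routers at `site` are down."""
    # Every link with `site` as one of its endpoints is impacted.
    links = {link_id for link_id, ends in topology.items() if site in ends}
    # Impacted sites are the endpoints of those links (assumed to include `site`).
    sites = {end for link_id in links for end in topology[link_id]}
    return links, sites


links, sites = impact_of_site_outage(TOPOLOGY, "DE-KIT")
print(sorted(links), sorted(sites))
```

The two returned sets correspond to the "Link IDs impacted" and "Sites impacted" fields of the ticket.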

The LHCOPN TT is submitted and the LHCOPN TTS notifies all affected sites by e-mail. The router operators at DE-KIT then inform their local Grid data contacts of the LHCOPN outage (the power cut may not affect the whole datacenter) and provide a link to the LHCOPN TT.

Later, power is restored and the routers and links are fully up again. Router operators at DE-KIT then close the LHCOPN TT, recording the end date and time of the outage and adding any relevant details to the ticket. All affected sites are notified by the LHCOPN TTS. Router operators at DE-KIT also inform their local Grid data contacts that the outage has ended.

L2: Fibre cut between London and Didcot affecting CERN-RAL-LHCOPN-001

Router operators at UK-T1-RAL notice through their monitoring system that their link is down. They contact JANET, who report a fibre cut between London and Didcot; the outage duration is still unknown. UK-T1-RAL has no LHCOPN backup, so it is fully disconnected from the LHCOPN.

Router operators at UK-T1-RAL will then open an LHCOPN TT:

  • Impacted sites are CH-CERN and UK-T1-RAL
  • Impacted link ID is CERN-RAL-LHCOPN-001
  • Ticket category is incident L2
  • Ticket is self-assigned to UK-T1-RAL

The ticket is submitted and all impacted sites are notified by e-mail by the LHCOPN TTS.

UK-T1-RAL's Grid data contacts are notified through an internal site procedure, and a link to the related LHCOPN TT is provided to them.

Any relevant updates from JANET to UK-T1-RAL should be recorded in the LHCOPN TT by the router operators.

Later, JANET informs UK-T1-RAL that the link is repaired; the router operators confirm that the link is working and that L3 is up again. The LHCOPN TT is then closed by the router operators of UK-T1-RAL, with the end date and time of the outage specified in the ticket. All affected sites are notified by the LHCOPN TTS.

Local Grid data contacts are informed by the router operators that the outage has ended.
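
The ticket lifecycle in this use case (open, update, close with the end of the outage) can be sketched as a small record type. This is a minimal sketch and not the LHCOPN TTS data model: the class name `Ticket`, its field names and the example timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Ticket:
    """Hypothetical in-memory stand-in for an LHCOPN TT."""
    category: str
    impacted_sites: List[str]
    link_ids: List[str]
    assigned_to: str
    status: str = "open"
    outage_end: Optional[datetime] = None
    updates: List[str] = field(default_factory=list)

    def add_update(self, text: str) -> None:
        # Record information received from the network provider.
        self.updates.append(text)

    def close(self, outage_end: datetime) -> None:
        # Closing requires the end date and time of the outage.
        self.outage_end = outage_end
        self.status = "closed"


ticket = Ticket(
    category="Incident L2",
    impacted_sites=["CH-CERN", "UK-T1-RAL"],
    link_ids=["CERN-RAL-LHCOPN-001"],
    assigned_to="UK-T1-RAL",
)
ticket.add_update("JANET: fibre cut between London and Didcot, duration unknown")
ticket.close(datetime(2009, 1, 14, 12, 0))  # arbitrary example timestamp
print(ticket.status, ticket.outage_end)
```

In the real process each of these steps would also trigger an e-mail notification from the LHCOPN TTS to the impacted sites, which the sketch leaves out.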

Change management process

L3: Change of IOS version for LHCOPN's router at NDGF

NDGF needs to perform a low-priority IOS upgrade of their LHCOPN router. The outage is expected to last less than one hour.

Local Grid data contacts and linked sites will not be warned about the change, as the change itself (as opposed to its implementation) has no strong impact. The change is fully documented in the change management database for history and reference.

An informational ticket with a link to the change management database entry is put on the LHCOPN TTS. The LHCOPN TTS notifies by e-mail all sites potentially impacted (for instance all linked sites having a BGP peer, as the new IOS version could break routing).

In our case implementing the change will have an impact (all links down during the upgrade), so it must be carried out as a maintenance, and the L3 maintenance management process is started. The LHCOPN TT for the change remains open and assigned to NDGF.

Grid data contacts, and possibly linked sites, are contacted to find a suitable date for the disruptive maintenance on the router. The outage is expected to last less than one hour, so the notice delay is two days. An LHCOPN TT is therefore created at least two days in advance in the LHCOPN TTS (kind: maintenance L3, link IDs affected: all linked, impact: connectivity, assigned to NDGF), with the start and end dates filled in. All impacted sites are notified by e-mail by the LHCOPN TTS.
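
The notice rule used here (an outage shorter than one hour requires two days of notice) can be turned into a deadline calculation. This is a sketch under stated assumptions: the function names and the example date are made up for illustration, and since this page gives no notice delay for longer outages, the sketch refuses to guess one.

```python
from datetime import datetime, timedelta


def notice_delay(expected_outage: timedelta) -> timedelta:
    """Minimum notice before a maintenance, as far as this page defines it."""
    if expected_outage < timedelta(hours=1):
        return timedelta(days=2)
    # Delays for longer outages are not stated on this page.
    raise ValueError("notice delay not defined here for outages of one hour or more")


def latest_ticket_creation(start: datetime, expected_outage: timedelta) -> datetime:
    """Latest moment at which the maintenance ticket can still be created."""
    return start - notice_delay(expected_outage)


maintenance_start = datetime(2009, 2, 10, 8, 0)  # arbitrary example date
deadline = latest_ticket_creation(maintenance_start, timedelta(minutes=45))
print(deadline)
```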

Any site has one day to object if necessary; otherwise the maintenance is silently accepted. On the scheduled day the maintenance is performed, after which the LHCOPN TT is updated and closed by the router operators in the LHCOPN TTS.

The maintenance is complete. The associated LHCOPN TT for the change is updated, at least with a reference to the maintenance, and closed. The change is complete.

L3: New IP prefixes for ES-PIC

The local ES-PIC Grid data contact is informed by the ES-PIC router operators about the new prefixes (some hosts may now be in the LHCOPN). Router operators at ES-PIC fully document the change in the change management database.

An LHCOPN TT is created summarizing the change (kind is "Informational", the ticket is assigned to ES-PIC and the URL of the change management database entry is provided). All sites are impacted and will be notified (so that they can update their filters, for instance).

DANTE Operations and the ENOC are informed by e-mail so that the monitoring system can be updated (new prefixes might need to be monitored).

The change has no impact on existing services, so no maintenance needs to be performed. The LHCOPN TT can be closed (the ticket can even be created directly in the closed state).

L2: New LHCOPN link CERN-TRIUMF-LHCOPN-00X

CA-TRIUMF has ordered a new L2 link to CH-CERN. CA-TRIUMF has just been informed by its network providers that the link is now available. Router operators have configured L3 and the link is working fine. The process is an L2 change management, embedding some L3 changes (IPs...).

Router operators at CA-TRIUMF will present the changes to their local Grid data contact (new bandwidth, new resiliency options, etc.). They will record the new link in the change management database and update all required technical information on CERN's twiki (new link ID, new network map, new L3 addresses, new NOC contacts...).

  • An Informational ticket is put on the LHCOPN TTS and at least all affected sites are notified (US-T1-BNL, CH-CERN, NL-T1, because they can benefit from the new resiliency options or bandwidth).
  • DANTE Operations are asked to adapt the monitoring (MDM and E2EMON) for the new link.
  • The ENOC is asked to update the BGP monitoring and to have the new link ID supported by the LHCOPN TTS.

This change has no adverse impact on existing services, so there is no need to implement it as a maintenance. The link is then in production.
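
The three change management use cases above share one decision: whether implementing the change disturbs existing services. A rough sketch of that branching is given below. The function name and the wording of the steps are illustrative paraphrases of this page, not an official procedure.

```python
from typing import List


def change_steps(has_service_impact: bool) -> List[str]:
    """List the steps of a change, depending on whether it disturbs services."""
    steps = [
        "document the change in the change management database",
        "open an Informational LHCOPN TT linking to the database entry",
    ]
    if has_service_impact:
        # As in the NDGF IOS upgrade: the change is implemented as a maintenance.
        steps += [
            "keep the change ticket open",
            "run the maintenance management process",
            "update the change ticket with the maintenance reference",
        ]
    # As for ES-PIC and CA-TRIUMF, a change without impact is closed directly.
    steps.append("close the change ticket")
    return steps


for step in change_steps(has_service_impact=True):
    print("-", step)
```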

Maintenance management process

L3: Reboot of routers at CH-CERN

CH-CERN needs to reboot routers l513-c-rftec-1 and l513-c-rftec-2 (this is not a change, as the event is only a reboot). The impact should last less than one hour, hence the notice window is two days.

At least two days before the event, router operators at CH-CERN contact their local Grid data contacts to make sure there is no special event on the Grid side (service challenge, key transfers...). Optionally, all affected sites can be asked whether the date suits them. For major outages this is done informally by e-mail. As a router reboot should be short, it is not done in this case.

Once a suitable date is agreed, a ticket is created in advance in the LHCOPN TTS by the router operators of CH-CERN (Category: "Maintenance L3", Impacted sitenames: All, assigned to CH-CERN, network problem = connectivity, links impacted = all linked to CH-CERN). The scheduled start and end dates are also filled in.

The ticket is submitted and, after one day without objection, the maintenance is silently accepted by everyone. The LHCOPN TT ID is provided to the local Grid data contacts for reference.
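
The silent acceptance rule (one day to object, otherwise the maintenance is accepted) can be sketched as a status check. The status labels `pending`, `accepted` and `contested` are assumptions introduced for this example; this page does not say what happens once a site objects.

```python
from datetime import datetime, timedelta
from typing import List

OBJECTION_WINDOW = timedelta(days=1)  # "one day to complain", as stated above


def maintenance_status(
    submitted_at: datetime, now: datetime, objections: List[str]
) -> str:
    """Classify a maintenance ticket; `objections` lists the sites that objected."""
    if objections:
        return "contested"  # hypothetical label, follow-up is not defined here
    if now - submitted_at >= OBJECTION_WINDOW:
        return "accepted"  # nobody objected within the window
    return "pending"


submitted = datetime(2009, 2, 8, 8, 0)  # arbitrary example date
print(maintenance_status(submitted, submitted + timedelta(hours=30), []))
```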

Later the maintenance is performed, the LHCOPN TT is updated and closed. The event is terminated.

L2: USLHCNET's scheduled power cut for devices in Chicago

USLHCNET reports to US-FNAL-CMS and CH-CERN that devices in Chicago will be down for two hours due to a scheduled power cut. The fictional impact is that US-FNAL-CMS is fully disconnected. This is a T0-T1 outage, so responsibility lies with the T1: US-FNAL-CMS is responsible for the event on behalf of the LHCOPN community.
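
The responsibility rule stated above (for a T0-T1 outage, the T1 is responsible) can be written as a small lookup. Only that one rule is taken from this page: the `TIERS` table and the function name are assumptions for illustration, and other site pairings are rejected rather than guessed.

```python
from typing import Dict

# Tier of each site in this use case (CH-CERN as T0, US-FNAL-CMS as T1).
TIERS: Dict[str, int] = {"CH-CERN": 0, "US-FNAL-CMS": 1}


def responsible_site(site_a: str, site_b: str, tiers: Dict[str, int] = TIERS) -> str:
    """Return the site responsible for an outage on the link between two sites."""
    tier_a, tier_b = tiers[site_a], tiers[site_b]
    if {tier_a, tier_b} == {0, 1}:
        # T0-T1 outage: the T1 carries the responsibility.
        return site_a if tier_a == 1 else site_b
    # Other pairings are not covered by the rule stated on this page.
    raise ValueError("responsibility is only defined here for T0-T1 outages")


print(responsible_site("CH-CERN", "US-FNAL-CMS"))
```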

Router operators at US-FNAL-CMS will open an LHCOPN TT (Impacted: US-FNAL-CMS and CH-CERN, category: "Maintenance L2", Impact: connectivity) reflecting the outage. The scheduled start and end times are filled in from the information supplied by USLHCNET. The LHCOPN TT is submitted and all impacted sites are notified by the LHCOPN TTS. Local Grid data contacts are also informed by US-FNAL-CMS's router operators so that they can warn the Grid.

The GGUS ticket about the outage is updated by US-FNAL-CMS's router operators. Later, when the outage is over, the LHCOPN TT for the maintenance is updated and closed.

Topic revision: r5 - 2009-01-14 - GuillaumeCessieux