LHCOPN Web>OpsFAQ (2009-09-29, GuillaumeCessieux)

FAQ on the operational model

This is an FAQ about the LHCOPN operational model. Feel free to send suggestions for additions to Guillaume.

If my primary link is down, is this a connectivity problem or a performance problem?

The GGUS ticket should contain:
- Problem Kind: "Connectivity", because a link is down.
- Service Impacted: "None service affecting", "Performance degradation" or "Possible performance degradation", depending on your topology and the expected link status.
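The two fields above can be sketched as a small helper. This is only an illustration: the function name and the dictionary keys are hypothetical and do not correspond to any GGUS API. The choice among the three "Service Impacted" values is left to the caller, because it depends on the site's topology.

```python
# Values quoted from the answer above; the helper itself is hypothetical.
ALLOWED_IMPACTS = (
    "None service affecting",
    "Performance degradation",
    "Possible performance degradation",
)


def link_down_ticket_fields(service_impacted):
    """Return the ticket fields for a 'primary link down' event.

    The problem kind is always "Connectivity" for a link down; the caller
    picks the service impact according to its own topology.
    """
    if service_impacted not in ALLOWED_IMPACTS:
        raise ValueError(f"unexpected Service Impacted value: {service_impacted!r}")
    return {
        "Problem Kind": "Connectivity",
        "Service Impacted": service_impacted,
    }


# Example: a site whose backup path might be slower than the primary link.
fields = link_down_ticket_fields("Possible performance degradation")
```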

Should we report an incident as soon as possible and then investigate, or the other way round?

Either is acceptable. You can wait, within reason, until you are able to open an accurate ticket, or you can quickly open a rough ticket to give the event visibility before you start investigating. For issues with a strong impact on the service, opening a ticket quickly is preferred so that the problem is broadcast.

I am a site and I have detected a problem, but I cannot tell which site is faulty. What should I do?

Assign the ticket to CH-CERN (the default catch-all assignment). CERN may be able to use its central position to troubleshoot the issue before reassigning the ticket to the faulty site.
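The assignment rule can be written as a one-line default. This is a minimal sketch: the function name and the example site name are hypothetical, and only CH-CERN comes from the FAQ.

```python
DEFAULT_ASSIGNEE = "CH-CERN"  # default catch-all assignment named in the FAQ


def choose_assignee(suspected_site=None):
    """Return the site a ticket should be assigned to.

    If the reporting site cannot identify the faulty site, the ticket goes
    to the catch-all assignee, who may reassign it later.
    """
    return suspected_site if suspected_site else DEFAULT_ASSIGNEE


unknown_case = choose_assignee()               # faulty site not identified
known_case = choose_assignee("EXAMPLE-SITE")   # hypothetical site name
```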

What is the process to release a new link (after it is just physically set up)?

The process is an L2 change management (the root change) that embeds the required L3 changes (IPs...).

The router operator updates the change management database with the new link and updates all the required technical information on CERN's twiki (new link ID, new network map, new L3 addresses, new NOC contacts, expected routing policies...). DANTE Operations are warned so that monitoring (MDM and E2EMON) is adapted for the new link. The ENOC is warned so that BGP monitoring is adapted and the new link ID is supported by the LHCOPN TTS. This happens in the background (neither DANTE nor the ENOC acts in the LHCOPN TTS), and the site is responsible for following it up outside the ticketing system. A link cannot be in production if it is not monitored. Contact details for operations are here.
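Since the site has to follow these background steps itself, a simple checklist can help. The sketch below is only an illustration: the step labels paraphrase the paragraph above, and the function name is hypothetical.

```python
# Steps paraphrased from the paragraph above; labels are illustrative.
REQUIRED_STEPS = (
    "change management database updated",
    "CERN twiki updated",
    "DANTE monitoring adapted (MDM and E2EMON)",
    "ENOC BGP monitoring adapted",
    "link ID supported by the LHCOPN TTS",
)


def missing_steps(done):
    """Return the required steps not yet completed, in checklist order."""
    return [step for step in REQUIRED_STEPS if step not in done]


# Example: records are updated but monitoring is still pending.
pending = missing_steps({
    "change management database updated",
    "CERN twiki updated",
})
```

An empty result means every step on this checklist has been followed up; it does not by itself make the link a production link.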

If the change does not disturb existing services (i.e. it is only an infrastructure enhancement), there is no need to implement it as a maintenance; an informational ticket is enough. If it does have an impact, a maintenance ticket should be used instead of the informational ticket. The ticket (with the box 'this is a change' checked) is then put on the LHCOPN TTS, and all affected sites are ticked so that they are warned (they may benefit from new resiliency possibilities or bandwidth). The time window is free, but monitoring the link for one week before claiming production quality is often reasonable (are link downtimes detected, and are you notified about them?). Once the ticket is closed, the link is officially in production.
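The choice of ticket type described above can be sketched as follows. The function and key names are hypothetical and do not reflect the LHCOPN TTS interface; the site names in the example are placeholders.

```python
def plan_change_ticket(has_service_impact, affected_sites):
    """Describe the ticket to open for releasing a new link.

    A change that disturbs existing services needs a maintenance ticket;
    otherwise an informational ticket is enough. In both cases the
    'this is a change' box is checked and all affected sites are ticked.
    """
    return {
        "ticket_type": "maintenance" if has_service_impact else "informational",
        "this_is_a_change": True,
        "sites_to_notify": sorted(set(affected_sites)),
    }


# Example: a pure infrastructure enhancement affecting two placeholder sites.
ticket = plan_change_ticket(False, ["SITE-B", "SITE-A", "SITE-B"])
```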

The router operator will also present the changes to their local Grid Data contact (new bandwidth, new resiliency possibilities for the project, etc.).

The backup test process should also be launched (regularly) to verify any (newly) claimed resiliency possibilities.

Topic revision: r6 - 2009-09-29 - GuillaumeCessieux
