LHCOPN Web>OperationalModel (2010-11-24, GuillaumeCessieux)

Proposed LHCOPN operational model

This operational model covers only the LHCOPN; the precise list of sites and links is given on the page NamingConventionAndLinksIDs. An LHCOPN link is a dedicated link, part of the network specifically put in place to distribute data from the T0 to the T1s.

Foundations

Drawing conventions

The main lines of each process are explained below its drawing, in a hierarchical way:
1.1 Process one, step one
1.2 Process one, step two

2.1 Process two, step one
2.2 Process two, step two

The steps of a process should be carried out in order, although it is often possible to change the order of steps without breaking the overall process.

Actors

LCG: Large Hadron Collider Computing Grid
L2 NOC: Network operating centre for L2 services
DANTE: Delivery of Advanced Network Technology to Europe http://www.dante.net/
NREN: National Research and Education Network http://en.wikipedia.org/wiki/National_research_and_education_network
GÉANT2: The Pan-European network http://www.geant2.net/ , managed by DANTE
Sites: LHC T0/T1s
Router Operators: People in charge of network devices on sites
Grid Data contact: People in charge of the data transfers occurring on the LHCOPN. They are the main users of the LHCOPN.
- This is a generic role in charge of the interactions with the Grid world (impact assessment & broadcasting...)
- Could be implemented by anybody, but e.g. Grid people
DANTE Operation:
- Role: Supervising and coordinating L2 and L3 monitoring deployment
ENOC - EGEE Network operating centre
- Role: Help design processes for the LHCOPN, fit them to Grid operations and drive the design of the LHCOPN TTS
LQA: LHCOPN Quality assessment
- Role: Statistics and assessment of infrastructure and processes

Actors and information repositories management

The responsibility depicted is about setting up information repositories and keeping them working, not about their contents.

Information repositories location:

L2 monitoring: http://stats.geant2.net/e2emon/mon/G2_E2E_index_PROD.html [DANTE account needed]
- Each L2 NOC is responsible for its own probe
- DANTE Operation is responsible for supervising and coordinating the deployment
L3 monitoring
- MDM: http://lhcopn-mdm.geant.net/portal/
The global web repository is CERN's twiki. It must not store any tickets: they belong in the LHCOPN TTS.
- Operational procedures: https://twiki.cern.ch/twiki/bin/edit/LHCOPN/OperationalModel
- Operational contact: https://twiki.cern.ch/twiki/bin/view/LHCOPN/OpsContacts
- Technical information: https://twiki.cern.ch/twiki/bin/view/LHCOPN/WebHome part "Technical Information"
- Change management DB: https://twiki.cern.ch/twiki/bin/view/LHCOPN/ChangeManagementDatabase - Private area on CERN's twiki
- Statistic reports: https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnStatistics - Private area on CERN's twiki
LHCOPN TTS: Helpdesk within the GGUS system. It stores LHCOPN-related network tickets: https://gus.fzk.de/lhcopn [Read view with any valid certificate]
- EGEE SA1 manages the LHCOPN TTS (the ENOC - EGEE network operating centre - drives its design)
The planning is a particular view of LHCOPN trouble tickets (i.e. a mapping of pending tickets onto a calendar): https://cclhcopnmon.in2p3.fr/LHCOPN/webcalendar/

The private area on CERN's twiki is accessible (read and write) to anyone authenticated on CERN's twiki (Nice account, twiki light account...). A registration page is available. The aim is not to prevent people from accessing information but rather to avoid potentially sensitive information (changes in IPs, ACLs, security, reports on the LHCOPN...) being fully disclosed and indexed by web crawlers.

Information access

Processes

The reporting thresholds are:

Any event with an impact on the service must be reported, with at least one ticket per issue (not per event).
Non-service-impacting events lasting more than 1 hour or occurring more than 5 times an hour should be reported in the TTS.
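The thresholds above can be sketched as a small decision function. This is only an illustration: the `Event` class, its field names and `must_be_reported` are hypothetical and do not exist in the LHCOPN TTS.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical description of a network event (names are assumptions)."""
    service_impacting: bool
    duration_hours: float       # how long the event lasted
    occurrences_per_hour: int   # how many times it occurred within one hour


def must_be_reported(event: Event) -> bool:
    """Apply the reporting thresholds of the operational model."""
    # Any event with an impact on the service is always reported.
    if event.service_impacting:
        return True
    # Events without service impact are reported only above the thresholds:
    # lasting more than 1 hour, or occurring more than 5 times an hour.
    return event.duration_hours > 1 or event.occurrences_per_hour > 5
```

Note that the function only decides whether to report; the rule of one ticket per issue rather than per event still has to be applied when opening the ticket.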

Global Problem management processes

This process aims to address problems with cause and location still unknown.

It is initiated by a router operator, possibly triggered by Grid data contacts (low throughput experienced, etc.). The Grid Data contact is kept informed by the router operators.
After taking an overview of the current state of the LHCOPN (monitoring, ticket dashboard...) and making sure the problem is not already known and reported, the incident management process is started.
We follow a top-down approach, starting at L3 (routers, IP, BGP, filtering...) with the L3 incident management process.
If unsuccessful, we go to L2 (dark fibres...) with the L2 incident management process. It is the responsibility of router operators to distinguish between L2 and L3 problems.
If no previous process is able to tackle the issue within a reasonable delay, the Escalated incident management process is initiated by the router operator of the site noticing the problem.
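The top-down order described above (L3, then L2, then escalation) can be sketched as a simple progression. The `Process` enum and `next_process` function are hypothetical names used for illustration only; in practice the judgement of whether a step was "unsuccessful" lies with the router operator.

```python
from enum import Enum
from typing import Optional


class Process(Enum):
    L3_INCIDENT = "L3 incident management"
    L2_INCIDENT = "L2 incident management"
    ESCALATED = "Escalated incident management"


# Top-down order: start at L3, go to L2, then escalate.
ORDER = [Process.L3_INCIDENT, Process.L2_INCIDENT, Process.ESCALATED]


def next_process(current: Optional[Process], resolved: bool) -> Optional[Process]:
    """Return the process to run next, or None once the problem is resolved."""
    if resolved:
        return None
    if current is None:
        # Nothing has been tried yet: begin with the L3 process.
        return ORDER[0]
    # Move one step down the list; the escalated process is the last resort,
    # so an unresolved escalated problem stays in the escalated process.
    idx = ORDER.index(current)
    return ORDER[min(idx + 1, len(ORDER) - 1)]
```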

Incident management process

Even incidents already resolved when noticed should be reported (for post-mortem analysis, quality assessment, information, etc.).

L3 incident management process

Scope: Router down, BGP filtering, bad routing...
The source site is the site where the problem lies.

1.1 A ticket is created on the LHCOPN Helpdesk by the router operator of the source site to report the incident. It is assigned to the source site itself.
1.2 The router operator contacts its counterpart on the distant site (site-to-site communication) to find out whether something is wrong there (power outage...). If the problem is on the distant site, the distant site starts this process (the ticket is then re-assigned to the distant site).
1.3 If the problem is related to an underlying layer (L2: dark fibre outage...), the router operator starts the L2 incident management process. The router operator is responsible for managing the trouble with the L2NOC (opening and following the NOC's ticket...) and stays responsible for the LHCOPN ticket in GGUS.
1.4 Otherwise the router operator owns the problem and contacts its local Grid Data contact to report the impact. The distant router operator is also informed.

2 The LHCOPN TTS notifies all impacted sites about the incident

L2 incident management process

Scope: Dark fibre outages...

1.1 An L2NOC or a router operator may notice an L2 incident. They interact to confirm it or not. A router operator may also be alerted by the L3 incident management process through an LHCOPN ticket assigned to its site.
1.2 If confirmed, the router operator of a linked site opens a ticket in the LHCOPN TTS. The router operator is in charge of dealing with the L2 network providers involved and of reflecting the ongoing resolution in the LHCOPN TTS.
1.3 It is the responsibility of linked and affected sites to warn their Grid data contacts.

2 All impacted sites will be notified by the TTS.

3 If nothing is found at L2, the Escalated incident management process is started.

Escalated incident management process

If no previous process is able to tackle an issue (strange, erratic performance problem, filtering...), or if the resolution time seems unreasonable, this process is initiated by the router operator of the site noticing the problem. This process should be started after a delay of at most one week (i.e. at least for any incident lasting more than one week).

The router operator holds a phone conference with people from all potentially faulty domains involved to agree on a work plan to localise and fix the issue. The precise list of people and organisations attending depends on the outage and is chosen by the router operator. The existing ticket in GGUS is updated with the outcomes of the phone conference, and its priority is increased.
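The one-week limit can be checked mechanically. The sketch below assumes that the incident's opening time is known (for example from the ticket); `escalation_due` and `MAX_DELAY` are hypothetical names, and the operator may of course escalate earlier if the resolution time seems unreasonable.

```python
from datetime import datetime, timedelta

# Maximum delay before the escalated process should be started.
MAX_DELAY = timedelta(weeks=1)


def escalation_due(opened_at: datetime, now: datetime) -> bool:
    """True when the incident has lasted for more than one week."""
    return now - opened_at > MAX_DELAY
```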

Change management process

The change management process tracks and documents major changes occurring on the LHCOPN (infrastructure, routing, filtering,...).

A change without impact can be made at any time.
A change with impact MUST be implemented within a maintenance.

This process is distinct from maintenance, because a maintenance can take place without any change (for instance a power cut scheduled by the power supplier, a fibre needing to be cleaned...).

Major changes are at least: change in routing, change in filtering, new IP prefix, fibre change, change of IOS version.

There is no negotiation for changes; if necessary, negotiation takes place in the maintenance implementing the change.

To roll back a change there are two possibilities:

The rollback is done within the maintenance window: the maintenance is considered not done.
The rollback has to be done after the end of the maintenance window: another change process should be started to do the rollback.
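The two rollback cases can be expressed as a single comparison against the end of the maintenance window. This is a minimal sketch: `rollback_outcome` is a hypothetical helper, and treating a rollback exactly at the end of the window as still inside it is an assumption, since the text does not cover that boundary.

```python
from datetime import datetime


def rollback_outcome(rollback_at: datetime, window_end: datetime) -> str:
    """Return what the rollback means for the change and its maintenance."""
    # Assumption: a rollback exactly at the window end counts as inside it.
    if rollback_at <= window_end:
        return "maintenance considered not done"
    # After the window has closed, the rollback is itself a new change.
    return "start a new change management process"
```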

L3 change management process

Scope: IP addresses change, new prefix propagated, new filtering

The source actors for these changes are router operators.

1.1 The router operator presents the change to its Grid data contact (change in performance, new resiliency possibility...)
1.2 The router operator presents the change to affected sites (e.g. linked sites)

2.1 The change is fully documented in the change management database, and the technical information is also updated.
2.2 DANTE operation may be warned if the change has an impact on the monitoring (new IP to be watched, etc.). The site is responsible for ensuring and following up the update of the monitoring system.
2.3 ENOC may be warned to update L3 BGP monitoring and/or to trigger an update of the trouble ticket system. The site is responsible for ensuring and following this up.

3 If the change has an impact, an L3 maintenance management process is started to commit and broadcast the change. A link to the full documentation of the change must be provided (e.g. a URL to the global web repository).

If an L3 change impacts L2 (L3 VPN for instance), the L3 change management process is started, as it is the major event. If the change has no impact it may be done silently, but it must be accurately documented.

L2 change management process

This is a complex process: the lower the layer, the greater the impact. An L2 change may have an impact at L3 (new IP addresses for a new link...), but everything is handled in the L2 change management process, as it is the root event.

Scope: New LHCOPN L2 link, L2 link with new physical path, change of L2 network provider for a segment...

The sources of L2 changes are L2 network providers.

1.1 The L2NOC sends its change to the router operators of affected sites
1.2 The router operator presents the changes and impacts to its Grid data contact
1.3 The router operator presents the changes and impacts to the router operators of impacted sites

2.1 The change is documented by the router operator on the global web repository, and the technical information should be updated accordingly
2.2 DANTE operation may be warned if the change has an impact on the monitoring (new IP to be watched, etc.). The site is responsible for ensuring and following up the update of the monitoring system.
2.3 ENOC may be warned to update L3 BGP monitoring and/or to trigger an update of the trouble ticket system. The site is responsible for ensuring and following this up.

3 If the change has an impact, an L2 maintenance management process is started to commit the changes. Otherwise the change may be done silently, but it must always be accurately documented.

The Backup test process should be run whenever a new resiliency possibility becomes available, to validate it and to ensure nothing else is affected.

Maintenance management process

L3 maintenance management process

Scope: scheduled power outage on site, router IOS upgrade, ...

1.1 The router operator of the source site tries to find a suitable date with its local Grid Data contact
1.2 The date may also be negotiated - off the record - with all sites that could be affected by the maintenance (e.g. linked sites)

2 A ticket is created in the LHCOPN TTS by the router operator of the source site
3 All affected sites are notified by the LHCOPN TTS
4 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.

There is no public negotiation phase.

The notice window for announcing a maintenance depends on its impact:

Impact duration	Notice window
More than 1 hour	1 week
Less than 1 hour	2 days
No impact	1 day

This is compliant with the WLCG rules for scheduled downtimes.
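The notice window table can be sketched as a lookup on the impact duration. The function name `notice_window` is hypothetical. The table does not say what applies to an impact of exactly one hour; the sketch assumes the stricter one-week notice in that case.

```python
from datetime import timedelta
from typing import Optional


def notice_window(impact_duration: Optional[timedelta]) -> timedelta:
    """Minimum notice before a maintenance; None or zero means no impact."""
    if impact_duration is None or impact_duration <= timedelta(0):
        return timedelta(days=1)    # no impact: 1 day
    if impact_duration < timedelta(hours=1):
        return timedelta(days=2)    # less than 1 hour: 2 days
    # More than 1 hour: 1 week. Assumption: exactly 1 hour is treated the same.
    return timedelta(weeks=1)
```

For example, a router upgrade expected to cut traffic for 30 minutes would need to be announced at least 2 days ahead.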

A maintenance put into the TTS and broadcast is silently accepted after one day (2 hours if it has no impact). All delays are expressed in working hours, 09:00 to 16:00 UTC. All days are considered working days. Emergency maintenances are allowed but should not become common.

Even maintenances without impact should be put in the TTS (at-risk maintenance, for instance).

L2 maintenance management process

The sources of L2 maintenance are L2 network providers (optical transmitter to be changed, fibre physically rerouted, fibre to be cleaned...).

Often there will be no negotiation phase for L2 maintenance with L2 network providers. But if an event is really disruptive, negotiation should be attempted.

1.1 The L2NOC sends its maintenance notice to connected or affected router operators. The first router operator notified starts this process.
1.2 The router operator warns its Grid data contact (and may check with them that the date is suitable)
1.3 The router operator may check with distant affected sites - off the record - that the date is suitable
1.4 If a disruptive overlapping event is found, we should try to negotiate another date with the network provider and restart at step 1.1. Otherwise the maintenance is posted in the LHCOPN TTS by the router operator.

2 All impacted sites are notified.

3 The maintenance is performed and the LHCOPN TT is updated. Updates are broadcast to all impacted sites. The process ends when the LHCOPN TT is closed.

Handling Multi Hop troubles

Problem example:

Site 1 is unable to reach site 3 but is able to reach site 2
Site 2 is able to reach site 3

Proposed handling:

L3 problem assigned by site 1 to site 3
If no resolution, site 1 reassigns it to site 2

Benefits:

Only one ticket is kept per trouble, enabling serialisation of trouble resolution
Responsibility for the problem is transferred with the re-assignment of the ticket
Initiator follows trouble
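The proposed handling can be sketched with a single ticket object that is re-assigned between sites. The `Ticket` class and its fields are hypothetical and do not reflect the actual GGUS data model; the point is only that responsibility follows the assignment while the initiator stays the same.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ticket:
    """Hypothetical single ticket for one multi-hop trouble."""
    initiator: str                  # site that opened and follows the ticket
    assigned_to: str                # site currently responsible
    history: List[str] = field(default_factory=list)

    def reassign(self, site: str) -> None:
        """Transfer responsibility by re-assigning the same ticket."""
        self.history.append(self.assigned_to)
        self.assigned_to = site


# Site 1 cannot reach site 3: it assigns the L3 problem to site 3.
ticket = Ticket(initiator="site 1", assigned_to="site 3")
# No resolution: site 1 re-assigns the same ticket to site 2.
ticket.reassign("site 2")
```

After these two steps there is still one ticket, site 2 is responsible for it, and site 1 remains the initiator following the trouble.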

Responsibilities

Outages on links between T0 and T1 are the responsibility of the T1s (who ordered the link).
Responsibility for outages on T1-T1 links is being studied (it should be mapped from existing contracts by studying the cost model: who pays what, where).
Responsibility for a GGUS ticket lies with the site to which the ticket is assigned.

LHCOPN Operational Working group

Contact

This page is maintained by the LHCOPN Operational Working group, which can be reached at project-lhcopn-opswg@cernNOSPAMPLEASE.ch .


Topic revision: r23 - 2010-11-24 - GuillaumeCessieux
