updates to incident managent documenation - added summary
SimonPPledger committed Dec 12, 2024
1 parent 3bde826 commit 025d5ae
Showing 1 changed file with 27 additions and 4 deletions.
31 changes: 27 additions & 4 deletions source/runbooks/manage-an-incident.html.md.erb
@@ -1,7 +1,7 @@
---
owner_slack: "#modernisation-platform"
title: Incident Process
-last_reviewed_on: 2024-09-27
+last_reviewed_on: 2024-12-11
review_in: 3 months
---

@@ -20,6 +20,18 @@ This document describes our incident handling process for network, operations an

> This document assumes the user has a PagerDuty account and is part of the Modernisation Platform team. To use the in-Slack features, the user must authorise the PagerDuty app in Slack.

## Overview

For ease of use, the key steps are summarised in this overview, with further details in the sections below.

### 1. Confirm that the event constitutes an incident

> If this incident constitutes a security breach, you must also report it [here](https://goo.gl/forms/frsB1h8AGv3Zefwq2)

### 2. Declare the incident – create a PagerDuty incident (ensuring it is on the external status page) and a Slack channel

### 3. Agree roles – Incident Lead and Scribe are mandatory roles

### 4. Communicate – how and who

### 5. Fix the problem

### 6. Post-incident procedure

Detailed steps follow below.

## 1. Confirm that the event constitutes an incident

> If this incident constitutes a security breach, you must also report it [here](https://goo.gl/forms/frsB1h8AGv3Zefwq2)
@@ -61,6 +73,7 @@ Declaring the incident will launch a form for you to complete, see an example of
|P1|The whole platform is down or unavailable, all user applications are unavailable|
|P2|Part of the platform is down or unavailable, some user applications are impacted or unavailable|
|P3|Part of the platform is down or unavailable, user applications are still available|

We only use P1, P2 or P3.
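
As a rough illustration of how the priority definitions above could be applied, the sketch below maps two yes/no questions to a priority. The function name `classify_priority` and its parameters are hypothetical and are not part of PagerDuty or of this runbook; the final decision remains a judgement call for the person declaring the incident.

```python
from enum import Enum


class Priority(Enum):
    """The only priorities used for Modernisation Platform incidents."""
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


def classify_priority(whole_platform_down: bool, user_apps_impacted: bool) -> Priority:
    """Hypothetical helper mirroring the priority table.

    P1: the whole platform is down, so all user applications are unavailable.
    P2: part of the platform is down and some user applications are impacted.
    P3: part of the platform is down but user applications are still available.
    """
    if whole_platform_down:
        return Priority.P1
    if user_apps_impacted:
        return Priority.P2
    return Priority.P3
```

For example, a partial outage that leaves user applications available would be classified as `Priority.P3` under these assumptions.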

Tick the **Create a dedicated Public Slack channel for this incident** box. You should use this channel to manage the incident.

@@ -87,6 +100,7 @@ To fill these roles, ask for volunteers from the team, either verbally or via #m
* coordinate our response to the incident
* decide on any additional roles required (e.g. a communications lead may be required)
* ensure that all required roles are filled
* if no communications lead is required, the Incident Lead is responsible for communicating
* ensure that all tasks which need to be handled are being done
* make the final decision whenever we need to choose a course of action
* set the schedule for any regular team check-ins, if those are deemed necessary
@@ -113,7 +127,16 @@ The scribe is responsible for keeping a log of the incident, including:

## 4. Managing the incident

-### 4.1 Recording notes
+### 4.1 Communicating

As well as communicating with our user base, inform the following people of any P1 or P2 incident:

* Head of Platforms and Architecture

* Head of Hosting

* Product Manager/Delivery Manager
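
The notification rule above can be sketched as a small lookup. This is an illustration only: `people_to_inform` is a hypothetical helper, and the assumption that a P3 incident needs communication with the user base alone is inferred from the wording above rather than stated in it.

```python
# Stakeholders named in this section for P1 and P2 incidents.
STAKEHOLDERS = (
    "Head of Platforms and Architecture",
    "Head of Hosting",
    "Product Manager/Delivery Manager",
)

VALID_PRIORITIES = {"P1", "P2", "P3"}


def people_to_inform(priority: str) -> list[str]:
    """Return who should be informed for an incident of the given priority."""
    if priority not in VALID_PRIORITIES:
        # Only P1, P2 and P3 are used by the team.
        raise ValueError(f"Unsupported priority: {priority!r}")
    audience = ["User base"]  # assumption: users are informed at every priority
    if priority in {"P1", "P2"}:
        audience.extend(STAKEHOLDERS)
    return audience
```

Calling `people_to_inform("P2")` returns the user base followed by the three stakeholders listed above.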

### 4.2 Recording notes

Notes on the incident can be added from Slack using the **Add a Note** action on the incident post in the incident's dedicated Slack channel (see below)

@@ -127,7 +150,7 @@ or via PagerDuty Incident page by clicking **+ Add Note** button (see below)
>
> Similarly, these features will not be available in the incident page.

-### 4.2 Updating the External Status on PagerDuty
+### 4.3 Updating the External Status on PagerDuty

When an incident is raised, an update will become pending on the Modernisation Platform external status page (the internal status will be automatically updated).

@@ -139,7 +162,7 @@ Fill in the details for the update and publish it, this will also post an update

It is important to use the External Status page as this feeds into other services in PagerDuty that are dependent on the Modernisation Platform.

-### 4.3 Transferring roles
+### 4.4 Transferring roles

It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.
