Commit

Merge pull request #8788 from ministryofjustice/doc/incident-mgt
Updated to manage-an-incident doc
dms1981 authored Dec 27, 2024
2 parents 844a3d2 + 8c038ad commit 71c2e77
Showing 1 changed file with 32 additions and 5 deletions.
37 changes: 32 additions & 5 deletions source/runbooks/manage-an-incident.html.md.erb
@@ -1,7 +1,7 @@
---
owner_slack: "#modernisation-platform"
title: Incident Process
last_reviewed_on: 2024-09-27
last_reviewed_on: 2024-12-18
review_in: 3 months
---

@@ -20,6 +20,21 @@ This document describes our incident handling process for network, operations an

> This document assumes the user has a PagerDuty account and is part of the Modernisation Platform team. To use the in-Slack features, the user must authorise the PagerDuty app in Slack.

## Overview:
For ease of use, the key steps are summarised in this overview, with further details in the sections below.
### 1. Confirm that the event constitutes an incident
> If this incident constitutes a security breach, you must also report it [here](https://goo.gl/forms/frsB1h8AGv3Zefwq2)

### 2. Declare the incident
Create a PagerDuty incident (ensuring it appears on the external status page) and a Slack channel.
### 3. Assign roles
Incident Lead and Scribe are mandatory roles.
### 4. Managing the incident
This includes communication and who to tell.
### 5. Fixing the problem
### 6. Post-incident procedure

Detailed steps follow below:
## 1. Confirm that the event constitutes an incident

> If this incident constitutes a security breach, you must also report it [here](https://goo.gl/forms/frsB1h8AGv3Zefwq2)
@@ -56,12 +71,15 @@ Declaring the incident will launch a form for you to complete, see an example of

#### Which Priority to assign?

We only use P1, P2 or P3.

| **Priority** | Description |
|---|---|
|P1|The whole platform is down or unavailable, all user applications are unavailable|
|P2|Part of the platform is down or unavailable, some user applications are impacted or unavailable|
|P3|Part of the platform is down or unavailable, user applications are still available|
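
The table above can be read as a simple decision rule. The following is a minimal sketch in Python, assuming two yes/no questions are enough to pick a priority. The names `Priority` and `choose_priority` are hypothetical and are not part of PagerDuty or any Modernisation Platform tooling.

```python
from enum import Enum


class Priority(Enum):
    """The only three priorities used in this process."""
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


def choose_priority(whole_platform_down: bool, user_apps_impacted: bool) -> Priority:
    """Illustrative mapping of the priority table to a decision rule.

    whole_platform_down: the whole platform is down or unavailable.
    user_apps_impacted: at least some user applications are impacted or unavailable.
    """
    if whole_platform_down:
        # Whole platform down, so all user applications are unavailable.
        return Priority.P1
    if user_apps_impacted:
        # Part of the platform is down and some user applications are affected.
        return Priority.P2
    # Part of the platform is down but user applications are still available.
    return Priority.P3
```

For example, a partial outage that leaves user applications available would map to `Priority.P3` under these assumptions. The person declaring the incident still makes the final judgement.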


Tick the **Create a dedicated Public Slack channel for this incident** box. You should use this channel to manage the incident.

Click the **Create** button once you are happy with the form. This will create a new incident in PagerDuty and update the PagerDuty Status Page. The PagerDuty Status Page Slack integration will then automatically post the status in the [ask-modernisation-platform](https://mojdt.slack.com/archives/C01A7QK5VM1) and [modernisation-platform-update](https://mojdt.slack.com/archives/C02L5MCJ12N) channels.
@@ -87,6 +105,7 @@ To fill these roles, ask for volunteers from the team, either verbally or via #m
* coordinate our response to the incident
* decide on any additional roles required (e.g. a communications lead may be required)
* ensure that all required roles are filled
* if no communications lead is required, take responsibility for communicating
* ensure that all tasks which need to be handled are being done
* make the final decision whenever we need to choose a course of action
* set the schedule for any regular team check-ins, if those are deemed necessary
@@ -113,7 +132,16 @@ The scribe is responsible for keeping a log of the incident, including:

## 4. Managing the incident

### 4.1 Recording notes
### 4.1 Communicating
As well as communicating with our user base, inform the following people for a P1 or P2 incident:

* Head of Platforms and Architecture

* Head of Hosting

* Product Manager/Delivery Manager
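
The communication rule above can be sketched as a small lookup. This is only an illustration, assuming priorities are passed as the strings `"P1"`, `"P2"` or `"P3"`. The function name `people_to_inform` and the label `"User base"` are hypothetical and do not correspond to any existing tooling.

```python
# Roles listed in this section that must be informed for a P1 or P2.
SENIOR_STAKEHOLDERS = (
    "Head of Platforms and Architecture",
    "Head of Hosting",
    "Product Manager/Delivery Manager",
)

VALID_PRIORITIES = {"P1", "P2", "P3"}


def people_to_inform(priority: str) -> list[str]:
    """Return who should be told about an incident of the given priority."""
    if priority not in VALID_PRIORITIES:
        raise ValueError(f"Unknown priority: {priority!r}")
    # The user base is always communicated with, whatever the priority.
    audience = ["User base"]
    if priority in {"P1", "P2"}:
        audience.extend(SENIOR_STAKEHOLDERS)
    return audience
```

Under these assumptions, `people_to_inform("P3")` returns only the user base, while a P1 or P2 also includes the three roles listed above.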

### 4.2 Recording notes

Entries on the incident can be created from Slack using the **Add a Note** action on the Incident Post in the dedicated incident Slack channel (see below)

@@ -127,7 +155,7 @@ or via PagerDuty Incident page by clicking **+ Add Note** button (see below)
>
> Similarly, these features will not be available in the incident page.

### 4.2 Updating the External Status on PagerDuty
### 4.3 Updating the External Status on PagerDuty

When an incident is raised, an update will become pending on the Modernisation Platform external status page (the internal status will be automatically updated).

@@ -139,8 +167,7 @@ Fill in the details for the update and publish it, this will also post an update

It is important to use the External Status page, as this feeds into other services in PagerDuty that depend on the Modernisation Platform.

### 4.3 Transferring roles

### 4.4 Transferring roles
It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.

Whoever assumes a role should announce it in the incident Slack channel (or thread if the channel was not created), so that the team is aware.