The first step in any incident response process is to determine what actually constitutes an incident. Incidents are then classified by priority as Critical, High, Medium, or Low. Operational issues can be classified at any of these levels, and in general you can take riskier actions to resolve a higher-severity issue. Anything above Medium (that is, High or Critical) is automatically considered a "major incident" and gets a more intensive response than a normal incident.

!!! note "Always Assume The Worst"
    If you are unsure which level an incident is (e.g. not sure if High or Critical), treat it as the higher one. An incident in progress is not the time to discuss or litigate severities. Assume the highest level and review it during the post-mortem.

!!! question "Can a Medium be a major incident?"
    All High priorities are major incidents, but not all major incidents need to be classified as High. If you require a co-ordinated response, even for lower-severity issues, then trigger our incident response process. The Incident Commander (IC) can determine whether full incident response is necessary.
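
The two rules above (pick the higher level when unsure, and treat anything above Medium as a major incident) can be sketched in a few lines of Python. This is only an illustration, not part of our tooling: the names `Priority`, `assume_the_worst`, and `is_major_incident` are hypothetical, and the `needs_coordination` flag stands in for the IC's judgment call on lower-priority issues.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Priority levels from this page; a larger value is more severe."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def assume_the_worst(*candidates: Priority) -> Priority:
    """When unsure between levels, pick the most severe candidate."""
    if not candidates:
        raise ValueError("at least one candidate priority is required")
    return max(candidates)


def is_major_incident(priority: Priority, needs_coordination: bool = False) -> bool:
    """High and Critical are always major (assumption: 'above Medium').

    A lower priority becomes major only when a co-ordinated response
    is requested, which this sketch models as a simple boolean.
    """
    return priority > Priority.MEDIUM or needs_coordination


if __name__ == "__main__":
    chosen = assume_the_worst(Priority.HIGH, Priority.CRITICAL)
    print(chosen.name, is_major_incident(chosen))
    print(is_major_incident(Priority.LOW, needs_coordination=True))
```

Using an ordered enum keeps the "higher one wins" rule a plain `max()` call, so nobody has to argue about ordering during an incident.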

<table>
  <thead>
    <tr>
      <th>Priority</th>
      <th>Description</th>
      <th>Typical Response</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Critical</td>
      <td>
        <p>Critical issue that warrants customer or public notification.</p>
        <ul>
          <li>The system is in a critical state or down.</li>
          <li>Service issues (e.g. Apache, Nginx, MySQL, disk) that are affecting customer applications or websites.</li>
          <li>A security vulnerability that exposes customer data has come to our attention.</li>
          <li>MNX.io hypervisor or portal issue.</li>
        </ul>
      </td>
      <td>
        <p>Major incident response.</p>
        <ul>
          <li>See During an Incident.</li>
          <li>Open tickets with customers.</li>
          <li>Notify the team in the Slack #general channel.</li>
          <li>Issue a public notification if the incident is MNX.io related and involves one or more customers.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td>High</td>
      <td>
        <p>Critical system issue actively impacting many customers' ability to use the product.</p>
        <ul>
          <li>The system is in a vulnerable state.</li>
          <li>Service is not yet directly impacted but could be soon (e.g. a disk filling up that could cause a service outage if not handled).</li>
          <li>Cron jobs are not executing, which could cause issues later in providing responses.</li>
          <li>Any other event that an MNX Solutions employee deems to require incident response.</li>
        </ul>
      </td>
      <td>
        <p>High-Urgency Page.</p>
        <ul>
          <li>See During an Incident.</li>
          <li>Open tickets with customers.</li>
          <li>Notify the team in the Slack #general channel.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td colspan="3">Anything above this line is considered a "Major Incident" and will page our on-call person. Our incident response process should be triggered for any major incident.</td>
    </tr>
    <tr>
      <td>Medium</td>
      <td>
        <p>Stability or minor customer-impacting issues that require attention from ticket owners.</p>
        <ul>
          <li>Ad hoc or project-based work that must be handled within a set time period to prevent a major issue.</li>
          <li>SSL certificates that will expire in the coming weeks.</li>
          <li>Anything likely to become High priority if nothing is done (e.g. logrotate not configured properly).</li>
        </ul>
      </td>
      <td>
        <p>Medium Ticket Creation.</p>
        <ul>
          <li>Work on these after you have completed all tasks with higher priorities.</li>
          <li>Work on tickets/issues as assigned.</li>
          <li>Keep in contact with the customer and notify them of any impending issues or delays.</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td>Low</td>
      <td>
        <p>Minor issues requiring action, but not affecting customer ability to use the product.</p>
        <ul>
          <li>Ad hoc or project-based work.</li>
          <li>Migration projects for sites or servers.</li>
          <li>Installing packages or reconfiguring configuration files and services.</li>
        </ul>
      </td>
      <td>
        <p>Low Ticket Creation.</p>
        <ul>
          <li>Work on these after you have completed all tasks with higher priorities.</li>
          <li>Work on tickets/issues as assigned.</li>
          <li>Keep in contact with the customer and notify them of any impending issues or delays.</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>
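
The "Typical Response" column of the table above can also be written down as a small lookup, which is handy if alert routing is ever automated. The sketch below is illustrative only: `RESPONSES` and `typical_response` are hypothetical names, and the `page_on_call` flag is our reading of the "Major Incident" dividing line, not an existing integration.

```python
# Typical responses transcribed from the table above.
# Each entry is (response label, whether the on-call person is paged).
RESPONSES = {
    "Critical": ("Major incident response", True),
    "High": ("High-Urgency Page", True),
    "Medium": ("Medium Ticket Creation", False),
    "Low": ("Low Ticket Creation", False),
}


def typical_response(priority: str) -> dict:
    """Return the typical response for a priority name.

    Raises KeyError for an unknown priority so that a typo is never
    silently treated as a low-urgency issue.
    """
    label, page_on_call = RESPONSES[priority]
    return {"response": label, "page_on_call": page_on_call}


if __name__ == "__main__":
    for name in RESPONSES:
        print(name, typical_response(name))
```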

!!! note "Be Specific"
    These priority descriptions have been changed from the PagerDuty internal definitions to be more generic. For your own documentation, you are encouraged to make your definitions very specific, usually referring to a % of users/accounts affected. You will usually want your severity definitions to be metric-driven.
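
As one way to picture a metric-driven definition, the sketch below maps a percentage of affected accounts to a priority. The threshold values are invented for illustration and are not our actual definitions. `THRESHOLDS` and `priority_for` are hypothetical names, and you should substitute numbers that fit your own service.

```python
# Hypothetical thresholds: (minimum % of accounts affected, priority name),
# ordered from most to least severe. Tune these to your own service.
THRESHOLDS = [
    (25.0, "Critical"),
    (5.0, "High"),
    (1.0, "Medium"),
    (0.0, "Low"),
]


def priority_for(percent_affected: float) -> str:
    """Return the first priority whose minimum the metric meets."""
    if not 0.0 <= percent_affected <= 100.0:
        raise ValueError("percent_affected must be between 0 and 100")
    for minimum, name in THRESHOLDS:
        if percent_affected >= minimum:
            return name
    return "Low"  # not reached while the last threshold is 0.0


if __name__ == "__main__":
    for sample in (0.2, 1.0, 7.5, 40.0):
        print(sample, priority_for(sample))
```

Because the list is checked from most to least severe, a value that sits exactly on a boundary resolves to the higher priority, which matches the "assume the worst" guidance.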