Add documentation for Service Level Objectives #5542

Merged
merged 17 commits into from
Oct 16, 2019
62 changes: 18 additions & 44 deletions config/_default/menus/menus.en.yaml
@@ -664,57 +664,31 @@ main:
weight: 6
parent: "alerting"

## MONITORS - SLOs

- name: "Service Level Objectives"
url: "monitors/service_level_objectives/"
weight: 7
parent: "alerting"
identifier: "slos"

- name: "Monitor SLO"
url: "monitors/service_level_objectives/monitor/"
parent: "slos"
weight: 7.1

- name: "Event SLO"
url: "monitors/service_level_objectives/event/"
parent: "slos"
weight: 7.2

## MONITORS - Guides

- name: "Guides"
url: "monitors/guide/"
weight: 100
parent: "alerting"


##############################
## SERVICE LEVEL OBJECTIVES ##
##############################

- name: "Service Level Objectives"
url: "service_level_objectives/"
pre: "nav_slo"
identifier: "slo"
weight: 75000

## SLO - Monitor Types

- name: "Service Level Objectives"
url: "service_level_objectives/slo_types/"
parent: "slo"
identifier: "slo_types"
weight: 1
- name: "Monitor"
url: "service_level_objectives/slo_types/monitor/"
parent: "slo_types"
weight: 101
- name: "Metric"
url: "service_level_objectives/slo_types/metric/"
parent: "slo_types"
weight: 102

## SLO
- name: "List Service Level Objectives"
url: "service_level_objectives/list_slos/"
weight: 2
parent: "slo"

- name: "Service Level Objective Status"
url: "service_level_objectives/slo_status/"
weight: 3
parent: "slo"

- name: "Service Level Objective Dashboard Widget"
url: "service_level_objectives/widget/"
weight: 4
parent: "slo"


#################
## TRACING/APM ##
#################
96 changes: 96 additions & 0 deletions content/en/monitors/service_level_objectives/_index.md
@@ -0,0 +1,96 @@
---
title: Service Level Objectives
kind: documentation
description: "Track the status of your SLOs"
disable_toc: true
aliases:
- /monitors/monitor_uptime_widget/
- /monitors/slos/
further_reading:
- link: "https://www.datadoghq.com/blog/slo-monitoring-widget/"
tag: "Blog"
text: "Track the status of your SLOs with the new monitor uptime widget"
---

## Overview

Service Level Objectives, or SLOs, are a key part of the site reliability engineering toolkit. SLOs provide a
framework for defining clear targets around application performance, which ultimately help teams provide a consistent
customer experience, balance feature development with platform stability, and improve communication with internal and
external users.

## Setup

Use the SLO and uptime widget to track your SLOs (Service Level Objectives) and uptime on screenboards and timeboards. To use an SLO, add the widget to a dashboard, or go to Datadog’s [Service Level Objectives page][1] to create new SLOs and view existing ones. Then select an existing SLO from the dropdown to display it on any dashboard.

*Uptime* is defined as the amount of time a monitor was in an *up* state (OK) compared to a *down* state (non-OK). The status is represented in bars as green (up) and red (down). Example: `99% of the time, latency is less than 200ms.`

You can also track success rate and event-based SLIs (Service Level Indicators). Example: `99% of requests are successful.`

{{< img src="monitors/slo/create-slo.png" alt="create a slo" responsive="true" >}}

### Configuration

1. On the [SLO page][1], select **New SLO +**.
2. Define the source for your SLO. SLO types are [Event based][6] and [Monitor based][5].
3. Set your target uptime. Available windows are: 7 days, month-to-date, 30 days (rolling), previous month, and 90 days (rolling). For 7 days, the widget is restricted to two decimal places. For 30 days and up, it’s restricted to two to three decimal places.
4. Finally, give the SLO a title and save it.

Once you have monitors set up, on the [main Service Level Objectives page][1], you can view either the overall uptime percentage alone, or the overall percentage plus the uptime for each monitor.

{{< img src="monitors/slo/slo-overview.png" alt="slo main page" responsive="true" >}}

## Edit an SLO

To edit an SLO, hover over the SLO on the right, and click the edit pencil icon.

## Searching SLOs

The [List Service Level Objectives][1] page lets you run an advanced search of all SLOs so you can view, delete or edit service tags for selected SLOs in bulk. You can also clone or fully edit any individual SLO in the search results.

{{< img src="service_level_objectives/edit_slo/edit_slo_page.png" alt="edit slo page" responsive="true" >}}

Advanced search lets you query SLOs by any combination of SLO attributes:

* `name` and `description` - text search
* `time window` - *, 7d, 30d, 90d
* `type` - metric, monitor
* `creator`
* `id`
* `service` - tags
* `team` - tags
* `env` - tags

To run a search, use the checkboxes on the left and the search bar. When you check the boxes, the search bar updates with the equivalent query. Likewise, when you modify the search bar query (or write one from scratch), the checkboxes update to reflect the change. Query results update in real-time as you edit the query; there's no 'Search' button to click.

To edit an individual SLO, hover over it and use the buttons to the far right in its row: Edit, Clone, Delete. To see more detail on an SLO, click its table row to visit its status page.

{{< img src="service_level_objectives/edit_slo/edit-slo-hover-clone.png" alt="edit-slo-hover-clone" responsive="true" style="width:80%;" >}}

### SLO Tags

{{< img src="service_level_objectives/edit_slo/slo-tags.png" alt="Monitor tags" responsive="true" style="width:30%;" >}}

You can add tags directly to your SLOs for filtering on the [list SLOs][4] page.

## View your SLOs

From the [SLO status page][4], you can view and edit your SLO and its properties, and see its status over time and its history.

{{< img src="service_level_objectives/slo_status/status_slo_history.mp4" alt="status slo history" video="true" responsive="true" width="80%" >}}

## SLO Widgets

After creating your SLO, you can use the SLO dashboard widget to visualize the status of your SLOs along with your dashboard metrics, logs and APM data. For more information about SLO Widgets, see the [SLO Widgets documentation][7] page.

## Further Reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://app.datadoghq.com/slo/new
[2]: /api/#servicelevelobjectives
[3]: /developers/libraries/#managing-service-level-objectives
[4]: https://app.datadoghq.com/slo
[5]: /monitors/service_level_objectives/monitor/
[6]: /monitors/service_level_objectives/event/
[7]: /graphing/widgets/slo/
@@ -1,7 +1,7 @@
---
title: Metric based SLO
title: Event based SLO
kind: documentation
description: "Use Metrics to define the Service Level Objective"
description: "Use metrics to define the Service Level Objective"
further_reading:
- link: "metrics"
tag: "Documentation"
@@ -10,44 +10,40 @@ further_reading:

## Overview

Metric based SLOs are useful for a count-based stream of data where you are differentiating good and/or bad events.
Using the sum of the good events divided by the sum of total events over time provides a Service Level Indicator (or SLI).
Event or metric based SLOs are useful for a count-based stream of data where you are differentiating good and bad events.
The sum of the good events divided by the sum of total events over time gives a Service Level Indicator (or SLI).

## Service Level Objective creation
## Setup

To create a [metric SLO][1] in Datadog, use the main navigation: *Monitors --> New Service Level Objective --> Event Based*.
On the [SLO page][1], select **New SLO +**. Then select **Event**.

### Define the source (SLI)
### Configuration

{{< tabs >}}
{{% tab "Event Based" %}}
#### Define queries

There are 2 queries to define. The first query defines the sum of the good events, while the second query defines the sum of
There are two queries to define. The first query defines the sum of the good events, while the second query defines the sum of
the total events.

It is only recommended to use the `sum by` aggregator and to add all events.
Datadog recommends using the `sum by` aggregator and adding all events.

Example: If tracking HTTP return codes, and your metric includes a tag like `code:2xx` || `code:3xx` || `code:4xx`.
**Example:** Suppose you are tracking HTTP return codes and your metric includes a tag like `code:2xx`, `code:3xx`, or `code:4xx`.
The sum of good events would be `sum:httpservice.hits{code:2xx} + sum:httpservice.hits{code:4xx}`. And the `total` events
would be `sum:httpservice.hits{!code:3xx}`.

Why exclude `HTTP 3xx`? These are typically redirects and should not count for or against the SLI, but other non-3xx
error codes should. The `total` query counts all types except `HTTP 3xx`, while the numerator (the good events query)
counts only `OK` type status codes.
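
To make the arithmetic concrete, here is a minimal Python sketch of the ratio the two queries above describe. It is illustrative only and does not call the Datadog API: the function name `event_sli`, the dictionary of per-tag counts, and the sample numbers are all assumptions made for this example.

```python
def event_sli(counts):
    """Return the SLI as a percentage from per-tag event counts.

    `counts` maps a status-class tag ("2xx", "3xx", ...) to a number of hits.
    This mirrors the example queries: good = 2xx + 4xx, total = everything
    except 3xx. Returns None when there are no countable events.
    """
    good = counts.get("2xx", 0) + counts.get("4xx", 0)
    total = sum(n for code, n in counts.items() if code != "3xx")
    if total == 0:
        return None  # no events in the window, so the SLI is undefined
    return 100.0 * good / total


# Hypothetical counts for one time window (not real data).
hits = {"2xx": 9500, "3xx": 300, "4xx": 400, "5xx": 100}
print(event_sli(hits))
```

With these sample counts, the 300 redirects are ignored entirely, so the SLI is 9,900 good events out of 10,000 counted events.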

{{% /tab %}}
{{% tab "Set your targets" %}}
#### Set your targets

Setting your targets for your SLI's is an important step, this is the value that you are aiming for or better.
SLO targets are the stat you use to measure uptime success.

First, select your target value. For example: `95% of all HTTP requests should be "good" over the last 7 days`.

You can optionally include a warning value that is greater than the target value to indicate when you are approaching
an SLO breach.


{{% /tab %}}
{{% tab "Identify this indicator" %}}
#### Identify the indicator

Here we add contextual information about the purpose of the SLO, including any related information
in the description and tags you would like to associate with the SLO.
60 changes: 60 additions & 0 deletions content/en/monitors/service_level_objectives/monitor.md
@@ -0,0 +1,60 @@
---
title: Monitor SLO
kind: documentation
description: "Use Monitors to define the Service Level Objective"
further_reading:
- link: "monitors"
tag: "Documentation"
text: "More information about Monitors"
---

## Overview

Select a monitor based source if you want to build your SLO based on existing or new Datadog monitors. For more information about monitors, see the [Monitor documentation][1]. Monitor based SLOs are useful for a time-based stream of data where you are differentiating time of good behavior vs bad behavior.
Using the sum of the good time divided by the sum of total time provides a Service Level Indicator (or SLI).
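
As a rough illustration of the time-based calculation, the following Python sketch divides good time by total time. It assumes the monitor history is available as a list of `(state, seconds)` pairs in which only the `"OK"` state counts as good time; that data shape and the name `monitor_sli` are hypothetical and are not part of any Datadog API.

```python
def monitor_sli(intervals):
    """Return the time-based SLI as a percentage.

    `intervals` is a list of (state, seconds) pairs. Time spent in the
    "OK" state is good time; every other state counts against the SLI.
    Returns None when the history covers no time at all.
    """
    total = sum(seconds for _, seconds in intervals)
    if total == 0:
        return None
    good = sum(seconds for state, seconds in intervals if state == "OK")
    return 100.0 * good / total


# Hypothetical 7-day history with 3,024 seconds (about 50 minutes) of alerting.
week = 7 * 24 * 3600
history = [("OK", week - 3024), ("ALERT", 3024)]
print(monitor_sli(history))
```

In this made-up history the monitor spent 0.5% of the week in a non-OK state, so the SLI for the window is 99.5%.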

## Setup

On the [SLO page][2], select **New SLO +**. Then select **Monitor**.

### Configuration

#### Define queries

To start, you need to be using Datadog monitors. To set up a new monitor for your SLO, go to the [monitor page][3]. Search for monitors by name and click a monitor to add it to the source list. An example SLO on a monitor: the latency of all user requests should be less than 250ms 99% of the time in any 30-day window. To set this up, do one of the following:

1. Select a single monitor, or
2. Select multiple monitors (up to 20), or
3. Select a single multi-alert monitor and select specific monitor groups (up to 20) to be included in the SLO calculation.

**Supported monitor types**:

- metric monitor types - including metric, anomaly, APM, forecast, outlier, and integration metrics
- service checks
- synthetics

**Example:** You might be tracking the uptime of a physical device. You have already configured a metric monitor on `host:foo` using
a custom metric. This monitor might also ping your on-call team if the host is no longer reachable. To avoid burnout, you want to
track how often this host is down.

#### Set your targets

SLO targets are the stat you use to measure uptime success.

First, select your target value. For example: `95% of all HTTP requests should be "good" over the last 7 days`.

You can optionally include a warning value that is greater than the target value to indicate when you are approaching
an SLO breach.
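
The relationship between the target and the optional warning value can be sketched in a few lines of Python. This is an illustration of the rule described above, not Datadog's implementation: the function name `slo_status` and the labels `"ok"`, `"warning"`, and `"breached"` are assumptions chosen for this example.

```python
def slo_status(sli, target, warning=None):
    """Classify an SLI percentage against a target and optional warning value.

    The warning value must be greater than the target, so that it triggers
    while the SLI is still above the target but getting close to it.
    """
    if warning is not None and warning <= target:
        raise ValueError("warning value must be greater than the target value")
    if sli < target:
        return "breached"  # below the value you are aiming for
    if warning is not None and sli < warning:
        return "warning"   # target still met, but approaching a breach
    return "ok"


# Hypothetical values: 95% target with a 97% warning threshold.
print(slo_status(99.5, 95.0, 97.0))
print(slo_status(96.0, 95.0, 97.0))
print(slo_status(94.9, 95.0, 97.0))
```

An SLI of 96% meets the 95% target but falls under the 97% warning value, which is the "approaching a breach" case the warning is meant to flag.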

#### Identify this indicator

Here we add contextual information about the purpose of the SLO, including any related information
in the description and tags you would like to associate with the SLO.

## Further Reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: /monitors
[2]: https://app.datadoghq.com/slo/new/monitor
[3]: https://app.datadoghq.com/monitors#create/metric
35 changes: 0 additions & 35 deletions content/en/service_level_objectives/_index.md

This file was deleted.

47 changes: 0 additions & 47 deletions content/en/service_level_objectives/list_slos.md

This file was deleted.
