-
Notifications
You must be signed in to change notification settings - Fork 7
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add docs about pipeline recovery (#155)
* Add docs about pipeline recovery * update * explain max-retries-window * explain max-retries-window * note about default behavior, correct page position * how to disable * small fix * edit: Pipeline recovery page and update conduit architecture (#157) * update architecture svg * small edits * typo * remove link --------- Co-authored-by: Raúl Barroso <ra.barroso@gmail.com>
- Loading branch information
Showing
10 changed files
with
207 additions
and
7 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,199 @@ | ||
--- | ||
title: 'Pipeline Recovery' | ||
sidebar_position: 5 | ||
--- | ||
|
||
Pipeline Recovery is a feature in Conduit through which pipelines that | ||
experience certain types of errors are automatically restarted. This document | ||
describes how Pipeline Recovery in Conduit works and how it can be configured. | ||
|
||
:::note | ||
|
||
Pipeline Recovery is enabled by default since [Conduit v0.12](https://github.com/ConduitIO/conduit/releases/tag/v0.12.0). The feature can | ||
be [disabled](#disabling-pipeline-recovery) if needed. | ||
|
||
::: | ||
|
||
## Introduction | ||
|
||
Most pipeline errors encountered are a result of temporary issues like network | ||
interruptions or services being unavailable due to maintenance. It then becomes | ||
a matter of how we handle the pipeline. | ||
|
||
In most cases, simply retrying is enough to get through transient errors | ||
efficiently. This can and should be done by connectors and processors, but that | ||
may not always be the case. For Conduit users, this typically means they would | ||
need to wait for the connector or processor to be updated. This implies that we | ||
need to have in Conduit an automatic mechanism for restarting pipelines that | ||
experienced an error. | ||
|
||
## What triggers pipeline recovery | ||
|
||
Any _non-fatal_ error can trigger pipeline recovery. In the context of | ||
pipelines, we differentiate between two types of errors: fatal and non-fatal. | ||
|
||
**Fatal errors** are the errors that a pipeline cannot recover from. Two types | ||
of errors are fatal by default: | ||
|
||
1. [DLQ threshold](/docs/features/dead-letter-queue) exceeded (because the purpose of the DLQ threshold is to stop a | ||
pipeline if too many records are nack-ed) | ||
2. Processor errors (because processors are usually deterministic, so if a | ||
processor failed processing a record once, it will most likely fail | ||
processing a record again) | ||
|
||
:::info | ||
In one of the next versions of Conduit and the Connector SDK, we'll make it | ||
possible for connectors to define what errors they consider as fatal. | ||
::: | ||
|
||
**Non-fatal errors** are all the other errors, i.e. errors for which it makes | ||
sense to restart a pipeline and retry the data streaming. | ||
|
||
## Algorithm | ||
|
||
Pipeline Recovery restarts a pipeline using a linear backoff algorithm. One | ||
important addition is that Conduit tracks the number of retries over a period of | ||
time configured by the option `max-retries-window`. | ||
|
||
If `max-retries` has been set, then Conduit will | ||
make sure that there are at most `max-retries` over **any** `max-retries-window` | ||
period of time. In other words, the number of restart attempts is reset to 0 | ||
after `max-retries-window`. | ||
|
||
This way, `max-retries-window` makes sure that we can safely reset the attempts | ||
to 0 even in the following cases: | ||
|
||
1. When a pipeline has been running long enough. For example, if a pipeline | ||
failed `max-retries - 1` times months ago, then it's unexpected that it fails | ||
now because of one failure. | ||
2. When a pipeline is unstable and frequently failing. For example, a pipeline | ||
might experience an error and recover just before `max-retries` has been | ||
reached. It might run for some time more and then fail again. If we're not | ||
careful about resetting `max-retries`, then the pipeline might enter another | ||
recovery cycle very soon, which would be against what a user intended | ||
(limiting the number of retries). | ||
|
||
Here's a diagram of the algorithm: | ||
|
||
![Pipeline Recovery](/img/pipeline-recovery.png) | ||
|
||
## How to disable Pipeline Recovery | ||
|
||
To disable Pipeline Recovery, you should set the configuration parameter | ||
`pipelines.error-recovery.max-retries` to 0. With this, you'll have the | ||
pre-v0.12 behavior, where a pipeline stops and goes into the `degraded` state | ||
the first time it experiences an error. | ||
|
||
## Configuration | ||
|
||
### pipelines.error-recovery.min-delay | ||
|
||
- **Data Type**: [Duration](https://pkg.go.dev/time#ParseDuration) | ||
- **Required**: No | ||
- **Default**: 1s | ||
- **Description**: Minimum delay before restarting the pipeline after an error. | ||
|
||
### pipelines.error-recovery.max-delay | ||
|
||
- **Data Type**: [Duration](https://pkg.go.dev/time#ParseDuration) | ||
- **Required**: No | ||
- **Default**: 10m | ||
- **Description**: Maximum delay allowed before restarting the pipeline after an | ||
error. | ||
|
||
### pipelines.error-recovery.backoff-factor | ||
|
||
- **Data Type**: Integer | ||
- **Required**: No | ||
- **Default**: 2 | ||
- **Description**: Backoff factor applied to the last delay when recovering from | ||
errors. | ||
|
||
### pipelines.error-recovery.max-retries | ||
|
||
- **Data Type**: Integer | ||
- **Required**: No | ||
- **Default**: -1 | ||
- **Description**: Maximum number of retry attempts before the pipeline gives | ||
up. A value of `-1` indicates infinite retries. | ||
|
||
|
||
### pipelines.error-recovery.max-retries-window | ||
|
||
- **Data Type**: [Duration](https://pkg.go.dev/time#ParseDuration) | ||
- **Required**: No | ||
- **Default**: 5m | ||
- **Description**: No more than `max-retries` are allowed within any | ||
`max-retries-window` period of time. | ||
|
||
## Examples | ||
|
||
### Example 1: Setting a custom `max-delay` | ||
|
||
The `conduit.yaml` below shows how to set a custom maximum delay before a | ||
pipeline is restarted. By default, `max-delay` is 10 minutes and here we're | ||
setting it to 10 seconds. | ||
|
||
```yaml | ||
pipelines: | ||
error-recovery: | ||
max-delay: "10s" | ||
``` | ||
If a pipeline experiences an error, the delays between restarts will be as follows: | ||
* Attempt 1: 1s | ||
* Attempt 2: 2s | ||
* Attempt 3: 4s | ||
* Attempt 4: 8s | ||
* Attempt 5: 10s | ||
* Attempt 6: 10s | ||
* ... | ||
* Attempt N: 10s | ||
## Example 2: Setting a custom `max-retries` and `max-retries-window` | ||
|
||
The `conduit.yaml` below shows how to limit the number of pipeline restarts to | ||
3 (by default, the number of retries isn't limited). It also sets the | ||
`max-retries-window` to 10 minutes (default value is 5 minutes). | ||
|
||
```yaml | ||
pipelines: | ||
error-recovery: | ||
max-retries: "2" | ||
max-retries-window: "10m" | ||
``` | ||
|
||
Below, we'll explain different scenarios with failing pipelines using the same | ||
Conduit configuration. | ||
|
||
**Scenario 1: Maximum number of retries reached** | ||
|
||
**15:00** Pipeline starts. | ||
**15:10** Pipeline experiences an error (source system unavailable). | ||
**15:10** Pipeline is restarted (retry #1). | ||
|
||
**15:11** Pipeline experiences an error again. | ||
**15:11** Pipeline is restarted (retry #2). | ||
|
||
**15:15** Pipeline experiences an error again. | ||
**15:15** Pipeline has been restarted 2 times (`max-retries`) over the last 10 | ||
minutes (`max-retries-window`). No more retries are allowed, hence the pipeline stops | ||
and its status is set to `degraded`. | ||
|
||
**Scenario 2: Number of retries reset after some time** | ||
|
||
**15:00** Pipeline starts. | ||
|
||
**15:10** Pipeline experiences an error (source system unavailable). | ||
**15:10** Pipeline is restarted (retry #1). | ||
|
||
**15:11** Pipeline experiences an error again. | ||
**15:11** Pipeline is restarted (retry #2). | ||
|
||
**15:20** Internally, Conduit increments the number of available retries by 1, | ||
because 10 minutes (`max-retries-window`) since the first retry have passed. | ||
|
||
**15:35** Pipeline experiences an error again. | ||
**15:35** Pipeline is restarted because we didn't restart 2 times (`max-retries`) over the last 10 | ||
minutes (`max-retries-window`). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
--- | ||
sidebar_position: 0 | ||
hide_title: true | ||
title: 'Introduction' | ||
sidebar_label: "Introduction" | ||
slug: / | ||
--- | ||
|
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.