Handling errors of InfluxDB under high load
When InfluxDB reaches its load limits, it starts returning various errors that signal the client to back off and retry once the load drops. This is common for clusters sized close to capacity, since nobody wants to pay for unused cluster nodes.
Telegraf implements a variable (jittered) flush interval on its internal buffers, so that the load from multiple nodes is spread over time and synchronized clients do not repeatedly send their write batches to InfluxDB at the same moment.
This feature has been implemented in the feature/batching-interval-jitter branch; it has not yet been merged into master.
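The jitter idea above can be sketched as follows (an illustrative helper, not Telegraf's actual Go implementation; all names are hypothetical):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a jittered flush interval: each flush waits the base interval
// plus a random extra delay, so clients that started at the same time do
// not all hit InfluxDB in synchronized spikes.
public class FlushJitter {
    /** Returns a flush delay in ms: baseMillis plus up to jitterMillis extra. */
    public static long nextFlushDelay(long baseMillis, long jitterMillis) {
        long jitter = jitterMillis > 0
                ? ThreadLocalRandom.current().nextLong(jitterMillis + 1)
                : 0;
        return baseMillis + jitter;
    }
}
```

With a 10 s base interval and 5 s of jitter, each client flushes somewhere in the 10–15 s window, spreading the write spikes.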
There should be a configurable connection timeout; by default, a write to InfluxDB should not take more than 5 s with the default batch size (1000).
Telegraf retries writes on certain types of failures.
InfluxDB maintains an input cache (configured by the cache-max-memory-size option). When the limit is reached, InfluxDB rejects further writes with the message: ("cache-max-memory-size exceeded: (%d/%d)", n, limit)
This is what the response looks like (with the InfluxDB configuration parameter cache-max-memory-size = 104):
HTTP/1.1 500 Internal Server Error
Content-Encoding: gzip
Content-Type: application/json
Request-Id: 582b0222-f064-11e7-801f-000000000000
X-Influxdb-Version: 1.2.2
Date: Wed, 03 Jan 2018 08:59:02 GMT
Content-Length: 87
{"error":"engine: cache-max-memory-size exceeded: (1440/104)"}
It makes sense to retry the request once the cache has drained.
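A client-side retry loop for this case might look like the sketch below (the WriteAttempt shape and backoff values are illustrative, not a real client API):

```java
import java.util.function.Supplier;

// Sketch: if a write fails because the input cache is full, wait and
// retry with exponential backoff; give up on other errors.
public class CacheRetry {
    /** attempt returns null on success, or the server error message.
     *  Returns true if the write eventually succeeded. */
    public static boolean writeWithRetry(Supplier<String> attempt, int maxRetries) {
        long delayMillis = 100; // initial backoff
        for (int i = 0; i <= maxRetries; i++) {
            String error = attempt.get();
            if (error == null) {
                return true; // write succeeded
            }
            if (!error.contains("cache-max-memory-size exceeded")) {
                return false; // not a retryable condition in this sketch
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            delayMillis *= 2; // exponential backoff
        }
        return false; // cache never drained within maxRetries
    }
}
```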
Here is a list of errors returned by InfluxDB for which it does not make sense to retry the write, because the points were already written or a repeated write would fail again:
- Points with conflicting types would get stuck in the buffer forever.
- This error indicates that the point is older than the retention policy permits and is probably not a cause for concern; retrying will not help unless the retention policy is modified.
- This error indicates a bug in the client or in InfluxDB's line-protocol parsing; retries will not succeed.
- This is just an informational message.
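A client can encode this retryable/permanent distinction as a simple classifier. The message fragments below are illustrative examples matching the categories above, not an exhaustive list of InfluxDB's error strings:

```java
// Sketch: some InfluxDB errors are worth retrying (transient overload),
// others are permanent (bad data, points too old) and must be dropped.
public class RetryClassifier {
    private static final String[] PERMANENT = {
        "field type conflict",            // conflicting types: would fail forever
        "points beyond retention policy", // too old: retry cannot help
        "unable to parse",                // malformed line protocol
    };

    public static boolean isRetryable(String error) {
        for (String fragment : PERMANENT) {
            if (error.contains(fragment)) {
                return false;
            }
        }
        // Anything else (e.g. cache-max-memory-size exceeded, timeouts)
        // is treated as transient in this sketch.
        return true;
    }
}
```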
Telegraf keeps retrying failed measurement writes until they succeed; there is no limit on the number of retries. See line 126 below:
https://github.com/influxdata/telegraf/blob/master/internal/models/running_output.go/#L126
Telegraf checks for specific errors from InfluxDB so that measurement writes that cannot be fixed by retrying are not resubmitted over and over:
https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/influxdb.go
There is a limit on the buffer of failed writes (10000 entries by default). When the buffer fills up, additional failed writes replace the oldest entries.
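This "drop the oldest on overflow" buffer can be sketched with a bounded deque (an illustrative structure, not Telegraf's actual implementation):

```java
import java.util.ArrayDeque;

// Sketch of a bounded retry buffer: failed write batches are kept for a
// later retry; once the buffer is full, the oldest entry is evicted to
// make room for the newest one.
public class RetryBuffer<T> {
    private final int capacity;
    private final ArrayDeque<T> entries = new ArrayDeque<>();

    public RetryBuffer(int capacity) {
        this.capacity = capacity;
    }

    public void add(T failedBatch) {
        if (entries.size() == capacity) {
            entries.removeFirst(); // evict the oldest failed write
        }
        entries.addLast(failedBatch);
    }

    public int size() { return entries.size(); }

    public T oldest() { return entries.peekFirst(); }
}
```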
The default HTTP client timeout for Telegraf is 5 seconds, as shown here:
https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/client/http.go
and
https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/#clienttimeouts
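As an illustration of such a client-side limit (in Java rather than Telegraf's Go), a 5-second timeout on both the connection and the whole write request can be configured like this; the endpoint URL is a placeholder:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch: a 5-second connect timeout plus a 5-second per-request timeout,
// analogous to Telegraf's default HTTP client timeout.
public class WriteTimeout {
    public static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5)) // connection timeout
                .build();
    }

    public static HttpRequest buildWrite(String lineProtocol) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8086/write?db=telegraf"))
                .timeout(Duration.ofSeconds(5)) // fail the write after 5 s
                .POST(HttpRequest.BodyPublishers.ofString(lineProtocol))
                .build();
    }
}
```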
influxdb-java uses the default Retrofit timeouts:
- Connection timeout: ten seconds
- Read timeout: ten seconds
- Write timeout: ten seconds
These timeouts cannot currently be configured (see https://futurestud.io/tutorials/retrofit-2-customize-network-timeouts).
Hoan: the timeouts can be configured via an OkHttpClient.Builder passed to InfluxDBImpl, as follows:
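A sketch of that approach (assuming an influxdb-java version whose InfluxDBFactory.connect accepts an OkHttpClient.Builder; the timeout values are examples):

```java
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;

// Sketch: pass a pre-configured OkHttpClient.Builder so the connect/read/
// write timeouts differ from the 10-second defaults.
OkHttpClient.Builder okHttpBuilder = new OkHttpClient.Builder()
        .connectTimeout(5, TimeUnit.SECONDS)
        .readTimeout(30, TimeUnit.SECONDS)
        .writeTimeout(30, TimeUnit.SECONDS);

InfluxDB influxDB = InfluxDBFactory.connect(
        "http://localhost:8086", "user", "password", okHttpBuilder);
```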
It has been implemented in #396
It has been implemented in #410
There is a benchmarking tool used to measure InfluxDB performance. Its behavior should be in sync with how clients behave, so that its results can be reproduced in the real world. The benchmarking tool listens for error messages from the server and slows down when certain errors are reported:
- engine: cache maximum memory size exceeded
- write failed: hinted handoff queue not empty
- write failed: read message type: read tcp
- i/o timeout
- write failed: engine: cache-max-memory-size exceeded
- timeout
- write failed: can not exceed max connections of 500
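The tool's backpressure check can be sketched as a substring match against the signatures above (the class and method names are illustrative):

```java
// Sketch of the benchmarking tool's backpressure check: if the server's
// error message matches one of the known overload signatures, the tool
// should slow down instead of treating the run as failed.
public class BackoffSignals {
    private static final String[] SLOW_DOWN = {
        "engine: cache maximum memory size exceeded",
        "write failed: hinted handoff queue not empty",
        "write failed: read message type: read tcp",
        "i/o timeout",
        "write failed: engine: cache-max-memory-size exceeded",
        "timeout",
        "write failed: can not exceed max connections of 500",
    };

    public static boolean shouldSlowDown(String serverError) {
        for (String signature : SLOW_DOWN) {
            if (serverError.contains(signature)) {
                return true;
            }
        }
        return false;
    }
}
```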