Handling errors of InfluxDB under high load
When InfluxDB reaches its load limits, it starts returning various errors that signal the client to back off and retry once the load drops. This is common for clusters sized close to capacity, since nobody wants to pay for unused cluster nodes.
Telegraf implements a variable (jittered) flush interval on its internal buffers, so that the load from multiple nodes is spread over time and synchronized clients do not repeatedly send their write batches to InfluxDB at the same moment.
This feature has been implemented in the feature/batching-interval-jitter branch; it has not yet been merged into master.
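The jitter idea above can be sketched as follows (an illustrative helper, not Telegraf's actual Go implementation; all names are hypothetical):

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a jittered flush interval: each flush waits the base interval
// plus a random extra delay, so clients that started at the same time do
// not all hit InfluxDB in synchronized spikes.
public class FlushJitter {
    /** Returns a flush delay in ms: baseMillis plus up to jitterMillis extra. */
    public static long nextFlushDelay(long baseMillis, long jitterMillis) {
        long jitter = jitterMillis > 0
                ? ThreadLocalRandom.current().nextLong(jitterMillis + 1)
                : 0;
        return baseMillis + jitter;
    }
}
```

With a 10 s base interval and 5 s of jitter, each client flushes somewhere in the 10–15 s window, spreading the write spikes.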
There should be a configurable connection timeout; by default, a write to InfluxDB should not take more than 5 s with the default batch size (1000).
Telegraf retries writes on certain types of failures.
InfluxDB maintains an input cache (configured by the cache-max-memory-size option). When the limit is reached, InfluxDB rejects further writes with the message: ("cache-max-memory-size exceeded: (%d/%d)", n, limit)
This is what the response looks like (with the InfluxDB configuration parameter cache-max-memory-size = 104):
HTTP/1.1 500 Internal Server Error
Content-Encoding: gzip
Content-Type: application/json
Request-Id: 582b0222-f064-11e7-801f-000000000000
X-Influxdb-Version: 1.2.2
Date: Wed, 03 Jan 2018 08:59:02 GMT
Content-Length: 87
{"error":"engine: cache-max-memory-size exceeded: (1440/104)"}
It makes sense to retry the request once the cache has drained.
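A client-side retry loop for this case might look like the sketch below (the WriteAttempt shape and backoff values are illustrative, not a real client API):

```java
import java.util.function.Supplier;

// Sketch: if a write fails because the input cache is full, wait and
// retry with exponential backoff; give up on other errors.
public class CacheRetry {
    /** attempt returns null on success, or the server error message.
     *  Returns true if the write eventually succeeded. */
    public static boolean writeWithRetry(Supplier<String> attempt, int maxRetries) {
        long delayMillis = 100; // initial backoff
        for (int i = 0; i <= maxRetries; i++) {
            String error = attempt.get();
            if (error == null) {
                return true; // write succeeded
            }
            if (!error.contains("cache-max-memory-size exceeded")) {
                return false; // not a retryable condition in this sketch
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            delayMillis *= 2; // exponential backoff
        }
        return false; // cache never drained within maxRetries
    }
}
```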
Here is a list of errors returned by InfluxDB for which it does not make sense to retry the write, because the points were already written or a repeated write would fail again:
- Points with conflicting types would get stuck in the buffer forever.
- This error indicates that the point is older than the retention policy permits and is probably not a cause for concern; retrying will not help unless the retention policy is modified.
- This error indicates a bug in the client or in InfluxDB's line-protocol parsing; retries will not succeed.
- This is just an informational message.
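A client can encode this retryable/permanent distinction as a simple classifier. The message fragments below are illustrative examples matching the categories above, not an exhaustive list of InfluxDB's error strings:

```java
// Sketch: some InfluxDB errors are worth retrying (transient overload),
// others are permanent (bad data, points too old) and must be dropped.
public class RetryClassifier {
    private static final String[] PERMANENT = {
        "field type conflict",            // conflicting types: would fail forever
        "points beyond retention policy", // too old: retry cannot help
        "unable to parse",                // malformed line protocol
    };

    public static boolean isRetryable(String error) {
        for (String fragment : PERMANENT) {
            if (error.contains(fragment)) {
                return false;
            }
        }
        // Anything else (e.g. cache-max-memory-size exceeded, timeouts)
        // is treated as transient in this sketch.
        return true;
    }
}
```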
Telegraf keeps retrying failed measurement writes until they succeed; there is no limit on the number of retries. See line 126 below:
https://github.com/influxdata/telegraf/blob/master/internal/models/running_output.go/#L126
Telegraf checks for specific errors from InfluxDB so that measurement writes that cannot be fixed by retrying are not resubmitted over and over:
https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/influxdb.go
There is a limit on the buffer of failed writes (10000 entries by default). When the buffer fills up, additional failed writes replace the oldest entries.
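This "drop the oldest on overflow" buffer can be sketched with a bounded deque (an illustrative structure, not Telegraf's actual implementation):

```java
import java.util.ArrayDeque;

// Sketch of a bounded retry buffer: failed write batches are kept for a
// later retry; once the buffer is full, the oldest entry is evicted to
// make room for the newest one.
public class RetryBuffer<T> {
    private final int capacity;
    private final ArrayDeque<T> entries = new ArrayDeque<>();

    public RetryBuffer(int capacity) {
        this.capacity = capacity;
    }

    public void add(T failedBatch) {
        if (entries.size() == capacity) {
            entries.removeFirst(); // evict the oldest failed write
        }
        entries.addLast(failedBatch);
    }

    public int size() { return entries.size(); }

    public T oldest() { return entries.peekFirst(); }
}
```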
The default HTTP client timeout for Telegraf is 5 seconds, as shown here:
https://github.com/influxdata/telegraf/blob/master/plugins/outputs/influxdb/client/http.go
and
https://blog.cloudflare.com/the-complete-guide-to-golang-net-http-timeouts/#clienttimeouts
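As an illustration of such a client-side limit (in Java rather than Telegraf's Go), a 5-second timeout on both the connection and the whole write request can be configured like this; the endpoint URL is a placeholder:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch: a 5-second connect timeout plus a 5-second per-request timeout,
// analogous to Telegraf's default HTTP client timeout.
public class WriteTimeout {
    public static HttpClient buildClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5)) // connection timeout
                .build();
    }

    public static HttpRequest buildWrite(String lineProtocol) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8086/write?db=telegraf"))
                .timeout(Duration.ofSeconds(5)) // fail the write after 5 s
                .POST(HttpRequest.BodyPublishers.ofString(lineProtocol))
                .build();
    }
}
```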
influxdb-java uses the default Retrofit timeouts:
- Connection timeout: ten seconds
- Read timeout: ten seconds
- Write timeout: ten seconds
These timeouts cannot currently be configured (see https://futurestud.io/tutorials/retrofit-2-customize-network-timeouts).
Hoan: the timeouts can be configured via an OkHttpClient.Builder passed to InfluxDBImpl, as follows:
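A sketch of that approach (assuming an influxdb-java version whose InfluxDBFactory.connect accepts an OkHttpClient.Builder; the timeout values are examples):

```java
import java.util.concurrent.TimeUnit;
import okhttp3.OkHttpClient;
import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;

// Sketch: pass a pre-configured OkHttpClient.Builder so the connect/read/
// write timeouts differ from the 10-second defaults.
OkHttpClient.Builder okHttpBuilder = new OkHttpClient.Builder()
        .connectTimeout(5, TimeUnit.SECONDS)
        .readTimeout(30, TimeUnit.SECONDS)
        .writeTimeout(30, TimeUnit.SECONDS);

InfluxDB influxDB = InfluxDBFactory.connect(
        "http://localhost:8086", "user", "password", okHttpBuilder);
```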
It has been implemented in #396
It has been implemented in #410
There is a benchmarking tool used to measure InfluxDB performance. Its behavior should be in sync with how clients behave, so that its results can be reproduced in the real world. The benchmarking tool listens for error messages from the server and slows down when certain errors are reported:
- engine: cache maximum memory size exceeded
- write failed: hinted handoff queue not empty
- write failed: read message type: read tcp
- i/o timeout
- write failed: engine: cache-max-memory-size exceeded
- timeout
- write failed: can not exceed max connections of 500
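The tool's backpressure check can be sketched as a substring match against the signatures above (the class and method names are illustrative):

```java
// Sketch of the benchmarking tool's backpressure check: if the server's
// error message matches one of the known overload signatures, the tool
// should slow down instead of treating the run as failed.
public class BackoffSignals {
    private static final String[] SLOW_DOWN = {
        "engine: cache maximum memory size exceeded",
        "write failed: hinted handoff queue not empty",
        "write failed: read message type: read tcp",
        "i/o timeout",
        "write failed: engine: cache-max-memory-size exceeded",
        "timeout",
        "write failed: can not exceed max connections of 500",
    };

    public static boolean shouldSlowDown(String serverError) {
        for (String signature : SLOW_DOWN) {
            if (serverError.contains(signature)) {
                return true;
            }
        }
        return false;
    }
}
```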