[Feature Discussion] Dealing with congestion by adding internal buffer handling options #2905

Closed
dirkdevriendt opened this issue Jun 10, 2017 · 7 comments

@dirkdevriendt

I was investigating how we might introduce Kafka into our dataflow (application data, monitoring, and operational commands, to and from IoT devices, InfluxDB, and our Docker cluster) to improve the robustness of the data streams, only to find that, despite all the required pieces being available, Telegraf seems to be a bad fit for passing data from one stage to the next.

If I understand correctly, during maintenance windows, network glitches, downtimes, etc., Telegraf would continue to poll/read inputs, and if/when the metric_buffer_limit is reached, it drops new data. Making this work transparently is why we're looking at Kafka in the first place. (Related issue: #2265)
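For context, the agent-level knobs involved are roughly the following; this is just an illustrative snippet with what I believe are the default values, not a recommendation:

```toml
# Illustrative [agent] settings (defaults, as far as I know)
[agent]
  interval = "10s"             # how often inputs are gathered
  metric_batch_size = 1000     # metrics written to an output per flush
  metric_buffer_limit = 10000  # per-output buffer; once this fills up, metrics get dropped
  flush_interval = "10s"       # how often outputs are flushed
```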

I do not know the Telegraf code, and @sparrc points out in #2240 that inputs and outputs are designed to be independent, so getting feedback from one to the other would require fundamental changes.
So I was hoping to take a step back, put a few thoughts out there, and see if this is useful input.

Handling full buffers

The output buffer is obviously an awesome feature, but I'd love to be able to mitigate a full buffer by:

  • making room by rolling the data (removing the oldest metrics)
  • making room by aggregating data (pushing the whole buffer through an aggregator, e.g. sum/avg depending on the metric, and doubling the time interval; probably needs "Aggregator plugins: Support historical data" #1992)
  • persisting to disk ("Output buffer persistence", #802)
  • optionally publishing an "I'm clogged up" message to the whole Telegraf process, so other components/plugins can decide what to do with it (could the built-in RPC features in Go make this possible without architectural changes?)

Reacting to output plugin state

If it knew what was going on elsewhere in the Telegraf process, something like a message queue input plugin would have better options for dealing with downstream congestion (stop polling / tell the source to keep replay info / store the offset / keep feeding for the benefit of still-active outputs / ...).

Buffers at other stages

Moving or duplicating the buffer feature to input/processor/aggregator plugins might be useful too, but only makes sense if there are triggers that would tell the respective plugins to start filling their buffers.

@danielnelson
Contributor

My current thoughts on this are that we should provide a way for an input to know when metrics have been successfully emitted by either one output or all outputs. The input can then ack queue messages or advance its offset.

making room by rolling the data

This is the current behavior.

making room by aggregating data

Metrics cannot be aggregated in general, but this could become an option as more aggregators are added.

persisting to disk

I'm not very interested in this; I think it is more trouble than it is worth.

optionally publishing an "I'm clogged up" message

Yeah, maybe; this might be covered by the ack signal mentioned above.

@dirkdevriendt
Author

Absolutely, such a signal that can be captured by "interested" plugins would be a very flexible solution.

@thannaske

Persisting to disk could be a very useful failover feature for Telegraf. For me personally it would be the best improvement I could imagine. I use Telegraf to monitor a bunch of automated devices/"things" in my home and at other places, and the machines running Telegraf don't have any internet failover, so all metrics collected during an internet blackout are lost. Another use case would be a failure of the output target, e.g. the InfluxDB instance. The file system of the machine running Telegraf is something you can rely on, even when the internet fails, the output machine fails, or other uncommon events happen.

So I don't agree with @danielnelson that this would be more trouble than it is worth. Please consider implementing a feature like the one described in #802.

@Anderen2

Dropping into the discussion here.

@thannaske wouldn't using Kafka or any of the MQ inputs/outputs solve this if the "ack / congestion" issue is fixed in Telegraf?

I would prefer that Telegraf did not acquire features that other products made for the task could do much better.

@thannaske

@Anderen2 I don't know how most users are using Telegraf, but in my case I would prefer not to set up an additional software stack just to buffer data in case of loss of connectivity. I'm running Telegraf on many small embedded *nix systems to collect sensor statistics. Those systems would not be capable of running Kafka or anything else in addition to Telegraf.

I also think you can't compare a configurable buffer with a fully-functional message pipeline; that seems like overkill to me.

@Anderen2

@thannaske Hmm, I see your point.
I am in the same boat as @dirkdevriendt: we'd like to use Kafka as a buffering queue for when InfluxDB is unavailable for some reason.

However, this is quite ineffective when Telegraf keeps eating the queue and tosses everything away once its internal buffer is full.

To avoid I/O congestion we'd also prefer that the Telegraf buffers could be in-memory only (I assume disk won't be as write-efficient as Kafka). But I have nothing against persisting buffers as long as it can be toggled in the configuration.

As a side note, have you looked into MQTT? It's a quite lightweight protocol created specifically for telemetry from low-end devices, and there are tons of brokers for it, both feature-rich and lightweight ones.
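For what it's worth, a minimal Telegraf mqtt_consumer input would look something like the sketch below (broker address and topic are placeholders; double-check option names against the plugin's README):

```toml
# Rough sketch -- broker address and topic are placeholders
[[inputs.mqtt_consumer]]
  servers = ["tcp://127.0.0.1:1883"]  # MQTT broker(s) to connect to
  topics = ["sensors/#"]              # topics to subscribe to
  qos = 1                             # ask the broker for at-least-once delivery
  data_format = "influx"              # parse payloads as InfluxDB line protocol
```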

@danielnelson
Contributor

This has been addressed for the 1.9.0 release in #4938. The Kafka consumer, and other queue consumers, have a new option, max_undelivered_messages = 1000, which limits how many messages may be read before they have been delivered by the outputs. When metrics are sent successfully, the offset is updated. In order to always trigger an immediate flush, this value should be around the same size as the agent's metric_batch_size.
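As a rough sketch (broker and topic names are placeholders, not a recommended setup), the relevant pieces of configuration fit together like this:

```toml
# Rough sketch -- broker and topic names are placeholders
[agent]
  metric_batch_size = 1000           # size of the batches written by outputs

[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["telegraf"]
  max_undelivered_messages = 1000    # messages read but not yet delivered to outputs
  data_format = "influx"
```

With the two values roughly matched, the consumer can always hand the output a full batch, so writes are not held back until flush_interval.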

This is a pretty big change for the queue consumers, so I would appreciate any testing that can be done.

@danielnelson danielnelson added this to the 1.9.0 milestone Nov 12, 2018
@danielnelson danielnelson added the feature request Requests for new plugin and for new features to existing plugins label Nov 12, 2018