[Feature Discussion] Dealing with congestion by adding internal buffer handling options #2905

Closed
dirkdevriendt opened this issue Jun 10, 2017 · 7 comments

@dirkdevriendt

I was investigating how we might introduce Kafka into our dataflow (application data, monitoring, and operational commands, to and from IoT devices, InfluxDB, and our Docker cluster) to improve the robustness of the data streams, only to find that, despite all the required pieces being available, Telegraf seems to be a bad fit for passing data from one stage to the next.

If I understand correctly, during maintenance windows, network glitches, downtimes, etc., Telegraf would continue to poll/read inputs, and if/when the metric_buffer_limit is reached, it drops new data. Making this work transparently is why we're looking at Kafka in the first place. (Related issue: #2265)
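For context, the agent-level knobs involved are roughly the following; this is just an illustrative snippet with what I believe are the default values, not a recommendation:

```toml
# Illustrative [agent] settings (defaults, as far as I know)
[agent]
  interval = "10s"             # how often inputs are gathered
  metric_batch_size = 1000     # metrics written to an output per flush
  metric_buffer_limit = 10000  # per-output buffer; once this fills up, metrics get dropped
  flush_interval = "10s"       # how often outputs are flushed
```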

I do not know the Telegraf code, and @sparrc points out in #2240 that inputs and outputs are designed to be independent, so getting feedback from one to the other would require fundamental changes.
So I was hoping to take a step back, put a few thoughts out there, and see if this is useful input.

Handling full buffers

The output buffer is obviously an awesome feature, but I'd love to be able to mitigate a full buffer by:

  • making room by rolling the data (removing the oldest metrics)
  • making room by aggregating data (pushing the whole buffer through an aggregator, e.g. sum/avg depending on the metric, and doubling the time interval; probably needs "Aggregator plugins: Support historical data" #1992)
  • persisting to disk ("Output buffer persistence", #802)
  • optionally publishing an "I'm clogged up" message to the whole Telegraf process, so other components/plugins can decide what to do with it (could the built-in RPC features in Go make this possible without architectural changes?)

Reacting to output plugin state

If it knew what was going on elsewhere in the Telegraf process, something like a message queue input plugin would have better options for dealing with downstream congestion (stop polling / tell the source to keep replay info / store the offset / keep feeding for the benefit of still-active outputs / ...).

Buffers at other stages

Moving or duplicating the buffer feature to input/processor/aggregator plugins might be useful too, but only makes sense if there are triggers that would tell the respective plugins to start filling their buffers.

@danielnelson
Contributor

My current thoughts on this are that we should provide a way for an input to know when metrics have been successfully emitted by either one output or all outputs. The input can then ack queue messages or advance its offset.

making room by rolling the data

This is the current behavior.

making room by aggregating data

Metrics cannot be aggregated in general, but this could become an option as more aggregators are added.

persisting to disk

I'm not very interested in this; I think it is more trouble than it is worth.

optionally publishing an "I'm clogged up" message

Yeah, maybe; this might be covered by the ack signal mentioned above.

@dirkdevriendt
Author

Absolutely, such a signal that can be captured by "interested" plugins would be a very flexible solution.

@thannaske

Persisting to disk could be a very useful failover feature for Telegraf. For me personally it would be the best improvement I could imagine. I use Telegraf to monitor a bunch of automated devices/"things" in my home and at other places, and the machines running Telegraf don't have any internet failover, so all metrics collected during an internet blackout are lost. Another use case would be a failure of the output target, e.g. the InfluxDB instance. The file system of the machine running Telegraf is something you can rely on, even when the internet fails, the output machine fails, or other uncommon events happen.

So I don't agree with @danielnelson that this would be more trouble than it is worth. Please consider implementing a feature like the one described in #802.

@Anderen2

Dropping into the discussion here.

@thannaske wouldn't using Kafka or any of the MQ inputs/outputs solve this if the "ack / congestion" issue is fixed in Telegraf?

I would prefer that Telegraf did not acquire features that other products made for the task could do much better.

@thannaske

@Anderen2 I don't know how most users are using Telegraf, but in my case I would prefer not to set up an additional software stack just to buffer data in case of loss of connectivity. I'm running Telegraf on many small embedded *nix systems to collect sensor statistics. Those systems would not be capable of running Kafka or anything else in addition to Telegraf.

I also think you can't compare a configurable buffer with a fully-functional message pipeline; that seems like overkill to me.

@Anderen2

@thannaske Hmm, I see your point.
I am in the same boat as @dirkdevriendt: we'd like to use Kafka as a buffering queue for when InfluxDB is unavailable for some reason.

However, this is quite ineffective when Telegraf keeps eating the queue and tosses everything away once its internal buffer is full.

To avoid I/O congestion we'd also prefer that the Telegraf buffers could be in-memory only (I assume disk won't be as write-efficient as Kafka). But I have nothing against persisting buffers as long as it can be toggled in the configuration.

As a side note, have you looked into MQTT? It's a quite lightweight protocol created specifically for telemetry from low-end devices, and there are tons of brokers for it, both feature-rich and lightweight ones.
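For what it's worth, a minimal Telegraf mqtt_consumer input would look something like the sketch below (broker address and topic are placeholders; double-check option names against the plugin's README):

```toml
# Rough sketch -- broker address and topic are placeholders
[[inputs.mqtt_consumer]]
  servers = ["tcp://127.0.0.1:1883"]  # MQTT broker(s) to connect to
  topics = ["sensors/#"]              # topics to subscribe to
  qos = 1                             # ask the broker for at-least-once delivery
  data_format = "influx"              # parse payloads as InfluxDB line protocol
```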

@danielnelson
Contributor

This has been addressed for the 1.9.0 release in #4938. The Kafka consumer, and other queue consumers, have a new option, max_undelivered_messages = 1000, which limits how many messages may be read before they have been delivered by the outputs. When metrics are sent successfully, the offset is updated. In order to always trigger an immediate flush, this value should be around the same size as the agent's metric_batch_size.
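As a rough sketch (broker and topic names are placeholders, not a recommended setup), the relevant pieces of configuration fit together like this:

```toml
# Rough sketch -- broker and topic names are placeholders
[agent]
  metric_batch_size = 1000           # size of the batches written by outputs

[[inputs.kafka_consumer]]
  brokers = ["localhost:9092"]
  topics = ["telegraf"]
  max_undelivered_messages = 1000    # messages read but not yet delivered to outputs
  data_format = "influx"
```

With the two values roughly matched, the consumer can always hand the output a full batch, so writes are not held back until flush_interval.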

This is a pretty big change for the queue consumers, so I would appreciate any testing that can be done.

@danielnelson danielnelson added this to the 1.9.0 milestone Nov 12, 2018
@danielnelson danielnelson added the feature request Requests for new plugin and for new features to existing plugins label Nov 12, 2018