
Simple TICKscript that will always error #1217

Open
desa opened this issue Feb 22, 2017 · 21 comments

@desa
Contributor

desa commented Feb 22, 2017

Duplicate of influxdata/telegraf#2444

Bug report

Hey. For some reason, the system measurement (load1, load5, load15, uptime, etc.) is sent by Telegraf as two distinct lines:

system,host=TICKAlerta load1=0,load5=0.03,load15=0.03,n_users=1i,n_cpus=2i 1487579590000000000
system,host=TICKAlerta uptime_format=" 0:13",uptime=807i 1487579590000000000

This essentially breaks stream processing of the system measurement with Kapacitor, as the relevant field is missing 50% of the time:

E! error evaluating expression for level CRITICAL: no field or tag exists for load1

This can be verified by looking at the data points as seen by Kapacitor:

{
 "Name": "system",
 "Database": "telegraf",
 "RetentionPolicy": "default",
 "Group": "host=host1.ex",
 "Dimensions": {
   "ByName": false,
   "TagNames": [
     "host"
   ]
 },
 "Tags": {
   "environment": "offsite",
   "host": "host1.ex",
   "osname": "Ubuntu",
   "virtual": "physical"
 },
 "Fields": {
   "load1": 0,
   "load15": 0.05,
   "load5": 0.01,
   "n_cpus": 4,
   "n_users": 0
 },
 "Time": "2017-02-14T12:38:40Z"
}
{
 "Name": "system",
 "Database": "telegraf",
 "RetentionPolicy": "default",
 "Group": "host=host1.ex",
 "Dimensions": {
   "ByName": false,
   "TagNames": [
     "host"
   ]
 },
 "Tags": {
   "environment": "offsite",
   "host": "host1.ex",
   "osname": "Ubuntu",
   "virtual": "physical"
 },
 "Fields": {
   "uptime": 5278035,
   "uptime_format": "61 days,  2:07"
 },
 "Time": "2017-02-14T12:38:40Z"
}

Relevant telegraf.conf:

[global_tags]
[agent]
 interval = "10s"
 round_interval = true

 metric_batch_size = 1000

 metric_buffer_limit = 10000

 collection_jitter = "0s"

 flush_interval = "10s"
 flush_jitter = "0s"

 precision = ""

 debug = false
 quiet = false
 logfile = ""

 hostname = ""
 omit_hostname = false

[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]

  data_format = "influx"

[[inputs.system]]
 # no configuration

System info:

As far as I know, this is present in Telegraf versions 1.0, 1.1 and 1.2. Tested on Ubuntu and Debian LTS versions (precise, trusty, xenial, jessie).

Steps to reproduce:

Telegraf

Use the included telegraf config file.

telegraf --config telegraf.conf --debug
cat /tmp/metrics.out

Kapacitor

var warn_threshold = 4
var crit_threshold = 10

var period = 1h
var every = 1m

var data = stream
 |from()
   .database('telegraf')
   .retentionPolicy('default')
   .measurement('system')
   .groupBy('host')
 |log()
 |window()
   .period(period)
   .every(every)
 |last('load1')
   .as('stat')

grep load1 /var/log/kapacitor/kapacitor.log

Expected behavior:

Single line for system measurement.

Actual behavior:

Two distinct lines for the system measurement.

Since the data for a series is split into two lines, InfluxDB forwards it on to Kapacitor as two lines, and therefore it's possible to write a TICKscript that will perpetually fail:

var data = stream
    |from()
        .measurement('system')
    |eval(lambda: "load1" / "uptime")
        .as('ratio')

with

kapacitor define example -tick example.tick -dbrp telegraf.autogen -type stream
kapacitor enable example

will perpetually error out, even though all of the data exists.

As I see it, there are four ways to solve this problem.

  1. Kapacitor can deduplicate incoming lines that share a series key and timestamp.
  2. InfluxDB can deduplicate the data it reports in subscriptions for lines that share a series key and timestamp.
  3. Telegraf can report the data as a single line.
  4. We can provide documentation that notes this as a rough edge and gives users a workaround.

I had initially directed @markuskont to open an issue on Telegraf, but there's definitely more than one way to solve this issue.

@rossmcdonald
Contributor

I think the easiest solution here will be to fix Telegraf, which is really sending one point on two separate lines. In my opinion, this type of formatting is also technically not valid line protocol. From the docs:

Each line, separated by the newline character \n, represents a single point in InfluxDB

In other words, a single point should be represented by a single line. The only thing preventing the first point from being overwritten is the fact that it's using different field keys. Trying to get InfluxDB and/or Kapacitor to account for this edge case will require a lot of effort to solve in a reliable way.
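
For the example in the original report, a merged write on a single line would look something like this (reconstructed by hand from the two lines shown above):

system,host=TICKAlerta load1=0,load5=0.03,load15=0.03,n_users=1i,n_cpus=2i,uptime=807i,uptime_format=" 0:13" 1487579590000000000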

@desa
Contributor Author

desa commented Feb 22, 2017

I agree that Telegraf is the easiest way to fix the issue, but as @sparrc notes, there are multiple reasons why a metric may be split:

IMO it should be fixed in Kapacitor then.

Metrics can be split for any number of reasons, including by the UDP client itself in order to split metrics with many fields into multiple packets. Telegraf doesn't have any guarantee that all fields for a single measurement will arrive on a single line-protocol line and the InfluxDB client doesn't guarantee that either.

IMO this is a good design decision from InfluxDB's perspective and I think Kapacitor should do the same.

@desa
Contributor Author

desa commented Feb 22, 2017

I think there's a larger question here about how/where we should talk about these kinds of issues that have cross platform implications.

@sparrc

sparrc commented Feb 22, 2017

Trying to get InfluxDB and/or Kapacitor to account for this edge case will require a lot of effort to solve in a reliable way.

I see that the documentation seems to state that each point has to be on its own line, but InfluxDB accepts metrics that are split across multiple lines/inserts. Breaking this behavior would be a huge change.

For example, this works in InfluxDB (and is how Telegraf expects InfluxDB to work):

> insert test valueA=99 1487776487694340900
> insert test valueB=99 1487776487694340900
> select * from test
name: test
time                valueA valueB
----                ------ ------
1487776487694340900 99     99

@markuskont

I believe I stated in my original report that Kapacitor batch processing works properly, as the query is offloaded to InfluxDB. However, stream nodes will fail.

I understand perfectly well the reasoning for such a split. For example, today I witnessed the same behavior with the ceph input plugin, which produces a great deal more metrics. Forwarding too many fields on a single line becomes unfeasible at some point.

Nevertheless, this issue is very confusing from a user standpoint. Stream is the recommended operating mode for Kapacitor, and most documentation examples are built upon it. One would assume that a TICKscript which works fine with the cpu measurement should yield a similar result when modified for system, but this simply isn't the case. It is also likely to cause problems for people who try to use TICKscript templates and variables.

@nathanielc
Contributor

I do have two guesses at a solution, but first I think it is important to frame the problem in terms of how Kapacitor and InfluxDB define "points".

InfluxDB's behavior is to implicitly merge points that have the same timestamp. This is a result of how the data is stored on disk.

Kapacitor defines a point based on the definition given in the protocol:

Each line, separated by the newline character \n, represents a single point in InfluxDB

In other words, Kapacitor defines a point as any number of fields plus a timestamp that arrived on the same line.

So here are my two suggestions:

  • Have Kapacitor buffer points internally and merge points that have the same timestamp. This would be a breaking change, and to me it feels like a non-starter for that reason.
  • Change the client that Telegraf uses to write to InfluxDB. Since the client batches writes (i.e. it already has a buffer), it would be trivial to merge duplicate points before a write is sent over the wire (see the sketch below). This approach preserves Kapacitor's definition of a point and has no effect on InfluxDB. @sparrc Can you comment on the feasibility of such a change? My thinking is that a quick check to merge points with the same timestamp and measurement into a single point before a batch write would be easy to implement.
    And points across batch writes are not likely to have the same timestamp. Is that a true assumption?
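
To make the second suggestion concrete, here is a minimal Go sketch of that kind of pre-write merge. This is not Telegraf's actual client code: the Metric type, seriesKey, and mergeBatch are hypothetical names for illustration only, and a real implementation would have to key on the series exactly as Telegraf represents it (measurement plus sorted tag set) and the timestamp.

package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Metric is a simplified, hypothetical stand-in for a Telegraf metric.
type Metric struct {
	Name   string
	Tags   map[string]string
	Fields map[string]interface{}
	Time   time.Time
}

// seriesKey builds a key from measurement name, sorted tags, and timestamp,
// so that split points belonging to the same series and time collide.
func seriesKey(m Metric) string {
	tags := make([]string, 0, len(m.Tags))
	for k, v := range m.Tags {
		tags = append(tags, k+"="+v)
	}
	sort.Strings(tags)
	return m.Name + "," + strings.Join(tags, ",") + " " + m.Time.Format(time.RFC3339Nano)
}

// mergeBatch merges metrics that share a series key and timestamp into a
// single metric by unioning their fields, preserving first-seen order.
func mergeBatch(batch []Metric) []Metric {
	merged := make(map[string]*Metric)
	order := []string{}
	for _, m := range batch {
		key := seriesKey(m)
		if existing, ok := merged[key]; ok {
			for k, v := range m.Fields {
				existing.Fields[k] = v
			}
			continue
		}
		// Copy the field map so the original metric is not mutated.
		copyM := m
		copyM.Fields = map[string]interface{}{}
		for k, v := range m.Fields {
			copyM.Fields[k] = v
		}
		merged[key] = &copyM
		order = append(order, key)
	}
	out := make([]Metric, 0, len(order))
	for _, key := range order {
		out = append(out, *merged[key])
	}
	return out
}

func main() {
	// The two split system lines from the report collapse into one metric.
	ts := time.Unix(0, 1487579590000000000)
	batch := []Metric{
		{Name: "system", Tags: map[string]string{"host": "TICKAlerta"},
			Fields: map[string]interface{}{"load1": 0.0, "load5": 0.03, "load15": 0.03}, Time: ts},
		{Name: "system", Tags: map[string]string{"host": "TICKAlerta"},
			Fields: map[string]interface{}{"uptime": int64(807)}, Time: ts},
	}
	for _, m := range mergeBatch(batch) {
		fmt.Println(m.Name, m.Tags, m.Fields, m.Time.UnixNano())
	}
}

The open question above still applies: a per-batch merge like this cannot deduplicate points that land in different batches.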

@sparrc
Copy link

sparrc commented Feb 23, 2017

My thinking is that a quick check to merge points with the same timestamp and measurement into a single point before a batch write would be easy to implement. And points across batch writes are not likely to have the same timestamp. Is that a true assumption?

Unfortunately this is not true. If we assumed this to be true it would work most of the time, but would break in two scenarios that I can think of:

Telegraf will send a batch whenever its internal buffer fills up. If the user has a large number of metrics passing through Telegraf, then it will fill frequently, and metrics could have the same timestamp across many batches.

The other way this would break is one I already mentioned: UDP packets. We would need to change the behavior of Telegraf to reject sending UDP packets over a certain size rather than splitting them up into multiple points.

I'm not sure what the best solution for this is. It seems like we might want to make a separate output plugin for Kapacitor that would do its own buffering, holding onto multiple batches until it sees a metric with a new timestamp.

@phemmer

phemmer commented Feb 23, 2017

I personally like the idea of the stream node buffering points.
In addition to letting us merge points with the same series & timestamp, it would also let us re-order points that arrive out of time order (Kapacitor doesn't currently handle that very well).

@nathanielc
Contributor

@phemmer Interesting, I would have expected the cost of buffering to be too high. If you want to buffer and reorder points, how long are you willing to buffer the data, and at what penalty to the latency of your data?

@desa
Contributor Author

desa commented Feb 23, 2017

IMO buffering data like this should not be the default behavior; it should be opt-in. We could add a dedupe node that wants a stream and provides a stream, with a specific window of time during which the data will be buffered.

@phemmer

phemmer commented Feb 23, 2017

I would think it would be configurable, that way each user can tailor it to their use case. Something like stream().bufferSize(100).bufferTime(1s), which would buffer up to 100 points, and for a max of 1 second. The time could either be measured as wall time since the oldest point was received (and when that time expires, it and all points sorted before it are flushed), or as the timestamp difference between the oldest point and the newest point.

There are 2 main cases I can think of where buffering would be very useful for fixing out-of-order points.

  1. Telegraf. If Telegraf emits at a 10-second interval, Kapacitor will have to buffer for more than 10 seconds. Otherwise host A could emit a point, and then 9.9 seconds later host B could emit a point with a timestamp before the other one.
  2. Time jitter due to network latency and clock synchronization among hosts.

I think the main reason this isn't more of a problem is that most people group by host, so within the same host the points are in order. But if you want to operate on data across hosts, then you're in a major pickle. We've tried to do this and have had to change designs (sometimes by not using kapacitor) because of the issue.

@nathanielc
Contributor

@phemmer Thanks this is great feedback!

@phemmer

phemmer commented Feb 24, 2017

Some additional thoughts on the buffer idea.

The buffering should be done via a node, and not part of stream(). This is so that you can do something like inject a where() node before the buffer node, and filter out what gets buffered, reducing computation & memory requirements.

Also, nodes across different TICKscripts that are configured the same way, with parents that are also configured the same way, can be shared. This way N scripts buffering the same points don't use N times the memory.
For example, if you have one script that does stream()|eval(bar).as('baz')|where(abc)... and another script which does stream()|eval(bar).as('baz')|where(def), the stream() and eval() node objects can be shared by both scripts, as they'll both emit the same points.

 

The time could either be measured as wall time since the oldest point was received (and when that time expires, it and all points sorted before it are flushed), or as the timestamp difference between the oldest point and the newest point.

We should use the former. If you have a buffer time of 10 seconds and you somehow receive a point timestamped a year in the future, everything immediately becomes unbuffered, because every other point is more than 10 seconds before this errant point. And the errant point itself will never leave the buffer (or at least not until a year goes by), because nothing newer ever arrives to flush it out.
So in effect: a point comes in and gets sorted into the stack. If it's at the top of the stack, set a timer on the point to flush it out. When it flushes, all points before it get flushed with it.
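
Here is a minimal Go sketch of that flush rule. This is not Kapacitor's actual implementation; Point, ReorderBuffer, and the emit callback are hypothetical names for illustration only. Points are kept sorted by timestamp, a wall-clock timer is armed only when a point sorts to the top of the stack, and when the timer fires that point and everything before it is emitted in order.

package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Point is a hypothetical stand-in for a Kapacitor stream point.
type Point struct {
	Time   time.Time
	Fields map[string]interface{}
}

// ReorderBuffer holds points sorted by timestamp and flushes them a fixed
// wall-clock delay after the newest point arrived.
type ReorderBuffer struct {
	mu     sync.Mutex
	delay  time.Duration
	points []Point     // kept sorted by Point.Time
	emit   func(Point) // downstream callback
}

func NewReorderBuffer(delay time.Duration, emit func(Point)) *ReorderBuffer {
	return &ReorderBuffer{delay: delay, emit: emit}
}

// Add inserts the point in timestamp order. If it sorts to the top of the
// stack (it is the newest point buffered), a timer is armed; when the timer
// fires, that point and everything before it is flushed downstream in order.
func (b *ReorderBuffer) Add(p Point) {
	b.mu.Lock()
	i := sort.Search(len(b.points), func(i int) bool {
		return b.points[i].Time.After(p.Time)
	})
	b.points = append(b.points, Point{})
	copy(b.points[i+1:], b.points[i:])
	b.points[i] = p
	newest := i == len(b.points)-1
	b.mu.Unlock()

	if newest {
		t := p.Time
		time.AfterFunc(b.delay, func() { b.flushThrough(t) })
	}
}

// flushThrough emits every buffered point with a timestamp at or before t.
func (b *ReorderBuffer) flushThrough(t time.Time) {
	b.mu.Lock()
	n := sort.Search(len(b.points), func(i int) bool {
		return b.points[i].Time.After(t)
	})
	flushed := append([]Point(nil), b.points[:n]...)
	b.points = b.points[n:]
	b.mu.Unlock()

	for _, p := range flushed {
		b.emit(p)
	}
}

func main() {
	buf := NewReorderBuffer(50*time.Millisecond, func(p Point) {
		fmt.Println("emit", p.Time.UnixNano(), p.Fields)
	})
	now := time.Now()
	buf.Add(Point{Time: now, Fields: map[string]interface{}{"load1": 0.0}})
	// An out-of-order point that arrives later but carries an earlier
	// timestamp still comes out of the buffer in timestamp order.
	buf.Add(Point{Time: now.Add(-time.Second), Fields: map[string]interface{}{"load1": 0.5}})
	time.Sleep(200 * time.Millisecond)
}

With this rule, each point in the sketch is emitted at most one buffer delay after it arrives, regardless of how out of order the timestamps are, since older points simply ride along with the newest point's timer.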

@austinbutler

Is there any way to work around this until it's fixed?

@desa
Contributor Author

desa commented May 22, 2017

@austinbutler The best workaround I can currently think of is to use a batch task instead of a stream task.

@desa
Contributor Author

desa commented Jun 1, 2017

Just got another report of a similar issue with the mysql Telegraf plugin. Any thoughts on the matter, @danielnelson?

In the meantime, for v1.3 of Kapacitor I have something that should work as a temporary workaround. The idea is to use the newly added isPresent function.

var data = stream
    |from()
        .measurement('system')

var load = data|where(lambda: isPresent("load1"))
var uptime = data|where(lambda: isPresent("uptime"))

load
    |join(uptime)
      .as('l', 'u')
    |eval(lambda: "l.load1" / "u.uptime")
        .as('ratio')

This may need a bit more tweaking to make it work, but the general idea is there.

@desa
Contributor Author

desa commented Jun 1, 2017

@markuskont @austinbutler ^^^^

@danielnelson

@desa I agree with the points sparrc made; to me it seems any real fix must be done in Kapacitor. However, if we are adding metrics separately in Telegraf, we should probably change it. It looks like the mysql plugin does indeed do this; can you open an issue?

@austinbutler

@desa I worry that would just skip some points, is that not the case? In other words, it's not just making the error go away, is it? Will it wait to do the eval until both fields come in?

@ss682

ss682 commented Nov 6, 2017

@phemmer What is the difference between using a window and a buffer? Aren't they doing more or less the same thing, namely capturing data received over some time?
So if a window doesn't work with stream, why should a buffer work?

@phemmer

phemmer commented Nov 6, 2017

@ss682 The window node would have to hold at least one point in its buffer at all times, possibly more depending on how out-of-order your points may be (and this sizing would have to be hard-coded in the script, which may not work in dynamic environments). This means that your data will be delayed by at least whatever the interval between points is. Even with a minimal window of 1 point, Kapacitor already uses point buffering on a lot of different nodes, so that once you've chained several of them together, your data is horribly delayed. This is especially bad when your data comes in infrequently.

The proposed buffer node would be wall-clock time based, meaning that instead of holding N points it holds X seconds of points. The duration would likely be in the millisecond range, and the impact would be minimal.

I also don't believe a window node does point re-ordering, but I'm not sure on this, and it could be adjusted to do so if it does not.
