
Simple TICKscript that will always error #1217

Open
desa opened this issue Feb 22, 2017 · 21 comments

@desa
Contributor

desa commented Feb 22, 2017

Duplicate of influxdata/telegraf#2444

Bug report

Hey. For some reason, the system measurement (load1, load5, load15, uptime, etc.) is sent by Telegraf as two distinct lines:

system,host=TICKAlerta load1=0,load5=0.03,load15=0.03,n_users=1i,n_cpus=2i 1487579590000000000
system,host=TICKAlerta uptime_format=" 0:13",uptime=807i 1487579590000000000

This essentially breaks stream processing of the system measurement with Kapacitor, as the relevant field is missing 50% of the time:

E! error evaluating expression for level CRITICAL: no field or tag exists for load1

This can be verified by looking at the data points as seen by Kapacitor:

{
 "Name": "system",
 "Database": "telegraf",
 "RetentionPolicy": "default",
 "Group": "host=host1.ex",
 "Dimensions": {
   "ByName": false,
   "TagNames": [
     "host"
   ]
 },
 "Tags": {
   "environment": "offsite",
   "host": "host1.ex",
   "osname": "Ubuntu",
   "virtual": "physical"
 },
 "Fields": {
   "load1": 0,
   "load15": 0.05,
   "load5": 0.01,
   "n_cpus": 4,
   "n_users": 0
 },
 "Time": "2017-02-14T12:38:40Z"
}
{
 "Name": "system",
 "Database": "telegraf",
 "RetentionPolicy": "default",
 "Group": "host=host1.ex",
 "Dimensions": {
   "ByName": false,
   "TagNames": [
     "host"
   ]
 },
 "Tags": {
   "environment": "offsite",
   "host": "host1.ex",
   "osname": "Ubuntu",
   "virtual": "physical"
 },
 "Fields": {
   "uptime": 5278035,
   "uptime_format": "61 days,  2:07"
 },
 "Time": "2017-02-14T12:38:40Z"
}

Relevant telegraf.conf:

[global_tags]
[agent]
 interval = "10s"
 round_interval = true

 metric_batch_size = 1000

 metric_buffer_limit = 10000

 collection_jitter = "0s"

 flush_interval = "10s"
 flush_jitter = "0s"

 precision = ""

 debug = false
 quiet = false
 logfile = ""

 hostname = ""
 omit_hostname = false

[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]

  data_format = "influx"

[[inputs.system]]
 # no configuration

System info:

As far as I know, this is present in Telegraf versions 1.0, 1.1 and 1.2. Tested on Ubuntu and Debian LTS versions (precise, trusty, xenial, jessie).

Steps to reproduce:

Telegraf

Use the included telegraf config file.

telegraf --config telegraf.conf --debug
cat /tmp/metrics.out

Kapacitor

var warn_threshold = 4
var crit_threshold = 10

var period = 1h
var every = 1m

var data = stream
 |from()
   .database('telegraf')
   .retentionPolicy('default')
   .measurement('system')
   .groupBy('host')
 |log()
 |window()
   .period(period)
   .every(every)
 |last('load1')
   .as('stat')

grep load1 /var/log/kapacitor/kapacitor.log

Expected behavior:

Single line for system measurement.

Actual behavior:

Two distinct lines for the system measurement.

Since the data for a series is split into two lines, InfluxDB forwards it on to Kapacitor as two lines, and therefore it's possible to write a TICKscript that will perpetually fail:

var data = stream
    |from()
        .measurement('system')
    |eval(lambda: "load1" / "uptime")
        .as('ratio')

with

kapacitor define example -tick example.tick -dbrp telegraf.autogen -type stream
kapacitor enable example

will perpetually error out, even though all of the data exists.

As I see it, there are four ways to solve this problem.

  1. Kapacitor can deduplicate incoming lines that share a series key and timestamp.
  2. InfluxDB can deduplicate the data it reports in subscriptions for lines that share a series key and timestamp.
  3. Telegraf can report the data as a single line.
  4. We can provide documentation that notes this as a rough edge and gives users a workaround.

I had initially directed @markuskont to open an issue on Telegraf, but there's definitely more than one way to solve this issue.

@rossmcdonald
Contributor

I think the easiest solution here will be to fix Telegraf, which is really sending one point on two separate lines. In my opinion, this type of formatting is also technically not valid line protocol. From the docs:

Each line, separated by the newline character \n, represents a single point in InfluxDB

In other words, a single point should be represented by a single line. The only thing preventing the first point from being overwritten is the fact that it's using different field keys. Trying to get InfluxDB and/or Kapacitor to account for this edge case will require a lot of effort to solve in a reliable way.
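
For the example in the original report, a merged write on a single line would look something like this (reconstructed by hand from the two lines shown above):

system,host=TICKAlerta load1=0,load5=0.03,load15=0.03,n_users=1i,n_cpus=2i,uptime=807i,uptime_format=" 0:13" 1487579590000000000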

@desa
Contributor Author

desa commented Feb 22, 2017

I agree that Telegraf is the easiest way to fix the issue, but as @sparrc notes, there are multiple reasons why a metric may be split:

IMO it should be fixed in Kapacitor then.

Metrics can be split for any number of reasons, including by the UDP client itself in order to split metrics with many fields into multiple packets. Telegraf doesn't have any guarantee that all fields for a single measurement will arrive on a single line-protocol line and the InfluxDB client doesn't guarantee that either.

IMO this is a good design decision from InfluxDB's perspective and I think Kapacitor should do the same.

@desa
Contributor Author

desa commented Feb 22, 2017

I think there's a larger question here about how/where we should talk about these kinds of issues that have cross platform implications.

@sparrc

sparrc commented Feb 22, 2017

Trying to get InfluxDB and/or Kapacitor to account for this edge case will require a lot of effort to solve in a reliable way.

I see that the documentation seems to state that each point has to be on its own line, but InfluxDB accepts metrics that are split across multiple lines/inserts. Breaking this behavior would be a huge change.

For example, this works in InfluxDB (and is how Telegraf expects InfluxDB to work):

> insert test valueA=99 1487776487694340900
> insert test valueB=99 1487776487694340900
> select * from test
name: test
time                valueA valueB
----                ------ ------
1487776487694340900 99     99

@markuskont

I believe I stated in my original report that Kapacitor batch processing works properly, as the query is offloaded to InfluxDB. However, stream nodes will fail.

I understand perfectly well the reasoning for such a split. For example, today I witnessed the same behavior with the ceph input plugin, which produces a great deal more metrics. Forwarding too many fields on a single line becomes unfeasible at some point.

Nevertheless, this issue is very confusing from a user standpoint. Stream is the recommended operating mode for Kapacitor, and most documentation examples are built upon it. One would assume that a TICKscript which works fine with the cpu measurement should yield a similar result when modified for system, but this simply isn't the case. It is also likely to cause problems for people who try to use TICKscript templates and variables.

@nathanielc
Contributor

I do have two guesses at a solution, but first I think it is important to frame the problem in terms of how Kapacitor and InfluxDB define "points".

InfluxDB's behavior is to implicitly merge points that have the same timestamp. This is a result of how the data is stored on disk.

Kapacitor defines a point based on the definition given in the protocol:

Each line, separated by the newline character \n, represents a single point in InfluxDB

In other words, Kapacitor defines a point as any number of fields plus a timestamp that arrived on the same line.

So here are my two suggestions:

  • Have Kapacitor buffer points internally and merge points that have the same timestamp. This would be a breaking change, and to me it feels like a non-starter for that reason.
  • Change the client that Telegraf uses to write to InfluxDB. Since the client batches writes (i.e. it already has a buffer), it would be trivial to merge duplicate points before a write is sent over the wire (see the sketch below). This approach preserves Kapacitor's definition of a point and has no effect on InfluxDB. @sparrc Can you comment on the feasibility of such a change? My thinking is that a quick check to merge points with the same timestamp and measurement into a single point before a batch write would be easy to implement.
    And points across batch writes are not likely to have the same timestamp. Is that a true assumption?
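
To make the second suggestion concrete, here is a minimal Go sketch of that kind of pre-write merge. This is not Telegraf's actual client code: the Metric type, seriesKey, and mergeBatch are hypothetical names for illustration only, and a real implementation would have to key on the series exactly as Telegraf represents it (measurement plus sorted tag set) and the timestamp.

package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Metric is a simplified, hypothetical stand-in for a Telegraf metric.
type Metric struct {
	Name   string
	Tags   map[string]string
	Fields map[string]interface{}
	Time   time.Time
}

// seriesKey builds a key from measurement name, sorted tags, and timestamp,
// so that split points belonging to the same series and time collide.
func seriesKey(m Metric) string {
	tags := make([]string, 0, len(m.Tags))
	for k, v := range m.Tags {
		tags = append(tags, k+"="+v)
	}
	sort.Strings(tags)
	return m.Name + "," + strings.Join(tags, ",") + " " + m.Time.Format(time.RFC3339Nano)
}

// mergeBatch merges metrics that share a series key and timestamp into a
// single metric by unioning their fields, preserving first-seen order.
func mergeBatch(batch []Metric) []Metric {
	merged := make(map[string]*Metric)
	order := []string{}
	for _, m := range batch {
		key := seriesKey(m)
		if existing, ok := merged[key]; ok {
			for k, v := range m.Fields {
				existing.Fields[k] = v
			}
			continue
		}
		// Copy the field map so the original metric is not mutated.
		copyM := m
		copyM.Fields = map[string]interface{}{}
		for k, v := range m.Fields {
			copyM.Fields[k] = v
		}
		merged[key] = &copyM
		order = append(order, key)
	}
	out := make([]Metric, 0, len(order))
	for _, key := range order {
		out = append(out, *merged[key])
	}
	return out
}

func main() {
	// The two split system lines from the report collapse into one metric.
	ts := time.Unix(0, 1487579590000000000)
	batch := []Metric{
		{Name: "system", Tags: map[string]string{"host": "TICKAlerta"},
			Fields: map[string]interface{}{"load1": 0.0, "load5": 0.03, "load15": 0.03}, Time: ts},
		{Name: "system", Tags: map[string]string{"host": "TICKAlerta"},
			Fields: map[string]interface{}{"uptime": int64(807)}, Time: ts},
	}
	for _, m := range mergeBatch(batch) {
		fmt.Println(m.Name, m.Tags, m.Fields, m.Time.UnixNano())
	}
}

The open question above still applies: a per-batch merge like this cannot deduplicate points that land in different batches.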

@sparrc
Copy link

sparrc commented Feb 23, 2017

My thinking is that a quick check to merge points with the same timestamp and measurement into a single point before a batch write would be easy to implement. And points across batch writes are not likely to have the same timestamp. Is that a true assumption?

Unfortunately this is not true. If we assumed this to be true it would work most of the time, but would break in two scenarios that I can think of:

Telegraf will send a batch whenever its internal buffer fills up. If the user has a large number of metrics passing through Telegraf, then it will fill frequently, and metrics could have the same timestamp across many batches.

The other way this would break is one I already mentioned: UDP packets. We would need to change the behavior of Telegraf to reject sending UDP packets over a certain size rather than splitting them up into multiple points.

I'm not sure what the best solution for this is. It seems like we might want to make a separate output plugin for Kapacitor that would do its own buffering, holding onto multiple batches until it sees a metric with a new timestamp.

@phemmer

phemmer commented Feb 23, 2017

I personally like the idea of the stream node buffering points.
In addition to letting us merge points with the same series & timestamp, it would also let us re-order points that arrive out of time order (Kapacitor doesn't currently handle that very well).

@nathanielc
Contributor

@phemmer Interesting, I would have expected the cost of buffering to be too high. If you want to buffer and reorder points, how long are you willing to buffer the data, and at what penalty to the latency of your data?

@desa
Contributor Author

desa commented Feb 23, 2017

IMO buffering data like this should not be the default behavior; it should be opt-in. We could add a dedupe node that wants a stream and provides a stream, with a specific window of time during which the data will be buffered.

@phemmer

phemmer commented Feb 23, 2017

I would think it would be configurable, that way each user can tailor it to their use case. Something like stream().bufferSize(100).bufferTime(1s), which would buffer up to 100 points, and for a max of 1 second. The time could either be measured as wall time since the oldest point was received (and when that time expires, it and all points sorted before it are flushed), or as the timestamp difference between the oldest point and the newest point.

There are 2 main cases I can think of where buffering would be very useful for fixing out-of-order points.

  1. Telegraf. If Telegraf emits at a 10-second interval, Kapacitor will have to buffer for more than 10 seconds. Otherwise host A could emit a point, and then 9.9 seconds later host B could emit a point with a timestamp before the other one.
  2. Time jitter due to network latency and clock synchronization among hosts.

I think the main reason this isn't more of a problem is that most people group by host, so within the same host the points are in order. But if you want to operate on data across hosts, then you're in a major pickle. We've tried to do this and have had to change designs (sometimes by not using kapacitor) because of the issue.

@nathanielc
Contributor

@phemmer Thanks this is great feedback!

@phemmer

phemmer commented Feb 24, 2017

Some additional thoughts on the buffer idea.

The buffering should be done via a node, and not part of stream(). This is so that you can do something like inject a where() node before the buffer node, and filter out what gets buffered, reducing computation & memory requirements.

Also, nodes across different TICKscripts that are configured the same way, with parents that are also configured the same way, can be shared. This way N scripts buffering the same points don't use N times the memory.
For example, if you have one script that does stream()|eval(bar).as('baz')|where(abc)... and another script which does stream()|eval(bar).as('baz')|where(def), the stream() and eval() node objects can be shared by both scripts, as they'll both emit the same points.

 

The time could either be measured as wall time since the oldest point was received (and when that time expires, it and all points sorted before it are flushed), or as the timestamp difference between the oldest point and the newest point.

We should use the former. If you have a buffer time of 10 seconds and you somehow receive a point timestamped a year in the future, everything immediately becomes unbuffered, because every other point is more than 10 seconds before this errant point. And the errant point itself will never leave the buffer (or at least not until a year goes by), because nothing newer ever arrives to flush it out.
So in effect: a point comes in and gets sorted into the stack. If it's at the top of the stack, set a timer on the point to flush it out. When it flushes, all points before it get flushed with it.
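
Here is a minimal Go sketch of that flush rule. This is not Kapacitor's actual implementation; Point, ReorderBuffer, and the emit callback are hypothetical names for illustration only. Points are kept sorted by timestamp, a wall-clock timer is armed only when a point sorts to the top of the stack, and when the timer fires that point and everything before it is emitted in order.

package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Point is a hypothetical stand-in for a Kapacitor stream point.
type Point struct {
	Time   time.Time
	Fields map[string]interface{}
}

// ReorderBuffer holds points sorted by timestamp and flushes them a fixed
// wall-clock delay after the newest point arrived.
type ReorderBuffer struct {
	mu     sync.Mutex
	delay  time.Duration
	points []Point     // kept sorted by Point.Time
	emit   func(Point) // downstream callback
}

func NewReorderBuffer(delay time.Duration, emit func(Point)) *ReorderBuffer {
	return &ReorderBuffer{delay: delay, emit: emit}
}

// Add inserts the point in timestamp order. If it sorts to the top of the
// stack (it is the newest point buffered), a timer is armed; when the timer
// fires, that point and everything before it is flushed downstream in order.
func (b *ReorderBuffer) Add(p Point) {
	b.mu.Lock()
	i := sort.Search(len(b.points), func(i int) bool {
		return b.points[i].Time.After(p.Time)
	})
	b.points = append(b.points, Point{})
	copy(b.points[i+1:], b.points[i:])
	b.points[i] = p
	newest := i == len(b.points)-1
	b.mu.Unlock()

	if newest {
		t := p.Time
		time.AfterFunc(b.delay, func() { b.flushThrough(t) })
	}
}

// flushThrough emits every buffered point with a timestamp at or before t.
func (b *ReorderBuffer) flushThrough(t time.Time) {
	b.mu.Lock()
	n := sort.Search(len(b.points), func(i int) bool {
		return b.points[i].Time.After(t)
	})
	flushed := append([]Point(nil), b.points[:n]...)
	b.points = b.points[n:]
	b.mu.Unlock()

	for _, p := range flushed {
		b.emit(p)
	}
}

func main() {
	buf := NewReorderBuffer(50*time.Millisecond, func(p Point) {
		fmt.Println("emit", p.Time.UnixNano(), p.Fields)
	})
	now := time.Now()
	buf.Add(Point{Time: now, Fields: map[string]interface{}{"load1": 0.0}})
	// An out-of-order point that arrives later but carries an earlier
	// timestamp still comes out of the buffer in timestamp order.
	buf.Add(Point{Time: now.Add(-time.Second), Fields: map[string]interface{}{"load1": 0.5}})
	time.Sleep(200 * time.Millisecond)
}

With this rule, each point in the sketch is emitted at most one buffer delay after it arrives, regardless of how out of order the timestamps are, since older points simply ride along with the newest point's timer.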

@austinbutler

Is there any way to work around this until it's fixed?

@desa
Contributor Author

desa commented May 22, 2017

@austinbutler The best workaround I can currently think of is to use a batch task instead of a stream task.

@desa
Contributor Author

desa commented Jun 1, 2017

Just got another report of a similar issue with the mysql Telegraf plugin. Any thoughts on the matter, @danielnelson?

In the meantime, for v1.3 of Kapacitor I have something that should work as a temporary workaround. The idea is to use the newly added isPresent function.

var data = stream
    |from()
        .measurement('system')

var load = data|where(lambda: isPresent("load1"))
var uptime = data|where(lambda: isPresent("uptime"))

load
    |join(uptime)
      .as('l', 'u')
    |eval(lambda: "l.load1" / "u.uptime")
        .as('ratio')

This may need a bit more tweaking to make it work, but the general idea is there.

@desa
Contributor Author

desa commented Jun 1, 2017

@markuskont @austinbutler ^^^^

@danielnelson

@desa I agree with the points sparrc made; to me it seems any real fix must be done in Kapacitor. However, if we are adding metrics separately in Telegraf, we should probably change it. It looks like the mysql plugin does indeed do this; can you open an issue?

@austinbutler

@desa I worry that would just skip some points, is that not the case? In other words, it's not just making the error go away, is it? Will it wait to do the eval until both fields come in?

@ss682

ss682 commented Nov 6, 2017

@phemmer What is the difference between using a window and a buffer? Aren't they doing more or less the same thing, namely capturing data received over some time?
So if a window doesn't work with stream, why should a buffer work?

@phemmer

phemmer commented Nov 6, 2017

@ss682 The window node would have to hold at least one point in its buffer at all times, possibly more depending on how out-of-order your points may be (and this sizing would have to be hard-coded in the script, which may not work in dynamic environments). This means that your data will be delayed by at least whatever the interval between points is. Even with a minimal window of 1 point, Kapacitor already uses point buffering on a lot of different nodes, so that once you've chained several of them together, your data is horribly delayed. This is especially bad when your data comes in infrequently.

The proposed buffer node would be wall-clock time based, meaning that instead of holding N points it holds X seconds of points. The duration would likely be in the millisecond range, and the impact would be minimal.

I also don't believe a window node does point re-ordering, but I'm not sure on this, and it could be adjusted to do so if it does not.
