
0.9.6.1 shows gaps #5214

Closed
pfischermx opened this issue Dec 23, 2015 · 5 comments

@pfischermx

First off, I think this version has been pretty stable so far, congrats for that.

However, since we upgraded to this version we have been seeing temporary gaps that then disappear on their own.

As a reminder, we push metrics via UDP (we have a few UDP ports and hash our metrics across them). I looked into our injection logs and they do not show any gap on the write side (the injection code also has not changed in recent weeks), and the same goes for the InfluxDB logs. The closest thing to a non-query/HTTP entry is:

[retention] 2015/12/23 19:14:25 retention policy shard deletion check commencing
[retention] 2015/12/23 19:14:25 retention policy enforcement check commencing
[tsm1] 2015/12/23 19:15:21 compacted 22 tsm into 1 files in 3m6.078521546s
[tsm1] 2015/12/23 19:15:21 compacting 4 TSM files
[tsm1] 2015/12/23 19:15:42 compacted 4 tsm into 1 files in 20.213362013s
[tsm1] 2015/12/23 19:16:22 compacting 3 TSM files
[tsm1] 2015/12/23 19:16:52 compacted 3 tsm into 1 files in 30.479634201s
[tsm1] 2015/12/23 19:16:52 compacting 3 TSM files
[retention] 2015/12/23 19:44:25 retention policy enforcement check commencing
[retention] 2015/12/23 19:44:25 retention policy shard deletion check commencing
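For context on the UDP injection mentioned above, a push is just a point in InfluxDB line protocol sent to one of the listener ports; a minimal sketch follows (the measurement, tag, and host here are illustrative, not our real metrics; 4340 is one of the configured listener ports):

```python
# Minimal sketch of a single UDP write in InfluxDB line protocol.
# Measurement, tag and host are illustrative; 4340 is one of the [[udp]] listeners.
import socket
import time

# line protocol: <measurement>,<tags> <fields> <timestamp in nanoseconds>
point = "cpu_load,host=web01 value=0.64 {}".format(int(time.time() * 1e9))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(point.encode("utf-8"), ("127.0.0.1", 4340))
sock.close()
```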

Are there any other graphs or stats you would find useful? Do you have a Grafana dashboard that uses _internal that I could copy and use?

[screenshot: 2015-12-23 at 11:48:52 am]

[screenshot: 2015-12-23 at 11:47:02 am]
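On the _internal question: with store-enabled = true in the [monitor] section of the config (posted below), the server writes its own stats into the _internal database, which can be explored through the regular HTTP query API on :8086 before building a dashboard. A rough sketch (the endpoint is standard, but which measurements exist varies by version, so this just lists them):

```python
# Rough sketch: list the measurements InfluxDB keeps in its _internal
# monitoring database via the HTTP query API (port from the [http] section).
# Which measurements exist depends on the server version.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"db": "_internal", "q": "SHOW MEASUREMENTS"})
with urllib.request.urlopen("http://localhost:8086/query?" + params) as resp:
    print(json.dumps(json.load(resp), indent=2))
```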

@pfischermx
Author

Forgot to mention that this is our InfluxDB config:

reporting-disabled = true

[meta]
  dir = "/usr/local/var/influxdb/meta"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = false

[data]
  dir = "/usr/local/var/influxdb/data"
  engine = "tsm1"
  max-wal-size = 1073741824
  wal-flush-interval = "2m0s"
  wal-max-memory-size-threshold = 524288000
  wal-partition-flush-delay = "2s"
  wal-dir = "/usr/local/var/influxdb/wal"
  wal-logging-enabled = true
  wal-ready-series-size = 30720
  wal-compaction-threshold = 0.5
  wal-max-series-size = 1048576
  wal-flush-cold-interval = "5s"
  wal-partition-size-threshold = 20971520

[cluster]
  force-remote-mapping = false
  write-timeout = "5s"
  shard-writer-timeout = "5s"
  shard-mapper-timeout = "5s"

[retention]
  enabled = true
  check-interval = "30m0s"

[shard-precreation]
  enabled = true
  check-interval = "10m0s"
  advance-period = "30m0s"

[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/usr/local/conf/influxdb/ssl/influx.pem"

[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  https-certificate = "/usr/local/conf/influxdb/ssl/influx.pem"

[collectd]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "10s"
  typesdb = "/usr/share/collectd/types.db"

[opentsdb]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = ""
  consistency-level = "one"
  tls-enabled = false
  certificate = "/usr/local/conf/influxdb/ssl/influx.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"

[continuous_queries]
  log-enabled = true
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m0s"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m0s"

[hinted-handoff]
  enabled = true
  dir = "/usr/local/var/influxdb/hh"
  max-size = 1073741824
  max-age = "168h0m0s"
  retry-rate-limit = 0
  retry-interval = "1s"

[[udp]]
enabled = true
bind-address = ":4340"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4341"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4342"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4343"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4344"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4345"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4346"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4347"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4348"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4349"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4440"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4441"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4442"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4443"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4444"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4445"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4446"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4447"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4448"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4449"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"


@daviesalex
Contributor

We saw something very similar to this. You generally won't see error logs if you are dropping packets. Do you see drops in the kernel (e.g. via ss)? We did when we went digging.

You may want to review #5201 and/or move to a nightly build from yesterday onwards. Combined with #5217 and moving our WAL onto a very high-performance IO device, we believe we have eliminated drops.
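As a complement to ss, the kernel's per-protocol UDP counters in /proc/net/snmp show the same drops; a small Linux-only sketch (the counter names are the standard kernel ones):

```python
# Sketch: read the kernel's UDP counters from /proc/net/snmp (Linux only).
# Rising InErrors / RcvbufErrors while the InfluxDB log stays quiet usually
# means the listener's receive buffer is overflowing; net.core.rmem_max and
# net.core.rmem_default are the sysctls typically raised in response.
def udp_counters(path="/proc/net/snmp"):
    with open(path) as f:
        udp_lines = [line.split() for line in f if line.startswith("Udp:")]
    header, values = udp_lines[0][1:], udp_lines[1][1:]  # first Udp: line is the header
    return dict(zip(header, (int(v) for v in values)))

if __name__ == "__main__":
    stats = udp_counters()
    print("InErrors:", stats.get("InErrors"), "RcvbufErrors:", stats.get("RcvbufErrors"))
```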

@sebito91
Contributor

I concur with @daviesalex's recommendation, especially given the sheer number of UDP endpoints you have. Out of curiosity, are you using Telegraf as your metrics engine or a custom generator?

@pfischermx
Author

Thanks. I'll give the nightly a try and increase our UDP payload in the config (we already increased our UDP buffers via sysctl).

As for Telegraf, no, I have a Perl script that pulls data from Ganglia and then pushes it to InfluxDB. It has been working fairly well: each Ganglia host is handled by its own forked process, and the metrics are sent to InfluxDB with a hash-based load-balancing algorithm (keyed just on the metric name).
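A minimal sketch of that load-balancing idea (the real injector is Perl; the CRC32 hash here is illustrative and the Perl script may hash differently, but the port list matches the [[udp]] blocks above):

```python
# Sketch of the hash load-balancing idea: derive the UDP port from the
# metric name so a given metric always lands on the same listener.
import zlib

UDP_PORTS = list(range(4340, 4350)) + list(range(4440, 4450))  # matches the [[udp]] config

def port_for(metric_name):
    return UDP_PORTS[zlib.crc32(metric_name.encode("utf-8")) % len(UDP_PORTS)]

for name in ("cpu_user", "cpu_system", "mem_free", "bytes_in"):
    print(name, "->", port_for(name))
```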

@jwilder
Contributor

jwilder commented Jan 21, 2016

Are you still seeing this with 0.10 nightlies?

jwilder closed this as completed Mar 21, 2016