
0.9.6.1 shows gaps #5214

Closed
pfischermx opened this issue Dec 23, 2015 · 5 comments

@pfischermx

First off, I think this version has been pretty stable so far, congrats for that.

However, since we upgraded to this version we have been seeing temporary gaps that then disappear on their own.

As a reminder, we push metrics via UDP (we have a few UDP ports and hash our metrics across them). I looked into our injection logs and they do not show any gap on the write side (the injection code also has not changed in recent weeks), and the same goes for the InfluxDB logs. The closest thing to a non-query/HTTP entry is:

[retention] 2015/12/23 19:14:25 retention policy shard deletion check commencing
[retention] 2015/12/23 19:14:25 retention policy enforcement check commencing
[tsm1] 2015/12/23 19:15:21 compacted 22 tsm into 1 files in 3m6.078521546s
[tsm1] 2015/12/23 19:15:21 compacting 4 TSM files
[tsm1] 2015/12/23 19:15:42 compacted 4 tsm into 1 files in 20.213362013s
[tsm1] 2015/12/23 19:16:22 compacting 3 TSM files
[tsm1] 2015/12/23 19:16:52 compacted 3 tsm into 1 files in 30.479634201s
[tsm1] 2015/12/23 19:16:52 compacting 3 TSM files
[retention] 2015/12/23 19:44:25 retention policy enforcement check commencing
[retention] 2015/12/23 19:44:25 retention policy shard deletion check commencing
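For context on the UDP injection mentioned above, a push is just a point in InfluxDB line protocol sent to one of the listener ports; a minimal sketch follows (the measurement, tag, and host here are illustrative, not our real metrics; 4340 is one of the configured listener ports):

```python
# Minimal sketch of a single UDP write in InfluxDB line protocol.
# Measurement, tag and host are illustrative; 4340 is one of the [[udp]] listeners.
import socket
import time

# line protocol: <measurement>,<tags> <fields> <timestamp in nanoseconds>
point = "cpu_load,host=web01 value=0.64 {}".format(int(time.time() * 1e9))

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(point.encode("utf-8"), ("127.0.0.1", 4340))
sock.close()
```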

Are there any other graphs or stats you would find useful? Do you have a Grafana dashboard that uses _internal that I could copy and use?

[screenshot: 2015-12-23 at 11:48:52 am]

[screenshot: 2015-12-23 at 11:47:02 am]
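On the _internal question: with store-enabled = true in the [monitor] section of the config (posted below), the server writes its own stats into the _internal database, which can be explored through the regular HTTP query API on :8086 before building a dashboard. A rough sketch (the endpoint is standard, but which measurements exist varies by version, so this just lists them):

```python
# Rough sketch: list the measurements InfluxDB keeps in its _internal
# monitoring database via the HTTP query API (port from the [http] section).
# Which measurements exist depends on the server version.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"db": "_internal", "q": "SHOW MEASUREMENTS"})
with urllib.request.urlopen("http://localhost:8086/query?" + params) as resp:
    print(json.dumps(json.load(resp), indent=2))
```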

@pfischermx
Author

Forgot to mention that this is our InfluxDB config:

reporting-disabled = true

[meta]
  dir = "/usr/local/var/influxdb/meta"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"
  cluster-tracing = false

[data]
  dir = "/usr/local/var/influxdb/data"
  engine = "tsm1"
  max-wal-size = 1073741824
  wal-flush-interval = "2m0s"
  wal-max-memory-size-threshold = 524288000
  wal-partition-flush-delay = "2s"
  wal-dir = "/usr/local/var/influxdb/wal"
  wal-logging-enabled = true
  wal-ready-series-size = 30720
  wal-compaction-threshold = 0.5
  wal-max-series-size = 1048576
  wal-flush-cold-interval = "5s"
  wal-partition-size-threshold = 20971520

[cluster]
  force-remote-mapping = false
  write-timeout = "5s"
  shard-writer-timeout = "5s"
  shard-mapper-timeout = "5s"

[retention]
  enabled = true
  check-interval = "30m0s"

[shard-precreation]
  enabled = true
  check-interval = "10m0s"
  advance-period = "30m0s"

[admin]
  enabled = true
  bind-address = ":8083"
  https-enabled = false
  https-certificate = "/usr/local/conf/influxdb/ssl/influx.pem"

[monitor]
  store-enabled = true
  store-database = "_internal"
  store-interval = "10s"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false
  https-enabled = false
  https-certificate = "/usr/local/conf/influxdb/ssl/influx.pem"

[collectd]
  enabled = false
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "10s"
  typesdb = "/usr/share/collectd/types.db"

[opentsdb]
  enabled = false
  bind-address = ":4242"
  database = "opentsdb"
  retention-policy = ""
  consistency-level = "one"
  tls-enabled = false
  certificate = "/usr/local/conf/influxdb/ssl/influx.pem"
  batch-size = 1000
  batch-pending = 5
  batch-timeout = "1s"

[continuous_queries]
  log-enabled = true
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m0s"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m0s"

[hinted-handoff]
  enabled = true
  dir = "/usr/local/var/influxdb/hh"
  max-size = 1073741824
  max-age = "168h0m0s"
  retry-rate-limit = 0
  retry-interval = "1s"

[[udp]]
enabled = true
bind-address = ":4340"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4341"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4342"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4343"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4344"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4345"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4346"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4347"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4348"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4349"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4440"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4441"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4442"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4443"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4444"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4445"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4446"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4447"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4448"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"

[[udp]]
enabled = true
bind-address = ":4449"
database = "metrics"
batch-size = 1000
batch-pending = 5
batch-timeout = "1s"


@daviesalex
Contributor

We saw something very similar to this. You generally won't see error logs if you are dropping packets. Do you see drops in the kernel (e.g. via ss)? We did when we went digging.

You may want to review #5201 and/or move to a nightly build from yesterday onwards. Combined with #5217 and moving our WAL onto a very high-performance IO device, we believe we have eliminated drops.
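As a complement to ss, the kernel's per-protocol UDP counters in /proc/net/snmp show the same drops; a small Linux-only sketch (the counter names are the standard kernel ones):

```python
# Sketch: read the kernel's UDP counters from /proc/net/snmp (Linux only).
# Rising InErrors / RcvbufErrors while the InfluxDB log stays quiet usually
# means the listener's receive buffer is overflowing; net.core.rmem_max and
# net.core.rmem_default are the sysctls typically raised in response.
def udp_counters(path="/proc/net/snmp"):
    with open(path) as f:
        udp_lines = [line.split() for line in f if line.startswith("Udp:")]
    header, values = udp_lines[0][1:], udp_lines[1][1:]  # first Udp: line is the header
    return dict(zip(header, (int(v) for v in values)))

if __name__ == "__main__":
    stats = udp_counters()
    print("InErrors:", stats.get("InErrors"), "RcvbufErrors:", stats.get("RcvbufErrors"))
```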

@sebito91
Contributor

I concur with @daviesalex's recommendation, especially given the sheer number of UDP endpoints you have. Out of curiosity, are you using Telegraf as your metrics engine or a custom generator?

@pfischermx
Author

Thanks. I'll give the nightly a try and increase our UDP payload in the config (we already increased our UDP buffers via sysctl).

As for Telegraf, no, I have a Perl script that pulls data from Ganglia and then pushes it to InfluxDB. It has been working fairly well: each Ganglia host is handled by its own forked process, and the metrics are sent to InfluxDB with a hash-based load-balancing algorithm (keyed just on the metric name).
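A minimal sketch of that load-balancing idea (the real injector is Perl; the CRC32 hash here is illustrative and the Perl script may hash differently, but the port list matches the [[udp]] blocks above):

```python
# Sketch of the hash load-balancing idea: derive the UDP port from the
# metric name so a given metric always lands on the same listener.
import zlib

UDP_PORTS = list(range(4340, 4350)) + list(range(4440, 4450))  # matches the [[udp]] config

def port_for(metric_name):
    return UDP_PORTS[zlib.crc32(metric_name.encode("utf-8")) % len(UDP_PORTS)]

for name in ("cpu_user", "cpu_system", "mem_free", "bytes_in"):
    print(name, "->", port_for(name))
```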

@jwilder
Contributor

jwilder commented Jan 21, 2016

Are you still seeing this with 0.10 nightlies?

jwilder closed this as completed Mar 21, 2016