InfluxDB 0.10.1 with Telegraf 0.10.3 udp failing under increased load #5788
Hard to say off-hand what the issue is, but your snapshot memory size is probably too high.
@allen13 if you are willing to run the code currently in master (i.e. this isn't a critical production system), master now includes #5758, which implements the cache throughput metrics described in #5499. Comparing the cache throughput with the compacted bytes per second (see the explanation at the bottom of #5499) may give you some idea of whether your disks are falling behind your inbound traffic. That said, I'd agree with @jwilder that your snapshot memory size is probably too high.

Again, only if you can afford to lose data, it might be informative to take a stack dump of the server while it is under high UDP load. If there is a bottleneck somewhere in that path, it may be obvious from an analysis of the stack dumps. To obtain a stack dump, send the influxd process a SIGQUIT (`kill -QUIT <pid>`). Note that this will stop the server, and there is a small risk of data loss (i.e. the risk that the software designed to minimize data loss isn't working 100% reliably).
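As a sketch of the stack-dump step above (assuming influxd is running locally and you can find its PID, e.g. with `pgrep influxd` -- the PID below is a placeholder), the SIGQUIT can be sent from Python as well as from the shell. The Go runtime reacts to SIGQUIT by printing every goroutine's stack to stderr before exiting, so the stacks land wherever influxd's stderr is directed (journal or log file):

```python
import os
import signal

def dump_goroutines(pid: int) -> None:
    """Send SIGQUIT to a Go process such as influxd.

    The Go runtime responds by dumping all goroutine stacks to
    stderr and then exiting, so only do this on a server whose
    data you can afford to lose.
    """
    os.kill(pid, signal.SIGQUIT)

# Example (hypothetical PID; find the real one with `pgrep influxd`):
# dump_goroutines(12345)
```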
BTW @allen13, if you do switch to a master build, make sure to back up first, particularly the meta directory, because of this issue (#5772), and don't do it if you can't afford the risk of data loss.

Update: just in case there is confusion, I don't speak for Influx; I am just an interested user who contributed some of the cache statistics code.
Data loss is not an issue; I have wiped the whole /var/lib/influxdb a few times during troubleshooting. I'll definitely set the snapshot size back to default. Trying -QUIT next, master after that.
Couldn't find anything obviously wrong in the -QUIT stack dump. I can't spend any more time on debugging, so I'm switching back to HTTP until UDP becomes more stable and performant. Thanks for the help!
@allen13 I am not 100% sure, but it might be hard for the UDP interface to ever perform as well as HTTP, because the number of points that fit in a legal UDP packet is much smaller than what can be streamed in an HTTP payload. What you might be seeing is simply the cost of fragmenting data into UDP packets. If that is the problem, then putting a batching UDP gateway between the UDP source and the influx HTTP endpoint might let you take advantage of influx's HTTP performance and tune the memory associated with aggregating UDP packets much better. Of course, you would probably need to write some code to achieve this. That said, I have zero experience with UDP and influx, so I can't really comment on how influx's UDP server performs; take everything I just said with a large grain of salt.
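A minimal sketch of the batching-gateway idea above, assuming Telegraf keeps sending line protocol over UDP (port 8089 here) and InfluxDB's HTTP `/write` endpoint is at `localhost:8086` with database `telegraf` -- all of those values are placeholders to adapt, and a real gateway would also need error handling and retries:

```python
import socket
import urllib.request

def batch(lines, max_points=5000):
    """Group line-protocol lines into newline-joined bodies of at most max_points lines each."""
    bodies, current = [], []
    for line in lines:
        current.append(line)
        if len(current) >= max_points:
            bodies.append("\n".join(current))
            current = []
    if current:
        bodies.append("\n".join(current))
    return bodies

def run_gateway(listen=("0.0.0.0", 8089),
                write_url="http://localhost:8086/write?db=telegraf",
                max_points=5000, flush_seconds=1.0):
    """Receive UDP line protocol and re-send it as batched HTTP POSTs."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(listen)
    sock.settimeout(flush_seconds)  # flush pending points at least this often
    pending = []
    while True:
        timed_out = False
        try:
            datagram, _addr = sock.recvfrom(65536)
            pending.extend(datagram.decode("utf-8", "replace").splitlines())
        except socket.timeout:
            timed_out = True
        # Flush when a full batch has accumulated or the flush interval expired.
        if pending and (timed_out or len(pending) >= max_points):
            for body in batch(pending, max_points):
                urllib.request.urlopen(write_url, data=body.encode("utf-8"))
            pending = []

# To start the gateway (blocks forever):
# run_gateway()
```

The point of the design is that many small UDP datagrams are coalesced into a few large HTTP writes, which is the shape of traffic the HTTP endpoint handles well.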
I am currently using influxdb/telegraf to monitor two environments of differing size: one has about 80 servers, the other 250. UDP is failing in the larger environment for unknown reasons. The server itself has 56 cores and RAIDed SSDs, in addition to bonded 10Gb NICs. The only error indicator I could find was the batchesTxFail metric in the _internal db, which rises steadily at about 50 per minute.
Things that have been ruled out:
- Network: even the local Telegraf instance can't write to it
- Telegraf: running `nc -lu 8089` while influxdb is stopped shows Telegraf writing the proper metrics
- Memory: the machine has 256GB. These values have also been set in the config:
  - cache-max-memory-size = 128849018880 (120 GiB)
  - cache-snapshot-memory-size = 107374182400 (100 GiB)
- UDP read buffer: the sysctl and the config are both set to read-buffer = 1073741824 (1 GiB)
- Any other part of InfluxDB aside from UDP: switching to HTTP works great
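The parenthetical sizes above can be sanity-checked with a few lines of arithmetic (the setting names are the ones quoted from the config; the conversion assumes 1 GiB = 2^30 bytes):

```python
GIB = 1 << 30  # bytes per gibibyte (2**30)

# Byte values copied from the config quoted above.
settings = {
    "cache-max-memory-size": 128849018880,       # expected 120 GiB
    "cache-snapshot-memory-size": 107374182400,  # expected 100 GiB
    "read-buffer": 1073741824,                   # expected 1 GiB
}

for name, value in settings.items():
    print(f"{name}: {value} bytes = {value // GIB} GiB")
```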
HTTP may be the more reliable way to go for now, but it would be nice to reduce resource usage with UDP.