InfluxDB 0.10.1 with Telegraf 0.10.3 udp failing under increased load #5788
Hard to say off-hand what the issue is, but your snapshot memory size is probably too high.
@allen13 if you are willing to run the code currently in master (i.e. this isn't a critical production system), master now includes #5758, which implements the cache throughput metrics described in #5499. Comparing the cache throughput with the compacted bytes per second (see the explanation at the bottom of #5499) may give you some idea of whether your disks are falling behind your inbound traffic. That said, I'd agree with @jwilder that your snapshot memory size is probably too high.

Again, only if you can afford to lose data, it might be informative to take a stack dump of the server while it is under high UDP load. If there is a bottleneck somewhere in that path, it may be obvious from an analysis of the stack dumps. To obtain a stack dump, send the influxd process a SIGQUIT (`kill -QUIT <pid>`). Note that this will stop the server, and there is a small risk of data loss (i.e. the risk that the software designed to minimize data loss isn't working 100% reliably).
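As a sketch of the stack-dump step above (assuming influxd is running locally and you can find its PID, e.g. with `pgrep influxd` -- the PID below is a placeholder), the SIGQUIT can be sent from Python as well as from the shell. The Go runtime reacts to SIGQUIT by printing every goroutine's stack to stderr before exiting, so the stacks land wherever influxd's stderr is directed (journal or log file):

```python
import os
import signal

def dump_goroutines(pid: int) -> None:
    """Send SIGQUIT to a Go process such as influxd.

    The Go runtime responds by dumping all goroutine stacks to
    stderr and then exiting, so only do this on a server whose
    data you can afford to lose.
    """
    os.kill(pid, signal.SIGQUIT)

# Example (hypothetical PID; find the real one with `pgrep influxd`):
# dump_goroutines(12345)
```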
BTW @allen13, if you do switch to a master build, make sure to back up first, particularly the meta directory, because of this issue (#5772), and don't do it if you can't afford the risk of data loss.

Update: just in case there is confusion, I don't speak for Influx; I am just an interested user who contributed some of the cache statistics code.
Data loss is not an issue; I have wiped the whole /var/lib/influxdb a few times during troubleshooting. I'll definitely set the snapshot size back to default. Trying -QUIT next, master after that.
Couldn't find anything obviously wrong in the -QUIT stack dump. I can't spend any more time on debugging, so I'm switching back to HTTP until UDP becomes more stable and performant. Thanks for the help!
@allen13 I am not 100% sure, but it might be hard for the UDP interface to ever perform as well as HTTP, because the number of points that fit in a legal UDP packet is much smaller than what can be streamed in an HTTP payload. What you might be seeing is simply the cost of fragmenting data into UDP packets. If that is the problem, then putting a batching UDP gateway between the UDP source and the influx HTTP endpoint might let you take advantage of influx's HTTP performance and tune the memory associated with aggregating UDP packets much better. Of course, you would probably need to write some code to achieve this. That said, I have zero experience with UDP and influx, so I can't really comment on how influx's UDP server performs; take everything I just said with a large grain of salt.
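A minimal sketch of the batching-gateway idea above, assuming Telegraf keeps sending line protocol over UDP (port 8089 here) and InfluxDB's HTTP `/write` endpoint is at `localhost:8086` with database `telegraf` -- all of those values are placeholders to adapt, and a real gateway would also need error handling and retries:

```python
import socket
import urllib.request

def batch(lines, max_points=5000):
    """Group line-protocol lines into newline-joined bodies of at most max_points lines each."""
    bodies, current = [], []
    for line in lines:
        current.append(line)
        if len(current) >= max_points:
            bodies.append("\n".join(current))
            current = []
    if current:
        bodies.append("\n".join(current))
    return bodies

def run_gateway(listen=("0.0.0.0", 8089),
                write_url="http://localhost:8086/write?db=telegraf",
                max_points=5000, flush_seconds=1.0):
    """Receive UDP line protocol and re-send it as batched HTTP POSTs."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(listen)
    sock.settimeout(flush_seconds)  # flush pending points at least this often
    pending = []
    while True:
        timed_out = False
        try:
            datagram, _addr = sock.recvfrom(65536)
            pending.extend(datagram.decode("utf-8", "replace").splitlines())
        except socket.timeout:
            timed_out = True
        # Flush when a full batch has accumulated or the flush interval expired.
        if pending and (timed_out or len(pending) >= max_points):
            for body in batch(pending, max_points):
                urllib.request.urlopen(write_url, data=body.encode("utf-8"))
            pending = []

# To start the gateway (blocks forever):
# run_gateway()
```

The point of the design is that many small UDP datagrams are coalesced into a few large HTTP writes, which is the shape of traffic the HTTP endpoint handles well.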
I am currently using influxdb/telegraf to monitor two environments of differing size: one has about 80 servers, the other 250. UDP is failing in the larger environment for unknown reasons. The server itself has 56 cores and RAIDed SSDs, in addition to bonded 10Gb NICs. The only error indicator I could find was the batchesTxFail metric in the _internal db, which rises steadily at about 50 per minute.
Things that have been ruled out:
- Network: even the local Telegraf instance can't write to it
- Telegraf: running `nc -lu 8089` while influxdb is stopped shows Telegraf writing the proper metrics
- Memory: the machine has 256GB. These values have also been set in the config:
  - cache-max-memory-size = 128849018880 (120 GiB)
  - cache-snapshot-memory-size = 107374182400 (100 GiB)
- UDP read buffer: the sysctl and the config are both set to read-buffer = 1073741824 (1 GiB)
- Any other part of InfluxDB aside from UDP: switching to HTTP works great
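The parenthetical sizes above can be sanity-checked with a few lines of arithmetic (the setting names are the ones quoted from the config; the conversion assumes 1 GiB = 2^30 bytes):

```python
GIB = 1 << 30  # bytes per gibibyte (2**30)

# Byte values copied from the config quoted above.
settings = {
    "cache-max-memory-size": 128849018880,       # expected 120 GiB
    "cache-snapshot-memory-size": 107374182400,  # expected 100 GiB
    "read-buffer": 1073741824,                   # expected 1 GiB
}

for name, value in settings.items():
    print(f"{name}: {value} bytes = {value // GIB} GiB")
```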
HTTP may be the more reliable way to go for now, but it would be nice to reduce resource usage with UDP.