[0.9.2-nightly] Becomes unresponsive after a small number of insertions #3346
Thanks for the report @adventures91. Are there any continuous query errors in the logs? Can you provide examples of the continuous queries being run and the points being written?
The other queries for 30m, 1h, and 1w are pretty much the same. I am writing the data using the influxdb-python client and the data looks like this: The only continuous query errors are at the very beginning of the log file, "error: measurement not found"; I assumed this was expected since the measurements do not exist yet. I can query the measurements created by the continuous queries while the db is still responsive and get the correct results. Running the same influxdb server and the same queries on my local computer, which has 8 GB of RAM, an SSD, and Fedora 20, I have not yet had influxdb become unresponsive, although I have not run this test for very long on my local machine.
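For reference, a single write of this shape through influxdb-python might look like the following sketch; the measurement, tag, and field names here are placeholders rather than the reporter's actual schema.

```python
from influxdb import InfluxDBClient

# Placeholder schema: a "sensor_data" measurement tagged by unit_serial,
# with a single float field "value".
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

point = [
    {
        "measurement": "sensor_data",
        "tags": {"unit_serial": "1"},
        "fields": {"value": 42.0},
    }
]

# At this point in time the client serialized writes as JSON by default.
client.write_points(point)
```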
@adventures91 as of 0.9.1 the bug requiring a Can you paste the output of |
I can update the continuous queries and get back to you if the end result differs.
OK, here is the list of updated continuous queries:
I'm still having the same issue: I get to around 400 writes and then the db becomes unresponsive.
Thanks for cleaning up the CQs, @adventures91. The new ones have correct syntax so we can likely rule out any issues with that.
Correct, that's expected when CQs try to downsample a measurement that has no data yet.
I think this might be a significant contributor to the issue. The JSON protocol is known to have performance issues, hence its deprecation in favor of the line protocol. If the python client is inserting JSON and is not batching points, that creates the worst possible write performance. That, coupled with the CQs downsampling, might be leading to resource constraints. Does the box seem to be starved for RAM or CPU when the process becomes non-responsive? Are there significant WAL flushes in the logs? E.g.
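For comparison with the JSON path, a line-protocol write is a plain text POST to the HTTP /write endpoint; a minimal sketch, assuming a local instance, a database named mydb, and placeholder series names:

```python
import requests

# One point per line in line protocol: measurement,tag=value field=value
# Names below are placeholders, not the reporter's schema.
lines = "\n".join(
    "sensor_data,unit_serial={} value={}".format(serial, reading)
    for serial, reading in [("1", 42.0), ("2", 57.5), ("3", 88.1)]
)

resp = requests.post(
    "http://localhost:8086/write",
    params={"db": "mydb", "precision": "s"},
    data=lines,
)
resp.raise_for_status()  # InfluxDB returns 204 No Content on success
```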
One side note: given your CQ syntax you are downsampling your raw data into measurements in the same retention policy. If you intend to keep all data for the same length of time that's fine, but the canonical use case is to downsample into a different retention policy so that the raw data can be auto-expired sooner than the various downsampled measurements.
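A minimal sketch of that pattern, assuming a database named mydb and placeholder retention-policy, measurement, and duration choices (none of these are taken from the reporter's setup):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# Keep raw data for a week, downsampled data for a year (placeholder durations).
client.query('CREATE RETENTION POLICY "raw_week" ON "mydb" '
             'DURATION 1w REPLICATION 1 DEFAULT')
client.query('CREATE RETENTION POLICY "downsampled_year" ON "mydb" '
             'DURATION 52w REPLICATION 1')

# Downsample into the longer-lived retention policy so raw points can expire sooner.
client.query(
    'CREATE CONTINUOUS QUERY "cq_10m" ON "mydb" BEGIN '
    'SELECT mean(value) INTO "mydb"."downsampled_year"."sensor_data_10m" '
    'FROM "sensor_data" GROUP BY time(10m) END'
)
```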
The python client is not using the line protocol currently; however, line protocol support was just added, so I think I can switch my application over to it without too much trouble. I will do this and see if it improves the situation.
No, RAM and CPU usage seem to be fine, but I will double-check that.
There are WAL flushes, but they only happen when I restart the server, and I usually get something like
However, the WAL flush interval is every 10m, and the db becomes unresponsive before that amount of time passes.
As far as downsampling goes, thanks for the heads up; I was planning on it, but since these are mainly just tests I haven't done it yet. Thanks for all the feedback and help. I will switch over to line protocol, double-check my RAM and CPU usage, and get back to you.
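A minimal sketch of that switch, assuming a client version that exposes a protocol keyword on write_points (that parameter name is an assumption about the client API, and the point layout is again a placeholder):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="mydb")

points = [
    {
        "measurement": "sensor_data",      # placeholder schema
        "tags": {"unit_serial": "1"},
        "fields": {"value": 42.0},
    }
]

# Ask the client to serialize to line protocol instead of the JSON body.
client.write_points(points, protocol="line")
```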
@beckettsean Also, here is the memory on my machine:
CPU usage never got above 2% and there is plenty of disk space. Also, even if the server is stopped and started again, and even if I don't write any data to it and just make query requests, it eventually locks up. And if I do write data to it after just a restart, I can only do about 2-10 writes. The only way I can write the initial 400 points again is to delete the database and recreate it.
@adventures91 can you give a small sample of your write data? I'm trying to repro on my own CentOS box.
I am just running the equivalent of this curl with the python-client, with the value being a random number between 23 and 100, writing 3 data points every second; each of the 3 data points has a different unit_serial of 1, 2, and 3.
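As an illustration of that write pattern (not the reporter's exact curl or client code), a sketch that sends three unbatched points per second with random values between 23 and 100 and unit_serial tags 1-3, using line protocol over HTTP with placeholder database and measurement names:

```python
import random
import time

import requests

WRITE_URL = "http://localhost:8086/write"  # assumed local instance

# 200 iterations * 3 points = 600 points, enough to pass the ~400 mark reported.
for _ in range(200):
    for serial in (1, 2, 3):
        # One unbatched point per request, mirroring the reported pattern.
        line = "sensor_data,unit_serial={} value={:.2f}".format(
            serial, random.uniform(23, 100)
        )
        requests.post(WRITE_URL, params={"db": "mydb"}, data=line)
    time.sleep(1)
```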
@beckettsean I tried modifying some of the continuous query settings and also updated to the latest nightly. I changed "recompute-previous-n" to "0", and so far so good; at least for the test I was running, I have yet to get a lockup. Edit: it wrote 4735 points before locking up again.
Thanks for the additional data, @adventures91. Batching the points should greatly increase your performance, but that would also necessitate using explicit timestamps, since otherwise you'd be generating a lot of duplicate points per batch. It's just a POC, so no need to try very hard to optimize it, but we do want to figure out why the write performance degrades so significantly even with non-optimized throughput. 0.9.2 final will drop later today. Please install that and try your stress test again; meanwhile, I'll see if the core devs have any ideas on what might be at issue here.
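To make the batching point concrete: points in the same series with the same timestamp overwrite each other, so a batched write needs a timestamp per line to avoid collapsing into duplicates. A sketch, again with placeholder names:

```python
import time

import requests

now = int(time.time())  # seconds precision, matching precision=s below

# Each line carries its own timestamp so points in the same series stay distinct.
batch = "\n".join(
    "sensor_data,unit_serial=1 value={:.2f} {}".format(23.0 + i, now - i)
    for i in range(10)
)

requests.post(
    "http://localhost:8086/write",
    params={"db": "mydb", "precision": "s"},
    data=batch,
)
```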
@beckettsean OK, thanks for the help and feedback. I will install v0.9.2 when it is released and see if the situation is improved.
@beckettsean I haven't run v0.9.2 yet since it is still a release candidate, but I ran the latest nightly and again got much further; influxdb still locked up given enough time. This time it wrote 13087 points before locking up. In the logs I got timeout errors when it tries to run the 30min continuous query.
Thanks, @adventures91. We've found some locking issues with high write volume and periodic queries that seem to be causing this situation. It wasn't simple enough to fix before releasing 0.9.2, but we should get a handle on it soon. Thanks for the testing; for now I think we have all the data we need, especially since we have a repro case. Although your write volume isn't high, the write frequency is (3 per second), and that seems to lead to the deadlocking problem.
@beckettsean After a lot of the locking bug fixes and moving the WAL out of BoltDB, it seems that this issue is resolved for me, at least with this test. I have written 80k points so far without any problems. Did you want me to close this issue or keep it open for now?
@adventures91 thanks much for the update. Let's leave this open for a few more days and I'll close it next week if no one else runs into this issue.
Sounds like we can close this, @beckettsean?
I have the latest nightly of influx running on a machine with 2 GB of RAM, SSD drives, and CentOS 7 x64. After adding only 402 points at 3 points per second, the influxdb service becomes unresponsive. No errors are thrown in the logs. If I stop and restart the influxdb service at this point, I can add anywhere from 2-10 more points and then influxdb becomes unresponsive again.
If I look at the times in the logs, the failed write exceeded the timeout of 5 seconds, which is why the db becomes unresponsive.
The config file is the default config, and I have continuous queries that run every 10m, 30m, 1h, and 1w, taking the mean of the data over those time periods. If I disable the continuous queries, influxdb seems to stay responsive for a longer period of time.
I am sorry if this is a duplicate issue, but I tried to go through the issues list to find a similar issue and could not find one.