
[0.9.2-nightly] Becomes unresponsive after a small number of insertions #3346

Closed
eoltean opened this issue Jul 16, 2015 · 19 comments

@eoltean

eoltean commented Jul 16, 2015

I have the latest nightly of InfluxDB running on a machine with 2 GB of RAM, an SSD drive, and CentOS 7 x64. After adding only 402 points at 3 points per second, the influxdb service becomes unresponsive. No errors are thrown in the logs. If I stop and restart the influxdb service at this point, I can add anywhere from 2 - 10 more points and then InfluxDB becomes unresponsive again.

Looking at the times in the logs, the failed write exceeded the 5 second timeout, which is why the DB becomes unresponsive.

The config file is the default config, and I have four continuous queries that run every 10m, 30m, 1h, and 1w, taking the mean of the data over those time periods. If I disable the continuous queries, InfluxDB seems to stay responsive for a longer period of time.

I am sorry if this is a duplicate issue, but I went through the issues list looking for a similar report and could not find one.

@gunnaraasen
Member

Thanks for the report @adventures91.

Are there any continuous query errors in the logs?

Can you provide examples of the continuous queries being run and the points being written?

@eoltean
Author

eoltean commented Jul 16, 2015

"CREATE CONTINUOUS QUERY temp_data_10m ON temp_statistics BEGIN SELECT mean(value) INTO "10m.temp_data" FROM temp_data WHERE time > now() - 10m GROUP BY time(10m), data_type, unit_serial END"

The other queries for 30m, 1h, and 1w are pretty much the same.

I am writing the data using the influxdb-python client and the data looks like this:
[{'fields': {'value': 32.0}, 'tags': {'unit_serial': u'1', 'data_type': 'temp'}, 'measurement': 'temp_data'}]
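
For reference, a minimal sketch of how such a write might look with influxdb-python; the connection details here are placeholders, not the actual setup from this report:

from influxdb import InfluxDBClient

# Placeholder connection details; adjust host/port/database to match the setup.
client = InfluxDBClient(host='localhost', port=8086, database='temp_statistics')

point = {
    'measurement': 'temp_data',
    'tags': {'unit_serial': '1', 'data_type': 'temp'},
    'fields': {'value': 32.0},
}

# write_points takes a list of point dicts; with no 'time' key the server
# assigns the timestamp when it receives the write.
client.write_points([point])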

The only continuous query errors are at the very beginning of the log file: "error: measurement not found". I assumed this was expected, since the measurements do not exist yet.

I can query the measurements created by the continuous queries while the DB is still responsive and get the correct results.

When running the same InfluxDB server on my local computer, which has 8 GB of RAM, an SSD, and Fedora 20, and running the same queries, I have not yet had InfluxDB become unresponsive, although I have not run this test for very long on my local machine.

@beckettsean
Contributor

@adventures91 as of 0.9.1 the bug requiring a WHERE time clause for CQs was fixed. Because the CQ parser dynamically inserts the correct WHERE time clause at runtime, you should not provide one when creating the CQ. I don't know if that's the actual issue, but it makes sense to clean up the potential syntax issues before we dig too much further.

Can you paste the output of SHOW CONTINUOUS QUERIES?

@eoltean
Author

eoltean commented Jul 16, 2015

temp_data_10m:  "CREATE CONTINUOUS QUERY  temp_data_10m ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."10m.temp_data" FROM "temp_statistics"."default".temp_data WHERE time > now() - 10m GROUP BY time(10m), data_type, unit_serial END"

temp_data_30m:  "CREATE CONTINUOUS QUERY temp_data_30m ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."30m.temp_data" FROM "temp_statistics"."default".temp_data WHERE time > now() - 30m GROUP BY time(30m), data_type, unit_serial END"

temp_data_1h:    "CREATE CONTINUOUS QUERY temp_data_1h ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."1h.temp_data" FROM "temp_statistics"."default".temp_data WHERE time > now() - 1h GROUP BY time(1h), data_type, unit_serial END"

temp_data_1w:    "CREATE CONTINUOUS QUERY temp_data_1w ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."1w.temp_data" FROM "temp_statistics"."default".temp_data WHERE time > now() - 1w GROUP BY time(1w), data_type, unit_serial END"

I can update the continuous queries and get back to you if the end result differs.

@eoltean
Author

eoltean commented Jul 16, 2015

OK, here is the list of updated continuous queries:

temp_data_10m   "CREATE CONTINUOUS QUERY temp_data_10m ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."10m.temp_data " FROM "temp_statistics"."default".temp_data  GROUP BY time(10m), data_type, unit_serial END"

temp_data_30m   "CREATE CONTINUOUS QUERY temp_data_30m ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."30m.temp_data " FROM "temp_statistics"."default".temp_data  GROUP BY time(30m), data_type, unit_serial END"

temp_data_1h    "CREATE CONTINUOUS QUERY temp_data_1h ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."1h.temp_data " FROM "temp_statistics"."default".temp_data  GROUP BY time(1h), data_type, unit_serial END"

temp_data_1w    "CREATE CONTINUOUS QUERY temp_data_1w ON temp_statistics BEGIN SELECT mean(value) INTO "temp_statistics"."default"."1w.temp_data " FROM "temp_statistics"."default".temp_data  GROUP BY time(1w), data_type, unit_serial END"

I'm still having the same issue: I get close to around 400 writes and then the DB becomes unresponsive.

@beckettsean
Contributor

Thanks for cleaning up the CQs, @adventures91. The new ones have correct syntax so we can likely rule out any issues with that.

The only continuous query errors are at the very beginning of the log file: "error: measurement not found". I assumed this was expected, since the measurements do not exist yet.

Correct, that's expected when CQs try to downsample a measurement that has no data yet.

I am writing the data using the influxdb-python client and the data looks like this:
[{'fields': {'value': 32.0}, 'tags': {'unit_serial': u'1', 'data_type': 'temp'}, 'measurement': 'temp_data'}]

I think this might be a significant contributor to the issue. The JSON protocol is known to have performance issues, hence its deprecation in favor of the line protocol. If the python client is inserting JSON and is not batching points, that's creating the worst possible write performance. That, coupled with CQs downsampling, might be leading to resource constraints.
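
For comparison, a minimal sketch of what the same point looks like as a single line-protocol write against the HTTP /write endpoint; the host and database name are placeholders:

import requests

# The JSON-style point above, expressed in line protocol:
#   measurement,tag_key=tag_value field_key=field_value
line = 'temp_data,data_type=temp,unit_serial=1 value=32.0'

resp = requests.post('http://localhost:8086/write',
                     params={'db': 'temp_statistics'},
                     data=line)
resp.raise_for_status()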

Does the box seem to be starved for RAM or CPU when the process becomes non-responsive?

Are there significant WAL flushes in the logs? E.g. [shard] 2015/07/16 12:06:16 flush 1260 points in 0.006s

@beckettsean
Contributor

One side note, given your CQ syntax you are downsampling your raw data into measurements in the same retention policy. If you intend to keep all data for the same length of time that's fine, but the canonical use case is to downsample into a different retention policy so that the raw data can be auto-expired sooner than the various downsampled measurements.
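
For illustration only, a sketch of what downsampling into a separate retention policy could look like; the policy name, duration, replication factor, and CQ name below are placeholders, not taken from this thread:

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='temp_statistics')

# Hypothetical long-lived policy for the downsampled data (name and duration
# are placeholders).
client.query('CREATE RETENTION POLICY "one_year" ON "temp_statistics" '
             'DURATION 52w REPLICATION 1')

# Downsample from the default policy into the longer-lived one, so the raw
# data in "default" can be expired sooner.
client.query('CREATE CONTINUOUS QUERY temp_data_10m_downsample ON temp_statistics BEGIN '
             'SELECT mean(value) INTO "temp_statistics"."one_year"."10m.temp_data" '
             'FROM "temp_statistics"."default".temp_data '
             'GROUP BY time(10m), data_type, unit_serial END')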

@eoltean
Author

eoltean commented Jul 16, 2015

If the python client is inserting JSON and is not batching points, that's creating the worst possible write performance. That, coupled with CQs downsampling, might be leading to resource constraints.

The python client is not using the line protocol currently; however, line protocol support was just added, so I think I can switch my application over to it without too much trouble. I will do this and see if it improves the situation.

Does the box seem to be starved for RAM or CPU when the process becomes non-responsive?

No, RAM and CPU usage seem to be fine, but I will double check that.

Are there significant WAL flushes in the logs?

There are WAL flushes, but they only happen when I restart the server, and I usually get something like:

[metastore] 2015/07/16 13:09:41 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/16 13:09:41 read local node id: 1
[shard] 2015/07/16 13:09:41 flush 2 points in 0.001s
[shard] 2015/07/16 13:09:41 flush 4 points in 0.002s
[shard] 2015/07/16 13:09:41 flush 2 points in 0.001s
[shard] 2015/07/16 13:09:41 flush 4 points in 0.001s
[httpd] 2015/07/16 13:09:41 listening on HTTP: [::]:8086

However, the WAL flush interval is every 10m and the DB becomes unresponsive before that amount of time passes.

One side note, given your CQ syntax you are downsampling your raw data into measurements in the same retention policy.

As far as downsampling goes, thanks for the heads up. I was planning on it, but since these are mainly just tests I haven't done it yet.

Thanks for all the feedback and help. I will switch over to line protocol, double check my RAM and CPU usage, and get back to you.

@eoltean
Author

eoltean commented Jul 16, 2015

@beckettsean
OK, so I switched over to using the line protocol and am still having the same issue: I get to around 400 writes and the DB locks up.

Also, here is the memory on my machine:

MemTotal:        2050060 kB
MemFree:          440420 kB
MemAvailable:    1613304 kB

CPU usage never got above 2% and there is plenty of disk space.

Also, even if the server is stopped and started again, and even if I don't write any data to it and just make query requests, it eventually locks up. If I do write data to it right after a restart, I can only do about 2 - 10 writes. The only way I can write the initial 400 points again is to delete the database and recreate it.

@beckettsean
Contributor

@adventures91 can you give a small sample of your write data? I'm trying to repro on my own CentOS box.

@eoltean
Author

eoltean commented Jul 16, 2015

@beckettsean

curl -i -XPOST 'http://localhost:8086/write?db=temp_statistics' --data-binary 'temp_data,data_type=temp,unit_serial=1 value=80.0'

I am just running the equivalent of this curl with the python client, with the value being a random number between 23 and 100, writing 3 data points every second; each of the 3 data points has a different unit_serial of 1, 2, or 3.
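
A minimal sketch of that write loop (one HTTP request per point, three points per second; the host and database are placeholders):

import random
import time
import requests

# Mirror the curl above: every second, write one point per unit, with a
# random value between 23 and 100 and no batching.
while True:
    for serial in ('1', '2', '3'):
        value = random.uniform(23, 100)
        requests.post('http://localhost:8086/write',
                      params={'db': 'temp_statistics'},
                      data='temp_data,data_type=temp,unit_serial=%s value=%.1f'
                           % (serial, value))
    time.sleep(1)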

@eoltean
Author

eoltean commented Jul 21, 2015

@beckettsean I tried modifying some of the continuous query settings and also updated to the latest nightly. I changed "recompute-previous-n" to "0", and so far so good; at least for the test I was running, I have yet to get a lockup.

Edit: it wrote 4735 points before locking up again.
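
For reference, the setting mentioned above lives in the continuous query section of influxdb.conf; a sketch, assuming the 0.9.x [continuous_queries] section name (only the changed key is shown):

# influxdb.conf (section name assumed for 0.9.x; only the changed key shown)
[continuous_queries]
  recompute-previous-n = 0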

@beckettsean
Contributor

Thanks for the additional data, @adventures91. Batching the points should greatly increase your performance, but that would also necessitate using explicit timestamps, since otherwise you'd be generating a lot of duplicate points per batch. It's just a POC so no need to try very hard to optimize it, but we do want to figure out why the write performance degrades so significantly even with non-optimized throughput.
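
A hedged sketch of what a batched write with explicit timestamps might look like via influxdb-python; the batch size, interval, and connection details are placeholders:

import datetime
from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', port=8086, database='temp_statistics')

# Collect several readings for the same series with explicit, distinct
# timestamps, then send them in one request.
start = datetime.datetime.utcnow()
batch = []
for i in range(10):
    batch.append({
        'measurement': 'temp_data',
        'tags': {'unit_serial': '1', 'data_type': 'temp'},
        'time': (start + datetime.timedelta(seconds=i)).strftime('%Y-%m-%dT%H:%M:%SZ'),
        'fields': {'value': 32.0 + i},
    })

# One HTTP request for the whole batch; without the explicit 'time' keys these
# points would all receive the same server-assigned timestamp and overwrite
# each other.
client.write_points(batch)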

0.9.2 final will drop later today. Please install that and try your stress test again; meanwhile, I'll see if the core devs have any ideas on what might be at issue here.

@eoltean
Author

eoltean commented Jul 23, 2015

@beckettsean OK, thanks for the help and feedback. I will install v0.9.2 when it is released and see if the situation improves.

@eoltean
Author

eoltean commented Jul 24, 2015

@beckettsean I haven't run v0.9.2 yet since it is still a release candidate, but I ran the latest nightly and again got much further; InfluxDB still locked up given enough time. This time it wrote 13087 points before locking up.

In the logs I got timeout errors when it tries to run the 30-minute continuous query.

@beckettsean
Contributor

Thanks, @adventures91. We've found some locking issues with high write volume and periodic queries that seem to be causing this situation. It wasn't simple enough to fix before releasing 0.9.2, but we should get a handle on it soon. Thanks for the testing; for now I think we have all the data we need, especially since we have a repro case. Although your write volume isn't high, the write frequency is (3 per second) and that seems to lead to the deadlocking problem.

@beckettsean beckettsean added this to the 0.9.3 milestone Jul 24, 2015
@beckettsean beckettsean modified the milestones: 0.9.4, 0.9.3 Aug 6, 2015
@eoltean
Author

eoltean commented Aug 21, 2015

@beckettsean After a lot of the locking bug fixes and moving the WAL out of BoltDB, it seems that this issue is resolved for me, at least with this test. I have written 80k points so far without any problems. Did you want me to close this issue or keep it open for now?

@beckettsean
Contributor

@adventures91 thanks much for the update. Let's leave this open for a few more days and I'll close it next week if no one else runs into this issue.

@otoolep
Contributor

otoolep commented Sep 9, 2015

Sounds like we can close this, @beckettsean?
