Continuous Queries causing 500 Timeouts #3368

Closed
jhedlund opened this issue Jul 17, 2015 · 9 comments

Comments

@jhedlund

I have run into an issue where, after a minute or two of writes, I start getting 500 Timeout errors and the database becomes unresponsive. This is similar to issue #3199, though I am not using collectd, and also to issue #3362 (I am not crashing, but I may not be leaving my service up long enough to hit the out-of-memory crash).

I am writing over HTTP POST, in batches of 50 (though the problem occurred with smaller batches as well). I write about 1,000 data points per minute: I run the POSTs at the top of every minute, pushing the ~1,000 points in batches of 50.
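For anyone trying to reproduce this, the write pattern looks roughly like the sketch below. This is not my actual code: it assumes the 0.9 line-protocol /write endpoint, the "symbol" tag and generated values are placeholders, and only the database (prices), measurement (ticks), and field names (bid, ask, last) mirror the continuous query further down.

import requests

WRITE_URL = "http://localhost:8086/write"   # 0.9 HTTP write endpoint
DB = "prices"                               # database named in the CQ below
BATCH_SIZE = 50

def write_batches(lines):
    # POST line-protocol points in batches of BATCH_SIZE. A healthy write
    # returns 204; the failure mode described here is a 500 timeout.
    for i in range(0, len(lines), BATCH_SIZE):
        body = "\n".join(lines[i:i + BATCH_SIZE])
        resp = requests.post(WRITE_URL, params={"db": DB}, data=body)
        resp.raise_for_status()

# ~1,000 points per run, pushed at the top of every minute.
# (The CQ reads from the "hour" retention policy; I'm glossing over the
# retention-policy parameter here.)
points = [
    "ticks,symbol=SYM%d bid=%f,ask=%f,last=%f" % (n, 1.0 + n, 1.1 + n, 1.05 + n)
    for n in range(1000)
]
write_batches(points)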

It would fail in the second minute almost every time, with no bump in CPU and plenty of free memory.

In issue #3346 (similar again, but it doesn't mention any 500 errors), there were some questions about continuous queries.

I tried disabling my one continuous query and the problem has so far gone away (it has been running for about 20 minutes without any 500 errors; every write has returned 204).

Can I provide more data to help diagnose the problem?

This is the continuous query:
CREATE CONTINUOUS QUERY ohlc_1m ON prices BEGIN
SELECT first(last) as open, max(bid) as high, min(ask) as low, last(last) as close INTO ohlc FROM hour.ticks GROUP BY time(1m), *
END

Thanks,
Jeff

@beckettsean
Contributor

There appear to be performance issues with large writes running concurrently with continuous queries. I think you've linked to the relevant issues, and it's informative to know that you are seeing this behavior without the Graphite or collectd plugin.

@jhedlund
Author

Thanks Sean. Any idea whether there is a workaround that still gives me the downsampling a continuous query provides?

@beckettsean
Contributor

Continuous queries are the only facility for downsampling. They are due for significant work in the 0.9.3 release, so my best advice is to limp along until August 13th, if you can.

There are CQ tuning parameters that are still poorly documented, but the names make them fairly intuitive. Have you experimented with these settings in the config file?

[continuous_queries]
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m"

@jhedlund
Author

Is there any extra logging I can turn on to see what the continuous query might be doing to cause the problem?

Maybe some of that logging would point me toward how to modify those settings...

@beckettsean
Contributor

@jhedlund I'm unaware of any different log levels right now. The internals have changed enough that the self-diagnostics and monitoring are being redone from scratch, so we might not know what's happening for another point release or two.

@jhedlund
Author

Ok, thanks.

I changed recompute-previous-n to 0 to see if that made a difference (after reading a bit about the settings here: https://github.com/influxdb/influxdb/blob/bf219cad358637b7771eced94a9ad0a7b5fa4b80/services/continuous_querier/config.go).
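Concretely, that change amounts to just this in the config (assuming the other settings in the section stay as in the block Sean posted above):

[continuous_queries]
  recompute-previous-n = 0   # was 2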

It did not make any difference. I have also tried similar CQs using time(2m) and time(5m); both fail with 500 timeouts after about a minute.

I'm trying time(1h) right now and it is holding up so far... I'll let it run longer to see if it gets into the same situation, but so far so good.

I still need the 1m, 2m, 5m, etc. rollups, but I can probably just query them from the raw measurement for a while. I'm going to see how low I can get the rollup interval before it fails.
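The stopgap would be running the same aggregation as an ad-hoc query against the raw series instead of through a CQ, something along these lines (just a sketch; the time bound is an arbitrary example):

SELECT first(last) as open, max(bid) as high, min(ask) as low, last(last) as close
FROM hour.ticks
WHERE time > now() - 1h
GROUP BY time(1m), *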

Maybe some of that provides some clues for where the issue is...

Thanks
Jeff

@jhedlund
Author

Update: the GROUP BY time(1h) version started to fail with 500 errors at about the two-hour mark, eventually locking up queries on the database as well.

@dim
Contributor

dim commented Jul 30, 2015

I managed to reproduce the problem in #3517.

@beckettsean beckettsean modified the milestones: 0.9.4, 0.9.3 Aug 6, 2015
@otoolep
Contributor

otoolep commented Sep 9, 2015

We have improved CQ performance and believe this issue has been addressed.
