[0.9.2] CQ timeout caused http write timeouts. #3469
I too am getting a timeout in continuous queries:
[continuous_querier] 2015/07/27 12:46:58 timeout
Although in my case all the writes were timing out anyway:
[http] 2015/07/27 12:46:59 23.101.29.30 - root [27/Jul/2015:12:46:54 +0000] POST /write?u=root&p=root&database=connecto&precision=ms&db=connecto HTTP/1.1 500 8 - - 8f583854-345d-11e5-9330-000000000000 5.0011198s |
Yeah, this is the 3rd time in 24 hrs it has happened. I finally noticed it in the logs, so I opted to remove that CQ. (I updated to 0.9.2 about 24 hrs ago now.) The spikes are the times of the http write timeouts :( Restarts seem to help just fine for a little while, but as I mentioned in #3469, I get ungraceful restarts w/ panics. |
@mathurs did your write timeouts start very close in time to the CQ timeout? |
No, actually my writes start timing out gradually after server start. I am writing at about 1,000 qps, and initially the writes are fine, but after a few minutes the latency gradually increases and the writes start to time out. Once that happens the continuous queries also start to time out, so I don't think the CQs cause the write timeouts. |
Try adjusting your WAL settings. I lowered my |
Tried this too, and also tried reducing other load on the server; it doesn't seem to help. In fact, my reads are also frozen now. |
@beckettsean yes. I get the same "port already in use" error, and I just have to wait for it to finally die before I can restart to clear the timeouts. I have gotten this error at least 6 times since Sunday. I did notice if I removed my CQ that was doing a |
My CQ does not have a regex. It's quite simple:
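(The statement itself wasn't preserved in this copy of the thread; below is a minimal sketch of a simple, regex-free CQ of this shape. The database, retention policy, measurement, and field names are assumptions, not the actual CQ from this report.)

```
-- Hypothetical names throughout; reads from and writes to the same retention policy,
-- which is the scenario discussed in the next sentence.
CREATE CONTINUOUS QUERY cq_requests_1m ON mydb BEGIN
  SELECT mean(value) AS mean_value
  INTO "default"."requests_1m"
  FROM "default"."requests"
  GROUP BY time(1m)
END
```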
I am selecting from and writing to the same retention policy, which means reads and writes to the same shards. Does your CQ also write to the same RP from which it pulls data? |
Yes, all my CQs and writes use my new default retention policy (everything uses the same RP). |
@mathurs are you reading from and writing to the same retention policy in your CQ that times out? Could be we have write/read lock contention at the shard level, exacerbated by constrained resource availability. |
Is there any way to report metrics on such things currently (gonna guess no after quickly looking through the shard write code), or would it require profiling the app? |
I think most of the profiling the core team does involves attaching a profiler to the running process, nothing baked in and user-facing, unfortunately. @dgnorton @corylanou @pauldix any recommendations for how @jhorwit2 could investigate the CQ locking & timeouts a bit more closely? |
#3488 has another case of this happening. Same experience last night: my CQ timed out just after a WAL flush and all writes were 500s until two restarts later. The sequence was CQ timeout, then 500s on writes. The database is fine for queries, but all telegraf writes return 500.
After the hard shutdown, a second restart works just fine, and writes are back. |
Sorry for the delay in my response. Yes, the restart log looks very similar. I have to cleanly shut down first, wait for it to reach the hard time limit, and then restart. Except in my case it's very predictable: I restart it, and within 5-10 minutes it reaches the timeout stage and everything freezes. @beckettsean, it's not just my CQs, all my writes are also timing out. |
Does anyone notice panics during this restart process? I see this: #3468 |
ok, I'm guessing this has to do with how the WAL flushes currently work. We're in the middle of working on a solution for this. Will hopefully make it into master early next week. |
Is there a ticket on that? I'm curious to see the problem / fix. |
👍 |
Not sure if I'm experiencing the same issue or a variant (with 0.9.2). I am not using continuous queries (after #3362), but seem to be having the same symptoms after running a query (occasionally). Normal writes, the sun is shining, all is good:
Then the killer query (with some vars replaced with x/y); it's not slow, it takes only ~4 ms:
Everything 500s after this point, with ~5 s response times (presumably a timeout of some kind):
And finally the hard shutdown part:
|
URL-decoded version of the query, for easier reading:
@nicksellen if you issue those queries individually does it still lead to the 500s and the shutdown? |
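(Illustration only, since the original request isn't shown above: "individually" here means sending each SELECT as its own request to the /query endpoint rather than one semicolon-separated batch. The measurement and field names are placeholders.)

```
-- One request carrying several statements (placeholder measurements):
SELECT mean(value) FROM "requests_x" WHERE time > now() - 1h GROUP BY time(1m);
SELECT mean(value) FROM "requests_y" WHERE time > now() - 1h GROUP BY time(1m)

-- Versus issuing each SELECT separately, one statement per request:
SELECT mean(value) FROM "requests_x" WHERE time > now() - 1h GROUP BY time(1m)
```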
I ended up having to turn off all my CQs because of this :(. Now writes never time out. |
@jhorwit2 CQs will be a significant focus of the 0.9.4 release. There are still a lot of rough edges there. |
@beckettsean ah, this could make sense if having a bunch of queries causes a timeout and triggers the same behaviour as a CQ causing a timeout - and I never had the problem with queries made from grafana (which are all individual). It's happening less frequently now (or maybe not at all) as I am running the queries themselves less frequently (for other reasons) - the sample size would be so small I wouldn't know if the change to separate queries had made a difference. I'll try that though if it increases in frequency again, good suggestion. And to clarify, the shutdown was always manually issued via |
@nicksellen thanks for the clarification on the hard shutdown, I'm glad InfluxDB isn't committing suicide. Something about the locking behavior that leads to the slow queries and stalled writes also seems to stall the shutdown process. I think it's less that the number of queries causes the problem, and more that the deadlocking issue becomes more likely as write volume and number of points queried rise. High enough write volume and a single query might cause the locking. Query enough points and a low write volume might still start timing out. Running four queries on different measurements covers more points than a single one, and maybe issuing them in the same statement makes the locking more likely. In any event, this performance issue is the major focus of the 0.9.3 remaining effort, and the changes to the WAL have dramatically reduced the risk in our testing. Looking forward to your results with 0.9.3 final late next week. |
I would like to share my experience on this thread. I am writing very little data (1 …). I have tested with the count function in the CQ query; it did not succeed, and I received 500s after some time. Is it likely that one of the reasons for the "500s related to the CQs" issue is the count(field) aggregate function? |
@eminden thanks for the report. It is possible that certain aggregations, like |
@beckettsean sorry for not getting back to you about what you asked. I am not able to give you concrete steps to reproduce it, but I did some tests that I would like to share; here is what I found:
|
|
@jhorwit2 that's correct, 0.9.4 is slated for major CQ work, but 0.9.3 is still focused on clustering. |
|
@eminden thanks for the clarification on #1, I see what you're saying now. It's not that AS wasn't working, it's that if the user doesn't supply aliases for multiple instances of the same function, the inserted results are wrong. Can you open a new issue for that? I don't see one currently, and we should address that in the parser or some other way in the CQs. |
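(A sketch of the aliasing in question, with hypothetical names: when the same function appears more than once in a CQ, each instance needs an explicit AS alias so the inserted results carry distinct, correctly named fields.)

```
-- Hypothetical names; without the two AS aliases, the repeated count() instances
-- produce wrong inserted results, per the comment above.
CREATE CONTINUOUS QUERY cq_traffic_5m ON mydb BEGIN
  SELECT count(requests) AS count_requests, count(errors) AS count_errors
  INTO "default"."traffic_5m"
  FROM "default"."traffic"
  GROUP BY time(5m)
END
```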
Brett, absolutely, InfluxDB recommends using multiple CQs to downsample at multiple intervals to multiple retention policies, as appropriate for your retention plans, but I personally wouldn't call that chaining, since they run concurrently on the data, not consecutively. By "chaining CQs" I mean aggregating raw into 1m, then aggregating 1m into 15m, then 15m into 1h, etc. That introduces uncertainties because each successive aggregation is summarizing a summary. Much better to aggregate raw to 1m, raw to 15m, raw to 1h, etc., all running in parallel and using raw data as the source.
If you aren't using an aggregation function, but instead a selector like FIRST, LAST, TOP, MAX, etc., then there's no real loss of fidelity from downsampling already downsampled data, although the timestamps will get muddled until #1577 is fixed. |
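(A sketch of the parallel, non-chained layout described above; the database, retention policy, and measurement names are hypothetical.)

```
-- Each CQ aggregates directly from the raw data, rather than from another CQ's output.
CREATE CONTINUOUS QUERY cq_cpu_1m ON mydb BEGIN
  SELECT mean(value) AS mean_value INTO "rp_1m"."cpu" FROM "raw"."cpu" GROUP BY time(1m)
END
CREATE CONTINUOUS QUERY cq_cpu_15m ON mydb BEGIN
  SELECT mean(value) AS mean_value INTO "rp_15m"."cpu" FROM "raw"."cpu" GROUP BY time(15m)
END
CREATE CONTINUOUS QUERY cq_cpu_1h ON mydb BEGIN
  SELECT mean(value) AS mean_value INTO "rp_1h"."cpu" FROM "raw"."cpu" GROUP BY time(1h)
END
```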
By your definition, then, we are chaining CQs; we are always downsampling "one step" from one retention policy into a longer one with larger bins. I was under the impression that this was a well-supported use case, based on @pauldix's previous post:
The remaining confusion for me is that, without doing this, we'd have to keep raw data around for the second-longest of all our retention policies. In our case, I think it's 1-week bins for a year, so we would have to keep raw data (10-second bins) for a week, rather than for an hour as we currently do. This defeats a chunk of the space savings of having cascading downsampling. Basically I guess I'm making a request that the race conditions get fixed. :-) I don't really think you're planning to punt that to the user, but this seems like a perfectly reasonable use case given the abilities of (continuous) queries to pull from specific retention policies. |
@brettdh It is a supported use case. I didn't say it was forbidden, just that it had side effects best avoided if possible. Your second point is very valid: you have to ensure that the sampled data persists for at least as long as the GROUP BY interval for the downsampled aggregations. In your case that will necessitate downsampling from already aggregated data. To be perfectly clear, I'm not at all saying you cannot run a CQ on already downsampled data, I'm just saying don't do that if you don't have to. More CQ docs are coming as part of the 0.9.4 release, stay tuned. |
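(A sketch of that constraint, with hypothetical names: the source retention policy has to hold data at least as long as the largest GROUP BY interval computed from it, here one week.)

```
-- Hypothetical names; the "raw" RP keeps data slightly longer than the 7d GROUP BY window
-- so the weekly aggregation always has a complete source interval to read from.
CREATE RETENTION POLICY "raw" ON mydb DURATION 8d REPLICATION 1
CREATE RETENTION POLICY "weekly" ON mydb DURATION 52w REPLICATION 1

CREATE CONTINUOUS QUERY cq_cpu_1w ON mydb BEGIN
  SELECT mean(value) AS mean_value INTO "weekly"."cpu" FROM "raw"."cpu" GROUP BY time(7d)
END
```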
Awesome, thanks for clarifying. 0.9.3 and 0.9.4 look to have a bunch of things that I'm really looking forward to. |
Should be solved with 0.9.3. Please reopen if that isn't the case. |
I'm probably going to wait for 0.9.4 before I try CQs again, but will do. |
I just noticed a CQ timed out randomly after running fine for the last 12 or so hours since I added it.
CQ:
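(The CQ text itself isn't shown in this copy of the thread; below is a hypothetical sketch consistent with the note further down that releaseVersion is a low-cardinality tag in this query. All names are made up.)

```
-- Illustrative only; the real CQ was not preserved here.
CREATE CONTINUOUS QUERY cq_requests_by_release_5m ON mydb BEGIN
  SELECT count(value) AS count_value
  INTO "default"."requests_by_release_5m"
  FROM "default"."requests"
  GROUP BY time(5m), releaseVersion
END
```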
The logs showed only this:
Everything after this was constant 5s timeouts on writes.
releaseVersion has relatively low cardinality (about 20 or so for this query).