[Q] Context Canceled Errors on Large Queries to Carbonserver #424
Comments
So it looks like my timeout in carbonapi was why the query was ending at 4 seconds. I bumped that up to 30 seconds hoping that would solve the problem but that doesn't look like it's the case. Now it just runs for 30 seconds before hitting context canceled. Any ideas on how to troubleshoot this? Like I said, this query runs for several seconds on the python version, never anywhere near 30 seconds. |
Hi @cthiel42, did you try "fail-on-max-globs = false"? Or maybe enable "trie-index"? |
And for |
I think I've narrowed the issue down to the Running that function locally on my machine takes several minutes with the input I'm throwing at it. I'm going to do some more debugging and figure out what can be done to speed things up. |
Can you share more details on this issue? We have been using trie for quite some time and it's doing well.
Can you also share the number of metrics and an example of what your queries look like? That would help us identify the issue. |
Oh, and one more thing: what's your version of go-carbon? There should be some more info in the logs under slow_expand_globs. You can set the logger to debug to get more information.
|
We're running 0.15.6. Are you using trie with file compression enabled too? Like I said, it's been a bit since I looked at the trie index but I want to say the problem only occurred with file compression enabled. Any queries basically just return no metrics. It effectively behaves like we don't have a single metric under our metrics directory. And as far as the query goes, I think that's part of the problem. It looks like
By my math that means the If I pull each set of those curly braces out and put a wildcard ( |
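The arithmetic behind this kind of blow-up can be sketched as follows: if every brace group is expanded into literal paths, the number of paths is the product of the number of alternatives in each group. This is only an illustration. `countExpansions` and the example query are hypothetical, nested braces are not handled, and this is not go-carbon's actual expansion code.

```go
package main

import (
	"fmt"
	"strings"
)

// countExpansions estimates how many literal paths a query would produce
// if every {a,b,c} group were fully expanded. Nested braces and integer
// overflow for extremely large products are deliberately ignored.
func countExpansions(query string) int {
	total := 1
	for {
		open := strings.IndexByte(query, '{')
		if open < 0 {
			return total
		}
		end := strings.IndexByte(query[open:], '}')
		if end < 0 {
			return total // unbalanced brace: stop counting
		}
		group := query[open+1 : open+end]
		total *= strings.Count(group, ",") + 1
		query = query[open+end+1:]
	}
}

func main() {
	// Hypothetical query: 3 * 2 * 4 alternatives.
	q := "servers.{a,b,c}.{cpu,mem}.{min,max,avg,p99}"
	fmt.Println(countExpansions(q)) // prints 24
}
```

Because the count is multiplicative, a handful of groups with a few dozen alternatives each quickly reaches millions of candidate paths.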
Nice finding. I think trie-index might be able to avoid calling
I think there is an
Yes, we do have a cluster using both compression and trie index and it's working. Would love to learn about your issue if you want to try it out again in the future. |
Alright I'll give trie index another go. When I run it, these are the kind of logs I get. Basically just doesn't find any metrics.
|
Alright, I did some more troubleshooting and found that the filepath walk it does starting here returns no files. I pulled the basics of that code out and ran it on my local machine to test that it walked the directory correctly, and it did. I ran that same script on the graphite server's whisper directory and got nothing. I ran it on a different directory on the graphite server and it listed everything fine. I have my whisper directory symlinked, and it turns out Go's walk function doesn't follow symlinked paths, as specified here. I switched my data-dir to the target of that symlink and now I'm getting metrics. Not all of them though: it looks like I'm getting some errors when the walk hits an empty folder. Any thoughts there? Maybe put some error handling in for empty folders? |
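The symlink behavior can be reproduced in isolation. The sketch below assumes a Unix-like system where creating symlinks is permitted. The directory names and the helpers `countWhisperFiles` and `demo` are hypothetical and are not go-carbon code. It walks a symlinked root with `filepath.Walk`, then walks the same root after resolving it with `filepath.EvalSymlinks`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// countWhisperFiles walks root and counts regular files ending in .wsp.
func countWhisperFiles(root string) (int, error) {
	n := 0
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && filepath.Ext(path) == ".wsp" {
			n++
		}
		return nil
	})
	return n, err
}

// demo builds a temporary tree with one whisper file and a symlink to it,
// then counts files via the symlink and via the resolved target.
func demo() (int, int, error) {
	tmp, err := os.MkdirTemp("", "walkdemo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(tmp)

	target := filepath.Join(tmp, "whisper-real")
	if err := os.MkdirAll(filepath.Join(target, "servers"), 0o755); err != nil {
		return 0, 0, err
	}
	if err := os.WriteFile(filepath.Join(target, "servers", "cpu.wsp"), nil, 0o644); err != nil {
		return 0, 0, err
	}
	link := filepath.Join(tmp, "whisper-link")
	if err := os.Symlink(target, link); err != nil {
		return 0, 0, err
	}

	viaLink, err := countWhisperFiles(link) // root is a symlink: not descended
	if err != nil {
		return 0, 0, err
	}
	resolved, err := filepath.EvalSymlinks(link)
	if err != nil {
		return 0, 0, err
	}
	viaTarget, err := countWhisperFiles(resolved)
	return viaLink, viaTarget, err
}

func main() {
	viaLink, viaTarget, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("via symlink:", viaLink, "via resolved path:", viaTarget)
}
```

On Linux this should report 0 files through the symlink and 1 through the resolved path, which is consistent with pointing data-dir at the symlink's target as a workaround.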
Hmm, I think this is logic shared by both the trie and trigram indexes, so in theory the trigram index should have the same problem. I don't understand why empty folders would cause an issue, though. Do you mean the go-carbon query API doesn't return the folder? (And thanks for the detailed report and testing!) |
Here is a log of the error I'm getting with the empty folders
|
@cthiel42 this looks like there is a whisper file with no name. But go-carbon should probably just log the error and continue indexing. I will make a patch for this issue. |
Cool, thanks for checking it out! For the empty folder, that's new to me. Not sure what the cause is. How about using the graphite API:
Does it return the expected outcome? |
When I do that query (with trie index turned on) it returns all the metrics you see in my screenshot. This doesn't really affect me much but it'd be nice to not have to worry about it. If I shut trie index off, then I get the following response
|
@bom-d-van I wanted to follow up on this since it's causing me a few query issues in our production environment and I saw it got merged into master. I think the issue in the code is somewhere in carbonserver/trie.go, lines 633 - 651.
If I change that back to what it was before the patch, all the weird behavior with empty folders is gone.
If you wanted to replicate this issue you can just run go-carbon off master, enable trie index, and put an empty folder in the whisper data directory. Off the current master branch you'll get something like this: Note that |
I would treat that as a bug. An empty directory shouldn't magically show metrics that aren't actually there.
Well, that's debatable. If a directory contains no whisper files then it should not be indexed or visible by metric definition IMO. |
Hi @cthiel42, I can't reproduce the issue. I wonder if it's because we are using different versions of the graphite dashboard? I created two empty dirs, one called ddos_nil and one called nil. They both show as empty dirs in the result. One thing we can also do is make trie prune empty dir nodes as well. |
Is your graphite-web configured to query the file system directly or configured to query carbonserver? I had to set the data directory to a bad path so graphite-web wouldn't query the file system and would instead query the URLs in the cluster server configuration. Possibly an easier alternative would be to just use carbonapi instead of graphite-web. The issue is with go-carbon, so the behavior is the same there, and it's easier to manage than graphite-web. And I have a cronjob running to clear out empty folders and stale metrics too. The problem is these will just get created again if you delete the folder, since there's a metric with no name getting sent to graphite. |
I'm using carbonserver/go-carbon. I pushed a quick fix here: #434 Maybe you want to see if this could fix your issue. |
Actually, no. Our graphite-web talks to carbonzipper. I think this might explain why I don't see the empty dir. |
@cthiel42 maybe try the patch that I posted above. |
That patch seems to get rid of the issue in my test environment. I'm going to do some more testing and maybe try it on prd overnight if everything looks good. I'll let you know how I get along. |
@bom-d-van I'm running that patch in our production environment now and everything seems to work as expected. I'll let you know if anything else comes up. Thanks for your help. |
Have there been any updates on this lately? We have a cluster with 3 billion known metrics running on 2x50 r5.2xlarge, and we're seeing these systems OOM out. |
@jdblack I never had any OOM errors. I only got context canceled due to timeouts, which I was able to solve by enabling trie index. Basically without trie index, it would try to build out an absurd amount of metric paths and never finish. This has been resolved and the patch has been merged so I'm going to close this issue. |
@jdblack wow, 3 billion?! Are they all in the same instance, or is it a distributed cluster? Can you share your config and cluster layout? |
I have a few large queries that are causing me issues. I originally had some larger queries with a lot of wildcards that would cause me errors like this:
I increased the max globs in the config and that seemed to fix the problem for most of the larger queries, but there's still one left that was causing me issues, so I increased the max globs to an absurd value. This time I get a different error though:
My question is what do I need to change to get this query to run. Skimming through the carbonserver code, it looks like this error isn't necessarily related to the value of max globs, but is more like an issue with timeouts. On our python based graphite this query takes about 8 or 9 seconds, and the timeout is set much higher than that in our config.
Our go-carbon setup is running on a 16 core box with EBS volumes for storage, and currently hovering around 8 million metrics. Here is our go-carbon config: