Live reload of rules breaks cluster and requires restart #380
Hey @jwilm - thank you for reporting the issue. We'll look into it and report back our findings.
This just hit us too. We changed one field in the Refinery configuration file, which is mounted from a Kubernetes ConfigMap. After the mounted file changed, Refinery stopped sending any trace data to Honeycomb. The Refinery container's logs showed repeated complaints like the following:
Replacing all the Refinery pods appeared to fix the problem, though restarting each Refinery container in the existing pods may have been sufficient. [Time passes ....] No, replacing the pods is not always sufficient. I noticed that the error message appears to be Refinery attempting to connect to one of its peers. Perhaps it's reading a stale set of peers from Redis.
In Refinery's transit/transit.go file, there's a method that runs when the configuration is reloaded. When that method is invoked, it calls libhoney-go's `NewBuilder` on the client. Both @maplebed and @tredman worked on this section. Could either of you say whether this is doing what was intended?

```go
d.builder = d.LibhClient.NewBuilder()
d.builder.APIHost = upstreamAPI
```

That is, it looks like the intention was to replace the current builder with a fresh one pointed at the (possibly updated) upstream API host. Now, over in collect/collect.go, we register a reload callback as well. Neither of these is quite the smoking gun I hoped to find, but this feels like the area in which we should continue looking.
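For context, here is a minimal sketch of the reload pattern being discussed. The `Config` interface with `GetHoneycombAPI` and `RegisterReloadCallback` is a hypothetical stand-in, not Refinery's actual config API; the libhoney-go calls (`Client.NewBuilder`, `Builder.APIHost`) are the ones visible in the snippet above.

```go
// Sketch only: illustrates how a config-reload callback ends up swapping the
// libhoney builder. Not Refinery's actual code.
package transmitsketch

import (
	libhoney "github.com/honeycombio/libhoney-go"
)

// Config is a hypothetical stand-in for the configuration object.
type Config interface {
	GetHoneycombAPI() (string, error) // upstream API host from the config file
	RegisterReloadCallback(func())    // fired when the config file changes on disk
}

type Transmission struct {
	LibhClient *libhoney.Client
	builder    *libhoney.Builder
	Config     Config
}

func (d *Transmission) Start() {
	d.builder = d.LibhClient.NewBuilder()
	// Re-point the builder at the (possibly changed) upstream API on every reload.
	d.Config.RegisterReloadCallback(d.reloadBuilder)
}

func (d *Transmission) reloadBuilder() {
	upstreamAPI, err := d.Config.GetHoneycombAPI()
	if err != nil {
		return // keep the old builder if the new config is unreadable
	}
	d.builder = d.LibhClient.NewBuilder()
	d.builder.APIHost = upstreamAPI
}
```

One wrinkle with replacing the builder wholesale is that libhoney events copy the builder's settings when they are created, so anything built before the reload keeps the old API host; only events built afterward see the new one.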
@seh that transit reload definitely looks sus, but I don't believe that's the cause of the live reload woes. I was able to narrow it down to the metrics reload -- if I take that out, things don't break when updating the config maps. I don't know exactly what causes the runaway goroutines, but there are 3 of those per metrics client (of which there are also 3: one for each libhoney client in use).
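To make the "runaway goroutines" hypothesis concrete, here is a purely illustrative sketch (not Refinery's code) of how a metrics reload handler can leak goroutines: each reload starts a new reporting loop without stopping the previous one, so the goroutine count grows with every config change.

```go
// Illustration of a goroutine leak on reload, and the corresponding fix.
package metricsketch

import "time"

type reporter struct {
	stop chan struct{}
}

// startReporter launches a background loop that periodically flushes metrics.
func startReporter() *reporter {
	r := &reporter{stop: make(chan struct{})}
	go func() {
		t := time.NewTicker(10 * time.Second)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				// flush metrics upstream
			case <-r.stop:
				return
			}
		}
	}()
	return r
}

// Buggy reload: never signals the old reporter's stop channel, so the old
// goroutine keeps running alongside the new one.
func reloadLeaky(old *reporter) *reporter {
	return startReporter()
}

// Correct reload: shut the old loop down before starting a replacement.
func reloadClean(old *reporter) *reporter {
	close(old.stop)
	return startReporter()
}
```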
Never mind, those are two different viper instances. 😢 The search continues...
Got to the bottom of it: we acquire two read locks in the same goroutine, and a write lock may be requested in between, so we get a deadlock and everything stalls out. I noticed that viper detects a single config change (e.g. to the rules file) as a change to both file handles (config and rules), which is why we get two change notifications in k8s >> another goroutine acquiring locks >> a better chance at deadlock. I don't know whether that's something viper doesn't handle well or a recent change to the k8s filesystem 🤷♀️ Either way, we should use the locks correctly. Code details in the upcoming PR.
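For anyone following along, the deadlock described above can be reproduced with a plain `sync.RWMutex`: if a goroutine takes a read lock twice and a writer calls `Lock` in between, the second `RLock` blocks forever, because a pending writer prevents new readers from acquiring the lock. A minimal sketch:

```go
// Demonstrates the recursive-RLock-with-pending-writer deadlock.
// Running this trips Go's runtime deadlock detector
// ("fatal error: all goroutines are asleep - deadlock!").
package main

import (
	"sync"
	"time"
)

var mu sync.RWMutex

func readTwice() {
	mu.RLock()
	defer mu.RUnlock()

	time.Sleep(100 * time.Millisecond) // give the writer time to call Lock

	mu.RLock() // blocks: the pending Lock() below excludes new readers
	defer mu.RUnlock()
}

func main() {
	go func() {
		time.Sleep(50 * time.Millisecond)
		mu.Lock() // blocks until the first RLock is released, which never happens
		defer mu.Unlock()
	}()
	readTwice()
}
```

The fix direction is the usual one: take the read lock once (or copy what you need out and release it) before anything that might trigger a write.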
We are seeing our Refinery cluster grind to a halt after reloading our rules. I ran a test to collect some data that might be useful for this report. Around 1426 PDT, I deployed the updated rules. At 1433 PDT, I restarted the deployment. Here are some of the observed behaviors:
- Upstream responses drop to near zero (screenshot attached)
- Peer communications stop (screenshot attached)
- Goroutines climb indefinitely (screenshot attached)
Versions
Steps to reproduce
Additional context
Here's the rules-based config we are testing with:
I suspect the key parts of our rules are that the traces we force to a deterministic sample rate of 100% represent a small part of our overall volume, and the TotalThroughputSampler handles most of our trace data.
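As a rough illustration of that split (an assumption about the shape of the rules, not dynsampler-go's actual algorithm): traces matching the keep rule get a deterministic sample rate of 1 (keep everything), while everything else gets a rate chosen so the kept volume lands near a throughput goal.

```go
// Simplified sketch of the two behaviors described above; the function and its
// parameters are hypothetical, not Refinery or dynsampler-go internals.
package samplersketch

// effectiveSampleRate returns 1 (keep all) for traces matching the keep rule,
// and otherwise a rate that trims observed traffic down to roughly goalPerSec.
func effectiveSampleRate(matchesKeepRule bool, observedPerSec, goalPerSec float64) float64 {
	if matchesKeepRule {
		return 1 // deterministic 100%: keep every matching trace
	}
	rate := observedPerSec / goalPerSec // e.g. 5000/sec observed with a goal of 100/sec => keep 1 in 50
	if rate < 1 {
		rate = 1
	}
	return rate
}
```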
I realized after running this test that we're a patch version behind the latest release, but it doesn't look like the latest release addresses this issue. Please let me know if you'd like me to retest with it.
Thanks for your consideration!