Live reload of rules breaks cluster and requires restart #380

Closed
jwilm opened this issue Jan 21, 2022 · 5 comments · Fixed by #398

jwilm commented Jan 21, 2022

We are seeing our Refinery cluster grind to a halt after reloading our rules. I ran a test to collect some data that might be useful for this report. Around 1426 PDT, I deployed the updated rules. At 1433 PDT, I restarted the deployment. Here are some of the observed behaviors:

Upstream responses drop to near zero (screenshot omitted)

Peer communications stop (screenshot omitted)

Goroutines climb indefinitely (screenshot omitted)

Versions

  • Go: Whatever the 1.8.0 refinery image was built with
  • Refinery: 1.8.0

Steps to reproduce

  1. Use rules based config
  2. Change the rules in any way whatsoever, even just a name change
  3. Observe cluster breaking

Additional context

Here's the rules-based config we are testing with:

    rules.yaml: |
      SampleRate: 1
      Sampler: DeterministicSampler
      production:
        Sampler: RulesBasedSampler
        rule:
        - SampleRate: 1
          condition:
          - field: name
            operator: =
            value: sidekiq.job
          - field: name
            operator: =
            value: notification delivery
          - field: notification.created_source
            operator: =
            value: dashboard
          name: Sidekiq jobs from Dashboard
        - SampleRate: 1
          condition:
          - field: service.name
            operator: =
            value: journeyx
          name: Journeyx Traces 100%
        - sampler:
            TotalThroughputSampler:
              AddSampleRateKeyToTrace: true
              AddSampleRateKeyToTraceField: meta.sampler.key
              ClearFrequencySec: 60
              FieldList:
              - http.route
              - http.status_code
              - http.method
              - exception.type
              - meta.sampler.taint
              GoalThroughputPerSec: 1
              UseTraceLength: false

I suspect the key parts of our rules are that the traces we are forcing to a deterministic sample rate of 100% represent a small part of our overall volume, and that the TotalThroughputSampler handles most of our trace data.

I realized after doing this test to collect the data that we are a patch version off of latest, but it looks like the latest release doesn't address this issue. Please let me know if you'd like me to retest with the latest release.

Thanks for your consideration!

jwilm added the "type: bug" label on Jan 21, 2022

MikeGoldsmith (Contributor) commented:

Hey @jwilm - thank you for reporting the issue. We'll look into it and report back our findings.

seh commented Feb 7, 2022

This just hit us too. We changed one field in the Refinery configuration file—mounted from a Kubernetes ConfigMap. After the mounted file changed, Refinery stopped sending all trace data to Honeycomb. The Refinery container's logs showed repeated complaints like the following:

time="2022-02-07T15:29:42Z" level=error msg="error when sending event" api_host= dataset= error="Post \"http://100.114.225.178:8081/1/batch/Infrastructure%20Operations\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" event_type= status_code=0 target=

Replacing all the Refinery pods appeared to fix the problem, though restarting each Refinery container in the existing pods may have been sufficient.

[Time passes ....]

No, replacing the pods is not always sufficient. I noticed that the error message appears to show Refinery attempting to connect to one of its peers. Perhaps it's reading a stale set of peers from Redis.

seh commented Feb 7, 2022

In Refinery's transit/transit.go file, there's the method (*DefaultTransmission).reloadTransmissionBuilder, registered as a callback for when the configuration changes.

When that method is invoked, it calls on libhoney-go's (*Client).NewBuilder. That method clones the Builder, but at the call site we drop the pointer to this fresh clone. Though the clone has some values copied over from the original Builder, it doesn't appear to share enough of its guts by reference to be useful merely for the side effect of having created it.

Both @maplebed and @tredman worked on this section. Could either of you say whether this reloadTransmissionBuilder method should have ended more like this?

  d.builder = d.LibhClient.NewBuilder()
  d.builder.APIHost = upstreamAPI

Or maybe:

  builder := d.LibhClient.NewBuilder()
  builder.APIHost = upstreamAPI
  d.builder = builder

That is, it looks like the intention was to replace the current Builder held by the DefaultTransmission with a new one. Instead, if I'm reading this correctly, we create a new Builder, but then drop it. I don't think that that alone would stop the original Builder from continuing to work as it had been. In our case we're not changing the API host, so missing that update wouldn't matter for reproducing our failure today.
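
To make the intended end state concrete, here is a rough sketch of how the whole callback might read once the new Builder is kept; the d.Config.GetHoneycombAPI() accessor and the error handling are my assumptions for illustration, not necessarily Refinery's actual code:

  // Sketch only: the config accessor and error handling are assumptions.
  func (d *DefaultTransmission) reloadTransmissionBuilder() {
      upstreamAPI, err := d.Config.GetHoneycombAPI() // hypothetical accessor
      if err != nil {
          return // keep the existing builder if the reloaded config can't be read
      }
      builder := d.LibhClient.NewBuilder()
      builder.APIHost = upstreamAPI
      d.builder = builder // keep the fresh clone instead of dropping it
  }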

Now, over in file collect/collect.go, we register the callback (*InMemCollector).sendReloadSignal. When that method is invoked, it sends a value to the reload channel. Elsewhere, when we notice that this channel is available for receiving, we reset the InMemCollector.datasetSamplers field to an empty map. As it's just a cache, we build that map up again later as we go looking for samplers for particular datasets.
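
As a simplified sketch of that signal-and-reset pattern (the Sampler type and the maybeReload helper below are placeholders of mine, not Refinery's actual code):

  package collect

  // Sampler stands in for whatever sampler interface is in use.
  type Sampler interface{}

  type InMemCollector struct {
      reload          chan struct{}
      datasetSamplers map[string]Sampler // cache of per-dataset samplers
  }

  // Registered as the config-reload callback: just signal the collector loop.
  func (i *InMemCollector) sendReloadSignal() {
      i.reload <- struct{}{}
  }

  // Called from the collector's main loop: a pending reload empties the cache,
  // so the next lookup for each dataset rebuilds its sampler from the new rules.
  func (i *InMemCollector) maybeReload() {
      select {
      case <-i.reload:
          i.datasetSamplers = make(map[string]Sampler)
      default:
      }
  }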

Neither of these are quite the smoking gun I hoped to find, but this feels like the area in which we should continue looking.

vreynolds commented Feb 8, 2022

@seh that transit reload definitely looks sus, but I don't believe that's the cause of the live reload woes.

I was able to narrow it down to the metrics reload -- if I take that out, things don't break when updating the config maps. I don't know exactly what causes the runaway goroutines, but there are 3 of those per metrics client (of which there are also 3: one for each libhoney client in use).

However, in testing this, I noticed that each config change was triggered twice 🤨 because we, apparently, subscribe to file changes twice. Removing the double file watch fixed the issue for me locally.

Never mind, those are two different viper instances. 😢 The search continues...

vreynolds self-assigned this on Feb 8, 2022

vreynolds commented:

Got to the bottom of it: we are acquiring 2 read locks in the same goroutine where a write lock may happen in between. So we get a deadlock, and everything stalls out.
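
A minimal standalone illustration of that failure mode with sync.RWMutex (just the pattern, not Refinery's code): the second RLock in the same goroutine queues behind a pending writer, while the writer is waiting on the first RLock, so nothing can make progress.

  package main

  import (
      "sync"
      "time"
  )

  func main() {
      var mu sync.RWMutex

      mu.RLock() // first read lock, held by this goroutine

      go func() {
          mu.Lock() // writer queues behind the reader above
          defer mu.Unlock()
      }()

      time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

      // Second read lock in the same goroutine: it queues behind the pending
      // writer, which is waiting for our first RLock to be released -- deadlock.
      mu.RLock()
  }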

I noticed that viper detects a single config change (e.g. rules) as a change to both file handles (config and rules), so that's why we get 2 change notifications in k8s >> another goroutine acquiring locks >> better chance at deadlock. I don't know if that's something that viper doesn't handle well, or a recent change to the k8s file system 🤷‍♀️. Either way, we should use locks correctly.

Code details in the upcoming PR.
