Live reload of rules breaks cluster and requires restart #380

Closed
jwilm opened this issue Jan 21, 2022 · 5 comments · Fixed by #398

jwilm commented Jan 21, 2022

We are seeing our Refinery cluster grind to a halt after reloading our rules. I ran a test to collect some data that might be useful for this report. Around 1426 PDT, I deployed the updated rules. At 1433 PDT, I restarted the deployment. Here are some of the observed behaviors:

Upstream responses drop to near zero (screenshot omitted)

Peer communications stop (screenshot omitted)

Goroutines climb indefinitely (screenshot omitted)

Versions

  • Go: Whatever the 1.8.0 refinery image was built with
  • Refinery: 1.8.0

Steps to reproduce

  1. Use rules based config
  2. Change the rules in any way whatsoever, even just a name change
  3. Observe cluster breaking

Additional context

Here's the rules-based config we are testing with:

    rules.yaml: |
      SampleRate: 1
      Sampler: DeterministicSampler
      production:
        Sampler: RulesBasedSampler
        rule:
        - SampleRate: 1
          condition:
          - field: name
            operator: =
            value: sidekiq.job
          - field: name
            operator: =
            value: notification delivery
          - field: notification.created_source
            operator: =
            value: dashboard
          name: Sidekiq jobs from Dashboard
        - SampleRate: 1
          condition:
          - field: service.name
            operator: =
            value: journeyx
          name: Journeyx Traces 100%
        - sampler:
            TotalThroughputSampler:
              AddSampleRateKeyToTrace: true
              AddSampleRateKeyToTraceField: meta.sampler.key
              ClearFrequencySec: 60
              FieldList:
              - http.route
              - http.status_code
              - http.method
              - exception.type
              - meta.sampler.taint
              GoalThroughputPerSec: 1
              UseTraceLength: false

I suspect the key parts of our rules are that the traces we are forcing to a deterministic sample rate of 100% represent a small part of our overall volume, and that the TotalThroughputSampler handles most of our trace data.

I realized after doing this test to collect the data that we are a patch version off of latest, but it looks like the latest release doesn't address this issue. Please let me know if you'd like me to retest with the latest release.

Thanks for your consideration!

jwilm added the "type: bug" label on Jan 21, 2022

MikeGoldsmith (Contributor) commented:

Hey @jwilm - thank you for reporting the issue. We'll look into it and report back our findings.

seh commented Feb 7, 2022

This just hit us too. We changed one field in the Refinery configuration file—mounted from a Kubernetes ConfigMap. After the mounted file changed, Refinery stopped sending all trace data to Honeycomb. The Refinery container's logs showed repeated complaints like the following:

time="2022-02-07T15:29:42Z" level=error msg="error when sending event" api_host= dataset= error="Post \"http://100.114.225.178:8081/1/batch/Infrastructure%20Operations\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" event_type= status_code=0 target=

Replacing all the Refinery pods appeared to fix the problem, though restarting each Refinery container in the existing pods may have been sufficient.

[Time passes ....]

No, replacing the pods is not always sufficient. I noticed that the error message appears to show Refinery attempting to connect to one of its peers. Perhaps it's reading a stale set of peers from Redis.

seh commented Feb 7, 2022

In Refinery's transit/transit.go file, there's the method (*DefaultTransmission).reloadTransmissionBuilder, registered as a callback for when the configuration changes.

When that method is invoked, it calls on libhoney-go's (*Client).NewBuilder. That method clones the Builder, but at the call site we drop the pointer to this fresh clone. Though the clone has some values copied over from the original Builder, it doesn't appear to share enough of its guts by reference to be useful merely for the side effect of having created it.

Both @maplebed and @tredman worked on this section. Could either of you say whether this reloadTransmissionBuilder method should have ended more like this?

  d.builder = d.LibhClient.NewBuilder()
  d.builder.APIHost = upstreamAPI

Or maybe:

  builder := d.LibhClient.NewBuilder()
  builder.APIHost = upstreamAPI
  d.builder = builder

That is, it looks like the intention was to replace the current Builder held by the DefaultTransmission with a new one. Instead, if I'm reading this correctly, we create a new Builder, but then drop it. I don't think that that alone would stop the original Builder from continuing to work as it had been. In our case we're not changing the API host, so missing that update wouldn't matter for reproducing our failure today.
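
To make the intended end state concrete, here is a rough sketch of how the whole callback might read once the new Builder is kept; the d.Config.GetHoneycombAPI() accessor and the error handling are my assumptions for illustration, not necessarily Refinery's actual code:

  // Sketch only: the config accessor and error handling are assumptions.
  func (d *DefaultTransmission) reloadTransmissionBuilder() {
      upstreamAPI, err := d.Config.GetHoneycombAPI() // hypothetical accessor
      if err != nil {
          return // keep the existing builder if the reloaded config can't be read
      }
      builder := d.LibhClient.NewBuilder()
      builder.APIHost = upstreamAPI
      d.builder = builder // keep the fresh clone instead of dropping it
  }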

Now, over in file collect/collect.go, we register the callback (*InMemCollector).sendReloadSignal. When that method is invoked, it sends a value to the reload channel. Elsewhere, when we notice that this channel is available for receiving, we reset the InMemCollector.datasetSamplers field to an empty map. As it's just a cache, we build that map up again later as we go looking for samplers for particular datasets.
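
As a simplified sketch of that signal-and-reset pattern (the Sampler type and the maybeReload helper below are placeholders of mine, not Refinery's actual code):

  package collect

  // Sampler stands in for whatever sampler interface is in use.
  type Sampler interface{}

  type InMemCollector struct {
      reload          chan struct{}
      datasetSamplers map[string]Sampler // cache of per-dataset samplers
  }

  // Registered as the config-reload callback: just signal the collector loop.
  func (i *InMemCollector) sendReloadSignal() {
      i.reload <- struct{}{}
  }

  // Called from the collector's main loop: a pending reload empties the cache,
  // so the next lookup for each dataset rebuilds its sampler from the new rules.
  func (i *InMemCollector) maybeReload() {
      select {
      case <-i.reload:
          i.datasetSamplers = make(map[string]Sampler)
      default:
      }
  }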

Neither of these are quite the smoking gun I hoped to find, but this feels like the area in which we should continue looking.

vreynolds commented Feb 8, 2022

@seh that transit reload definitely looks sus, but I don't believe that's the cause of the live reload woes.

I was able to narrow it down to the metrics reload -- if I take that out, things don't break when updating the config maps. I don't know exactly what causes the runaway goroutines, but there are 3 of those per metrics client (of which there are also 3: one for each libhoney client in use).

However, in testing this, I noticed that each config change was triggered twice 🤨 because we, apparently, subscribe to file changes twice. Removing the double file watch fixed the issue for me locally.

Never mind, those are two different viper instances. 😢 The search continues...

vreynolds self-assigned this on Feb 8, 2022

vreynolds commented:

Got to the bottom of it: we are acquiring 2 read locks in the same goroutine where a write lock may happen in between. So we get a deadlock, and everything stalls out.
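
A minimal standalone illustration of that failure mode with sync.RWMutex (just the pattern, not Refinery's code): the second RLock in the same goroutine queues behind a pending writer, while the writer is waiting on the first RLock, so nothing can make progress.

  package main

  import (
      "sync"
      "time"
  )

  func main() {
      var mu sync.RWMutex

      mu.RLock() // first read lock, held by this goroutine

      go func() {
          mu.Lock() // writer queues behind the reader above
          defer mu.Unlock()
      }()

      time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

      // Second read lock in the same goroutine: it queues behind the pending
      // writer, which is waiting for our first RLock to be released -- deadlock.
      mu.RLock()
  }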

I noticed that viper detects a single config change (e.g. rules) as a change to both file handles (config and rules), so that's why we get 2 change notifications in k8s >> another goroutine acquiring locks >> better chance at deadlock. I don't know if that's something that viper doesn't handle well, or a recent change to the k8s file system 🤷‍♀️. Either way, we should use locks correctly.

Code details in the upcoming PR.
