check threshold before a job is reported to Rollbar #319

pacoguzman · 2015-10-14T18:47:06Z

This PR implements a simple Sidekiq threshold similar to the existing for DelayedJob (#318).

If you want to reduce the noise of your Rollbar inbox for Sidekiq jobs that fails only on the first attempt but not the next ones you can define in the Rollbar configuration this threshold so only notify from a number of retry of the current failing job.

By default is disabled the threshold is 0. And for jobs that are not going to be retried we notify always to Rollbar.

Any feedback is welcome, thanks in advance

jondeandres · 2015-10-14T18:50:31Z

lib/rollbar/sidekiq.rb

@@ -5,6 +5,8 @@ class Sidekiq
    PARAM_BLACKLIST = %w[backtrace error_backtrace error_message error_class]

    def self.handle_exception(msg_or_context, e)
+      return if msg_or_context.is_a?(Hash) && msg_or_context["retry"] && msg_or_context["retry_count"] < ::Rollbar.configuration.sidekiq_threshold


Can you move this to a new method? Something like skip_report? or similar.

jondeandres · 2015-10-14T18:50:53Z

Thanks for this. Everything looks nice, just one comment in #319 (comment).

pacoguzman · 2015-10-15T07:20:17Z

Opps we need to keep raising the exception in any case so the job is retried, on it

pacoguzman · 2015-10-15T07:21:59Z

Forgot my previous comment is not true this is only how we handle the exception to report to rollbar, everything seems fine. I just extracted the code to a method as you suggested

jondeandres · 2015-10-15T17:52:02Z

@pacoguzman yes, on the #call method we reraise the exception, what should be still done. I think the PR is fine 😄. We'll merge it in next days.

Thanks!

pacoguzman · 2015-10-16T04:58:58Z

Cool you're welcome

jondeandres · 2015-10-19T23:43:09Z

@pacoguzman one doubt about this, since now I'm not sure this is the behavior you expected.

First fail on a job will be always reported to Rollbar, since the msg_or_context doesn't have retry set to true.

After the first report, Rollbar reports will be disabled for repeated job until reach the threshold, 0 by default. So for a job failing 5 times and a threshold of 3 we will have:

1 fail - reported (retry false)
2 fail - skipped (retry_count 1)
3 fail - skipped (retry_count 2)
4 fail - reported (retry_count 3)
5 fail - reported (retry_count 4)

Is this the expected behavior?

Thanks.

pacoguzman · 2015-10-20T16:59:13Z

I'll check to be 100% sure, but I think Sidekiq alwyas provide retry key and when retry key is present the retry_count even for the first attempt. We saw first errors on Rollbar with that value (retry_count = 0)

I'll confirm asap

jondeandres · 2015-10-20T17:25:34Z

Yes, Sidekiq will provide key retry for first attempt with value false. So with this logic:

msg_or_context.is_a?(Hash) && msg_or_context["retry"] && msg_or_context["retry_count"] < ::Rollbar.configuration.sidekiq_threshold

For the first attempt it's always false, that's ok.

My question was about if you expect to not skip the irst report and then skip reports until reach the threshold.

pacoguzman · 2015-10-20T18:29:49Z

No that's not the expected behaviour sorry my fault. I'll amend the changes
On Oct 20, 2015 7:25 PM, "Jon" notifications@github.com wrote:

Yes, Sidekiq will provide key retry for first attempt with value false.
So with this logic:

msg_or_context.is_a?(Hash) && msg_or_context["retry"] &&
msg_or_context["retry_count"] < ::Rollbar.configuration.sidekiq_threshold

For the first attempt it's always false, that's ok.

My question was about if you expect to not skip the irst report and
then skip reports until reach the threshold.

—
Reply to this email directly or view it on GitHub
#319 (comment).

pacoguzman · 2015-11-02T17:11:37Z

I think the behaviour you described above is not what is happening. For example

class RetryableWorker
  include Sidekiq::Worker

  sidekiq_options retry: true

  def perform
  end
end

1 fail - skipped (retry true)
2 fail - skipped (retry true retry_count 1)
3 fail - skipped (retry true retry_count 2)
4 fail - reported (retry true retry_count 3)
5 fail - reported (retry true retry_count 4)

class NoneRetryableWorker
  include Sidekiq::Worker

  sidekiq_options retry: false

  def perform
  end
end

1 fail - reported (retry false)

And this is the expected behaviour, what do you think?

jondeandres · 2015-11-09T09:06:18Z

mmm @pacoguzman, I think you are right yes 😄, sorry for the confusion.

let me review this again and we'll merge it.

Thanks!

aselder · 2015-11-23T22:20:00Z

This PR has a bug and should be fixed or reverted: #345

Fixes a bug where the first failure of a Sidekiq would always get reported regardless of the `sidekiq_threshold` setting, because then `job_hash['retry_count']` is not set. Looking at rollbar#319, the current behaviour was definitely not the expected one and I believe it wasn't the original one, but the bug was introduced by a breaking change in Sidekiq that wasn't accounted for in this gem. Back when this feature was implemented, in rollbar#319, Sidekiq retry mechanism was implemented as a Sidekiq middleware, so it would run downstream of the Rollbar middleware. When an error would bubble up, it would first encounter the Sidekiq `retry_job` middleware, which would set/increment the job's `retry_count` ([code](https://github.com/mperham/sidekiq/blob/d7d000465cd086160843fe95b8836b22d67578b6/lib/sidekiq/middleware/server/retry_jobs.rb#L107-L113) triggered [here](https://github.com/mperham/sidekiq/blob/d7d000465cd086160843fe95b8836b22d67578b6/lib/sidekiq/middleware/server/retry_jobs.rb#L83)), reraise, and only then would the exception meet the Rollbar middleware, with the current job's `retry_count` well set. Then 2 years ago, sidekiq/sidekiq#3235 introduced Sidekiq 5.0.0, and with it a breaking change: the `retry_job` would not be a middleware anymore but part of the Sidekiq processor. So it would run upstream of the Sidekiq middleware chain, including the Rollbar middleware, meaning that when an error would bubble up, it would meet the Rollbar middleware before Sidekiq's `retry_job` had set the correct `retry_count`. You'll note that I'm changing a spec about how the middleware should behave when `retry_count` is not set. That spec was introdued in rollbar#346 as a way to fix a `nil` error. It doesn't look like the intent was to actually improve the behaviour of the `sidekiq_threshold` and it seems the contributor may not have been using the feature, just working around a bug that it was causing. Anyway the behaviour specified doesn't make sense to me, and contradicts the expected behaviour described in the original implementation's PR (rollbar#346). If `retry` is truthy and `retry_count` is falsy, the behaviour should not be to ignore the threshold, but to enforce by considering that `retry_count` is 0. Otherwise we get that bug where the first failure gets reported to Rollbar regardless of the configured threshold, before then the next n-1 next failures get rightfully ignored.

pacoguzman mentioned this pull request Oct 14, 2015

Sidekiq report threshold #318

Closed

jondeandres reviewed Oct 14, 2015
View reviewed changes

pacoguzman force-pushed the sidekiq-threshold branch from 0c9f31f to 1a72043 Compare October 15, 2015 07:18

check threshold before a job is reported to Rollbar

e7784a7

pacoguzman force-pushed the sidekiq-threshold branch from 1a72043 to e7784a7 Compare October 15, 2015 10:31

jondeandres merged commit e7784a7 into rollbar:master Nov 9, 2015

ejoubaud mentioned this pull request Jul 2, 2018

Fix sidekiq_threshold ignored on first failure #761

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

check threshold before a job is reported to Rollbar #319

check threshold before a job is reported to Rollbar #319

pacoguzman commented Oct 14, 2015

jondeandres Oct 14, 2015

jondeandres commented Oct 14, 2015

pacoguzman commented Oct 15, 2015

pacoguzman commented Oct 15, 2015

jondeandres commented Oct 15, 2015

pacoguzman commented Oct 16, 2015

jondeandres commented Oct 19, 2015

pacoguzman commented Oct 20, 2015

jondeandres commented Oct 20, 2015

pacoguzman commented Oct 20, 2015

pacoguzman commented Nov 2, 2015

jondeandres commented Nov 9, 2015

aselder commented Nov 23, 2015

check threshold before a job is reported to Rollbar #319

check threshold before a job is reported to Rollbar #319

Conversation

pacoguzman commented Oct 14, 2015

jondeandres Oct 14, 2015

Choose a reason for hiding this comment

jondeandres commented Oct 14, 2015

pacoguzman commented Oct 15, 2015

pacoguzman commented Oct 15, 2015

jondeandres commented Oct 15, 2015

pacoguzman commented Oct 16, 2015

jondeandres commented Oct 19, 2015

pacoguzman commented Oct 20, 2015

jondeandres commented Oct 20, 2015

pacoguzman commented Oct 20, 2015

pacoguzman commented Nov 2, 2015

jondeandres commented Nov 9, 2015

aselder commented Nov 23, 2015