Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sidekiq_threshold ignored on first failure #761

Merged

Conversation

ejoubaud
Copy link
Contributor

@ejoubaud ejoubaud commented Jul 2, 2018

Summary

Fixes a bug where the first failure of a Sidekiq job would always get reported regardless of the sidekiq_threshold setting, because then job_hash['retry_count'] is not set.

Looking at #319, the current behaviour was definitely not the expected one and I believe it wasn't the original one, but that the bug was introduced by a breaking change in Sidekiq that wasn't accounted for in this gem.

Details

Back when this feature was implemented, in #319, Sidekiq retry mechanism was implemented as a Sidekiq middleware, so it could run downstream of the Rollbar middleware. When an error would bubble up, it would first encounter the Sidekiq job_retry middleware, which would set/increment the job's retry_count (code triggered here), reraise, and only then would the exception meet the Rollbar middleware, with the current job's retry_count well set.

Then 2 years ago, sidekiq/sidekiq#3235 introduced Sidekiq 5.0.0, and with it a breaking change: the retry_job would not be a middleware anymore but part of the Sidekiq processor. So it would run upstream of the Sidekiq middleware chain, including the Rollbar middleware, meaning that when an error would bubble up, it would meet the Rollbar middleware before Sidekiq's job_retry had set the correct retry_count.

# current excerpt of a Sidekiq job exception stacktrace
# The * mark the Rollbar middleware and job_retry processor

WARN: SomeErrorClass
TID-owslcdvyl
WARN: /Users/me/src/my_app/app/jobs/my_job.rb:7:in `perform'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:187:in `execute_job'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:169:in `block (2 levels) in process'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:128:in `block in invoke'
/Users/me/src/my_app/lib/sidekiq/middleware/server/my_custom_middleware.rb:10:in `call'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
* /opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/rollbar-2.16.2/lib/rollbar/plugins/sidekiq/plugin.rb:11:in `call'    < Rollbar middleware handling error here
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/newrelic_rpm-3.18.1.330/lib/new_relic/agent/instrumentation/sidekiq.rb:33:in `block in call'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/newrelic_rpm-3.18.1.330/lib/new_relic/agent/instrumentation/controller_instrumentation.rb:363:in `perform_action_with_newrelic_trace'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/newrelic_rpm-3.18.1.330/lib/new_relic/agent/instrumentation/sidekiq.rb:29:in `call'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:130:in `block in invoke'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/middleware/chain.rb:133:in `invoke'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:168:in `block in process'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:139:in `block (6 levels) in dispatch'
* /opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:98:in `local'    < retry_count will be set here
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:138:in `block (5 levels) in dispatch'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/rails.rb:42:in `block in call'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/activesupport-5.0.7/lib/active_support/execution_wrapper.rb:85:in `wrap'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/activesupport-5.0.7/lib/active_support/reloader.rb:68:in `block in wrap'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/activesupport-5.0.7/lib/active_support/execution_wrapper.rb:85:in `wrap'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/activesupport-5.0.7/lib/active_support/reloader.rb:67:in `wrap'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/rails.rb:41:in `call'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:134:in `block (4 levels) in dispatch'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:199:in `stats'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:129:in `block (3 levels) in dispatch'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/job_logger.rb:8:in `call'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:128:in `block (2 levels) in dispatch'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/job_retry.rb:73:in `global'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:127:in `block in dispatch'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:48:in `with_context'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/logging.rb:42:in `with_job_hash_context'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:126:in `dispatch'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:167:in `process'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:85:in `process_one'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/processor.rb:73:in `run'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:16:in `watchdog'
/opt/rubies/2.3.3/lib/ruby/gems/2.3.0/gems/sidekiq-5.1.3/lib/sidekiq/util.rb:25:in `block in safe_thread'

You'll note that I'm changing a spec about how the middleware should behave when retry_count is not set. That spec was introdued in #346 as a way to fix a nil error. It doesn't look like the intent was to actually improve the behaviour of the sidekiq_threshold and it seems the contributor may not have been using the feature, just working around a bug that it was causing. Anyway the behaviour specified doesn't make sense to me, and contradicts the expected behaviour described in the original implementation's PR (#319). If retry is truthy and retry_count is falsy, the behaviour should not be to ignore the threshold, but to enforce it by considering that retry_count is 0. Otherwise we get that bug where the first failure gets reported to Rollbar regardless of the configured threshold, before then the next n-1 next failures get rightfully ignored.

Fixes a bug where the first failure of a Sidekiq would always get reported regardless of the `sidekiq_threshold` setting, because then `job_hash['retry_count']` is not set.

Looking at rollbar#319, the current behaviour was definitely not the expected one and I believe it wasn't the original one, but the bug was introduced by a breaking change in Sidekiq that wasn't accounted for in this gem.

Back when this feature was implemented, in rollbar#319, Sidekiq retry mechanism was implemented as a Sidekiq middleware, so it would run downstream of the Rollbar middleware. When an error would bubble up, it would first encounter the Sidekiq `retry_job` middleware, which would set/increment the job's `retry_count` ([code](https://github.com/mperham/sidekiq/blob/d7d000465cd086160843fe95b8836b22d67578b6/lib/sidekiq/middleware/server/retry_jobs.rb#L107-L113) triggered [here](https://github.com/mperham/sidekiq/blob/d7d000465cd086160843fe95b8836b22d67578b6/lib/sidekiq/middleware/server/retry_jobs.rb#L83)), reraise, and only then would the exception meet the Rollbar middleware, with the current job's `retry_count` well set.

Then 2 years ago, sidekiq/sidekiq#3235 introduced Sidekiq 5.0.0, and with it a breaking change: the `retry_job` would not be a middleware anymore but part of the Sidekiq processor. So it would run upstream of the Sidekiq middleware chain, including the Rollbar middleware, meaning that when an error would bubble up, it would meet the Rollbar middleware before Sidekiq's `retry_job` had set the correct `retry_count`.

You'll note that I'm changing a spec about how the middleware should behave when `retry_count` is not set. That spec was introdued in rollbar#346 as a way to fix a `nil` error. It doesn't look like the intent was to actually improve the behaviour of the `sidekiq_threshold` and it seems the contributor may not have been using the feature, just working around a bug that it was causing. Anyway the behaviour specified doesn't make sense to me, and contradicts the expected behaviour described in the original implementation's PR (rollbar#346). If `retry` is truthy and `retry_count` is falsy, the behaviour should not be to ignore the threshold, but to enforce by considering that `retry_count` is 0. Otherwise we get that bug where the first failure gets reported to Rollbar regardless of the configured threshold, before then the next n-1 next failures get rightfully ignored.
@rivkahstandig3636
Copy link
Contributor

Thank you for this PR! We'll check it out :)

@johnsyweb
Copy link
Contributor

Thanks @ejoubaud. I think this looks good. 👍

@jessewgibbs
Copy link
Contributor

@ArturMoczulski could you take a look at this PR?

@rivkahstandig3636 rivkahstandig3636 added this to the v2.16.5 milestone Aug 9, 2018
@ArturMoczulski
Copy link
Contributor

This looks great. Thanks, @ejoubaud for contributing. Merging in.

@ArturMoczulski ArturMoczulski merged commit 8a6d7da into rollbar:master Aug 25, 2018
@ArturMoczulski ArturMoczulski modified the milestones: v2.18.0, v2.17.0 Aug 27, 2018
Erol added a commit to Erol/rollbar-gem that referenced this pull request Jun 17, 2021
There was a bug introduced on rollbar#761 which prevented errors on being
report from jobs that are running for the first time (not a retry)

mainly the following:

* `retry` would be set to a falsey value if the job is not a retry
attempt. This is wrong.
* `retry_count` would not be present in a retry attempt.

Here's a sample of a message hash for a job running first time with
retries set:

```
{"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[],
"jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288,
"enqueued_at"=>1623896659.150045}
```

While this is the message hash for a job that has been retried first
time:

```
{"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[],
"jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288,
"enqueued_at"=>1623896687.795098, "error_message"=>"undefined local
variable or method `meh' for #<TestWorker:0x00007fe79eeb7480>",
"error_class"=>"NameError", "failed_at"=>1623896663.811552,
"retry_count"=>0}
```

This commit aims to correct the bug introduced and allow failed jobs to
report errors when they are running for the first time.
Erol added a commit to Erol/rollbar-gem that referenced this pull request Jun 17, 2021
There was a bug introduced on rollbar#761 which prevented errors on being
reported from jobs that are running for the first time (not a retry).

Some wrong assumptions were made on the behavior of retries on rollbar#761,
mainly the following:

* `retry` would be set to a falsey value if the job is not a retry
attempt. This is wrong.
* `retry_count` would not be present in a retry attempt.

Here's a sample of a message hash for a job running first time with
retries set:

```
{"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[],
"jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288,
"enqueued_at"=>1623896659.150045}
```

While this is the message hash for a job that has been retried first
time:

```
{"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[],
"jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288,
"enqueued_at"=>1623896687.795098, "error_message"=>"undefined local
variable or method `meh' for #<TestWorker:0x00007fe79eeb7480>",
"error_class"=>"NameError", "failed_at"=>1623896663.811552,
"retry_count"=>0}
```

This commit aims to correct the bug introduced and allow failed jobs to
report errors when they are running for the first time.
Erol added a commit to Erol/rollbar-gem that referenced this pull request Jun 17, 2021
There was a bug introduced on rollbar#761 which prevented errors on being
reported from jobs that are running for the first time (not a retry).

Some wrong assumptions were made on the behavior of retries on rollbar#761,
mainly the following:

* `retry` would be set to a falsey value if the job is not a retry
attempt. This is wrong.
* `retry_count` would not be present in a retry attempt.

Here's a sample of a message hash for a job running first time with
retries set:

```
{"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[],
"jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288,
"enqueued_at"=>1623896659.150045}
```

While this is the message hash for a job that has been retried first
time:

```
{"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[],
"jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288,
"enqueued_at"=>1623896687.795098, "error_message"=>"undefined local
variable or method `meh' for #<TestWorker:0x00007fe79eeb7480>",
"error_class"=>"NameError", "failed_at"=>1623896663.811552,
"retry_count"=>0}
```

This commit aims to correct the bug introduced and allow failed jobs to
report errors when they are running for the first time.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants