-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix sidekiq_threshold ignored on first failure #761
Merged
ArturMoczulski
merged 1 commit into
rollbar:master
from
ejoubaud:fix-rollbar-sidekiq-threshold
Aug 25, 2018
Merged
Fix sidekiq_threshold ignored on first failure #761
ArturMoczulski
merged 1 commit into
rollbar:master
from
ejoubaud:fix-rollbar-sidekiq-threshold
Aug 25, 2018
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fixes a bug where the first failure of a Sidekiq would always get reported regardless of the `sidekiq_threshold` setting, because then `job_hash['retry_count']` is not set. Looking at rollbar#319, the current behaviour was definitely not the expected one and I believe it wasn't the original one, but the bug was introduced by a breaking change in Sidekiq that wasn't accounted for in this gem. Back when this feature was implemented, in rollbar#319, Sidekiq retry mechanism was implemented as a Sidekiq middleware, so it would run downstream of the Rollbar middleware. When an error would bubble up, it would first encounter the Sidekiq `retry_job` middleware, which would set/increment the job's `retry_count` ([code](https://github.com/mperham/sidekiq/blob/d7d000465cd086160843fe95b8836b22d67578b6/lib/sidekiq/middleware/server/retry_jobs.rb#L107-L113) triggered [here](https://github.com/mperham/sidekiq/blob/d7d000465cd086160843fe95b8836b22d67578b6/lib/sidekiq/middleware/server/retry_jobs.rb#L83)), reraise, and only then would the exception meet the Rollbar middleware, with the current job's `retry_count` well set. Then 2 years ago, sidekiq/sidekiq#3235 introduced Sidekiq 5.0.0, and with it a breaking change: the `retry_job` would not be a middleware anymore but part of the Sidekiq processor. So it would run upstream of the Sidekiq middleware chain, including the Rollbar middleware, meaning that when an error would bubble up, it would meet the Rollbar middleware before Sidekiq's `retry_job` had set the correct `retry_count`. You'll note that I'm changing a spec about how the middleware should behave when `retry_count` is not set. That spec was introdued in rollbar#346 as a way to fix a `nil` error. It doesn't look like the intent was to actually improve the behaviour of the `sidekiq_threshold` and it seems the contributor may not have been using the feature, just working around a bug that it was causing. Anyway the behaviour specified doesn't make sense to me, and contradicts the expected behaviour described in the original implementation's PR (rollbar#346). If `retry` is truthy and `retry_count` is falsy, the behaviour should not be to ignore the threshold, but to enforce by considering that `retry_count` is 0. Otherwise we get that bug where the first failure gets reported to Rollbar regardless of the configured threshold, before then the next n-1 next failures get rightfully ignored.
Thank you for this PR! We'll check it out :) |
Thanks @ejoubaud. I think this looks good. 👍 |
@ArturMoczulski could you take a look at this PR? |
This looks great. Thanks, @ejoubaud for contributing. Merging in. |
Erol
added a commit
to Erol/rollbar-gem
that referenced
this pull request
Jun 17, 2021
There was a bug introduced on rollbar#761 which prevented errors on being report from jobs that are running for the first time (not a retry) mainly the following: * `retry` would be set to a falsey value if the job is not a retry attempt. This is wrong. * `retry_count` would not be present in a retry attempt. Here's a sample of a message hash for a job running first time with retries set: ``` {"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[], "jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288, "enqueued_at"=>1623896659.150045} ``` While this is the message hash for a job that has been retried first time: ``` {"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[], "jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288, "enqueued_at"=>1623896687.795098, "error_message"=>"undefined local variable or method `meh' for #<TestWorker:0x00007fe79eeb7480>", "error_class"=>"NameError", "failed_at"=>1623896663.811552, "retry_count"=>0} ``` This commit aims to correct the bug introduced and allow failed jobs to report errors when they are running for the first time.
Erol
added a commit
to Erol/rollbar-gem
that referenced
this pull request
Jun 17, 2021
There was a bug introduced on rollbar#761 which prevented errors on being reported from jobs that are running for the first time (not a retry). Some wrong assumptions were made on the behavior of retries on rollbar#761, mainly the following: * `retry` would be set to a falsey value if the job is not a retry attempt. This is wrong. * `retry_count` would not be present in a retry attempt. Here's a sample of a message hash for a job running first time with retries set: ``` {"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[], "jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288, "enqueued_at"=>1623896659.150045} ``` While this is the message hash for a job that has been retried first time: ``` {"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[], "jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288, "enqueued_at"=>1623896687.795098, "error_message"=>"undefined local variable or method `meh' for #<TestWorker:0x00007fe79eeb7480>", "error_class"=>"NameError", "failed_at"=>1623896663.811552, "retry_count"=>0} ``` This commit aims to correct the bug introduced and allow failed jobs to report errors when they are running for the first time.
Erol
added a commit
to Erol/rollbar-gem
that referenced
this pull request
Jun 17, 2021
There was a bug introduced on rollbar#761 which prevented errors on being reported from jobs that are running for the first time (not a retry). Some wrong assumptions were made on the behavior of retries on rollbar#761, mainly the following: * `retry` would be set to a falsey value if the job is not a retry attempt. This is wrong. * `retry_count` would not be present in a retry attempt. Here's a sample of a message hash for a job running first time with retries set: ``` {"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[], "jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288, "enqueued_at"=>1623896659.150045} ``` While this is the message hash for a job that has been retried first time: ``` {"retry"=>3, "queue"=>"test", "class"=>"TestWorker", "args"=>[], "jid"=>"db020cf431b76e59f12b54d3", "created_at"=>1623896659.149288, "enqueued_at"=>1623896687.795098, "error_message"=>"undefined local variable or method `meh' for #<TestWorker:0x00007fe79eeb7480>", "error_class"=>"NameError", "failed_at"=>1623896663.811552, "retry_count"=>0} ``` This commit aims to correct the bug introduced and allow failed jobs to report errors when they are running for the first time.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a bug where the first failure of a Sidekiq job would always get reported regardless of the
sidekiq_threshold
setting, because thenjob_hash['retry_count']
is not set.Looking at #319, the current behaviour was definitely not the expected one and I believe it wasn't the original one, but that the bug was introduced by a breaking change in Sidekiq that wasn't accounted for in this gem.
Details
Back when this feature was implemented, in #319, Sidekiq retry mechanism was implemented as a Sidekiq middleware, so it could run downstream of the Rollbar middleware. When an error would bubble up, it would first encounter the Sidekiq
job_retry
middleware, which would set/increment the job'sretry_count
(code triggered here), reraise, and only then would the exception meet the Rollbar middleware, with the current job'sretry_count
well set.Then 2 years ago, sidekiq/sidekiq#3235 introduced Sidekiq 5.0.0, and with it a breaking change: the
retry_job
would not be a middleware anymore but part of the Sidekiq processor. So it would run upstream of the Sidekiq middleware chain, including the Rollbar middleware, meaning that when an error would bubble up, it would meet the Rollbar middleware before Sidekiq'sjob_retry
had set the correctretry_count
.You'll note that I'm changing a spec about how the middleware should behave when
retry_count
is not set. That spec was introdued in #346 as a way to fix anil
error. It doesn't look like the intent was to actually improve the behaviour of thesidekiq_threshold
and it seems the contributor may not have been using the feature, just working around a bug that it was causing. Anyway the behaviour specified doesn't make sense to me, and contradicts the expected behaviour described in the original implementation's PR (#319). Ifretry
is truthy andretry_count
is falsy, the behaviour should not be to ignore the threshold, but to enforce it by considering thatretry_count
is 0. Otherwise we get that bug where the first failure gets reported to Rollbar regardless of the configured threshold, before then the next n-1 next failures get rightfully ignored.