-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove RISC network errors from NewRelic #10806
Conversation
app/jobs/risc_delivery_job.rb
Outdated
@@ -53,8 +53,6 @@ def perform( | |||
user:, | |||
) | |||
rescue *NETWORK_ERRORS => err | |||
raise err if self.executions < 2 && !inline? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does removing this affect the retry logic? I think it may depend on the job failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about that as well.
I also have a question here about whether we would also like to reduce the events logged.
The best approach seems to be:
Provide a block to retry_on
to log the event after the last attempt fails. Don't do any rescue
within the perform
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 statements in the Rails ActiveJob guide do seem to indicate that @mitchellhenke is right:
If an exception from a job is not rescued, then the job is referred to as "failed".
A failed job will not be retried, unless configured otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging is fine to keep for every try imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, and looking at the ActiveJob source code for retry_on
, I think the way to stop reporting exceptions to NewRelic would be to add a block to our retry_on
up top and make it a no-op, because the block is only called on the last/final time, so providing a block would be tell it to not keep raising
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the GoodJob
documentation:
When using retry_on with a limited number of retries, the final exception will not be rescued and will raise to GoodJob's error handler. To avoid this, pass a block to retry_on to handle the final exception instead of raising it to GoodJob:
class ApplicationJob < ActiveJob::Base
retry_on StandardError, attempts: 5 do |_job, _exception|
# Log error, do nothing, etc.
end
# ...
end
Perhaps the right thing to do is to leave the existing rescue
handler in, and use the block argument to retry_on
to do nothing. This would work if the errors we are seeing on NewRelic is only after the retries have failed. I'll verify that this is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jinx
26499bf
to
8b8592a
Compare
app/jobs/risc_delivery_job.rb
Outdated
) do |_job, _exception| | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little unclear from this why we have a blank block, I think this would be a good candidate for a comment, or since we have 2, extracing a proc into a variable with a clear name that indicates what's going on
FINAL_EXCEPTION_HANDLER_NO_OP = proc { |_job, _exception| }.freeze
retry_on(
*NETWORK_ERRORS,
wait: :polynomially_longer,
attempts: 2,
&FINAL_EXCEPTION_HANDLER_NO_OP
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, a comment would be clear and concise, but if a named Proc
would be idiomatic, I'll go with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's 6 of one, half a dozen of the other. I don't think one is more idiomatic
spec/jobs/risc_delivery_job_spec.rb
Outdated
@@ -96,6 +96,7 @@ | |||
|
|||
it 'logs an event' do | |||
expect { perform }.to_not raise_error | |||
expect(NewRelic::Agent).not_to receive(:notice_error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the code in this test never triggers a call to NewRelic (that happens as part of the GoodJob.on_thread_error
that we set elsewhere)
So since this wouldn't catch a regression, I would remove it
8b8592a
to
6fdbb5a
Compare
See https://gitlab.login.gov/lg-people/lg-people-appdev/Melba/backlog-fy24/-/issues/37 **Why**: Network errors encountered when posting RISC events result in a lot of noise in NewRelic. They are not of much value in NewRelic as they are also logged as risc_security_event_pushed events. **How**: Don't bubble up the network errors. Do nothing when GoodJob exhausts its reties. changelog: Internal, Error handling, Remove RISC network errors from NewRelic
6fdbb5a
to
2bf8513
Compare
🎫 Ticket
Link to the relevant ticket:
https://gitlab.login.gov/lg-people/lg-people-appdev/Melba/backlog-fy24/-/issues/37
🛠 Summary of changes
Don't bubble up the network errors. Let ActiveJob retry based on its configuration.
📜 Testing Plan
Provide a checklist of steps to confirm the changes.