Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove RISC network errors from NewRelic #10806

Merged
merged 1 commit into from
Jun 17, 2024

Conversation

vrajmohan
Copy link
Member

🎫 Ticket

Link to the relevant ticket:
https://gitlab.login.gov/lg-people/lg-people-appdev/Melba/backlog-fy24/-/issues/37

🛠 Summary of changes

Don't bubble up the network errors. Let ActiveJob retry based on its configuration.

📜 Testing Plan

Provide a checklist of steps to confirm the changes.

  • Network errors in the RISC delivery job should not show up in NewRelic errors dashboard

@@ -53,8 +53,6 @@ def perform(
user:,
)
rescue *NETWORK_ERRORS => err
raise err if self.executions < 2 && !inline?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does removing this affect the retry logic? I think it may depend on the job failing.

Copy link
Member Author

@vrajmohan vrajmohan Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that as well.

I also have a question here about whether we would also like to reduce the events logged.

The best approach seems to be:
Provide a block to retry_on to log the event after the last attempt fails. Don't do any rescue within the perform.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 statements in the Rails ActiveJob guide do seem to indicate that @mitchellhenke is right:

If an exception from a job is not rescued, then the job is referred to as "failed".

A failed job will not be retried, unless configured otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging is fine to keep for every try imo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this, and looking at the ActiveJob source code for retry_on, I think the way to stop reporting exceptions to NewRelic would be to add a block to our retry_on up top and make it a no-op, because the block is only called on the last/final time, so providing a block would be tell it to not keep raising

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the GoodJob documentation:

When using retry_on with a limited number of retries, the final exception will not be rescued and will raise to GoodJob's error handler. To avoid this, pass a block to retry_on to handle the final exception instead of raising it to GoodJob:

class ApplicationJob < ActiveJob::Base
  retry_on StandardError, attempts: 5 do |_job, _exception|
    # Log error, do nothing, etc.
  end
  # ...
end

Perhaps the right thing to do is to leave the existing rescue handler in, and use the block argument to retry_on to do nothing. This would work if the errors we are seeing on NewRelic is only after the retries have failed. I'll verify that this is true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jinx

@vrajmohan vrajmohan force-pushed the dont-bubble-up-risc-network-errors branch from 26499bf to 8b8592a Compare June 14, 2024 01:19
@vrajmohan vrajmohan marked this pull request as draft June 14, 2024 01:21
Comment on lines 17 to 19
) do |_job, _exception|
end
Copy link
Contributor

@zachmargolis zachmargolis Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little unclear from this why we have a blank block, I think this would be a good candidate for a comment, or since we have 2, extracing a proc into a variable with a clear name that indicates what's going on

  FINAL_EXCEPTION_HANDLER_NO_OP = proc { |_job, _exception| }.freeze

  retry_on(
    *NETWORK_ERRORS,
    wait: :polynomially_longer,
    attempts: 2,
    &FINAL_EXCEPTION_HANDLER_NO_OP
  )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, a comment would be clear and concise, but if a named Proc would be idiomatic, I'll go with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's 6 of one, half a dozen of the other. I don't think one is more idiomatic

@@ -96,6 +96,7 @@

it 'logs an event' do
expect { perform }.to_not raise_error
expect(NewRelic::Agent).not_to receive(:notice_error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the code in this test never triggers a call to NewRelic (that happens as part of the GoodJob.on_thread_error that we set elsewhere)

So since this wouldn't catch a regression, I would remove it

@vrajmohan vrajmohan force-pushed the dont-bubble-up-risc-network-errors branch from 8b8592a to 6fdbb5a Compare June 14, 2024 23:49
See
https://gitlab.login.gov/lg-people/lg-people-appdev/Melba/backlog-fy24/-/issues/37

**Why**:
Network errors encountered when posting RISC events result in a lot of
noise in NewRelic. They are not of much value in NewRelic as they
are also logged as risc_security_event_pushed events.

**How**:
Don't bubble up the network errors. Do nothing when GoodJob exhausts its
reties.

changelog: Internal, Error handling, Remove RISC network errors from NewRelic
@vrajmohan vrajmohan force-pushed the dont-bubble-up-risc-network-errors branch from 6fdbb5a to 2bf8513 Compare June 14, 2024 23:49
@vrajmohan vrajmohan marked this pull request as ready for review June 15, 2024 11:36
@vrajmohan vrajmohan merged commit f790276 into main Jun 17, 2024
2 checks passed
@vrajmohan vrajmohan deleted the dont-bubble-up-risc-network-errors branch June 17, 2024 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants