Remove RISC network errors from NewRelic #10806

vrajmohan · 2024-06-12T14:38:05Z

🎫 Ticket

Link to the relevant ticket:
https://gitlab.login.gov/lg-people/lg-people-appdev/Melba/backlog-fy24/-/issues/37

🛠 Summary of changes

Don't bubble up the network errors. Let ActiveJob retry based on its configuration.

📜 Testing Plan

Provide a checklist of steps to confirm the changes.

Network errors in the RISC delivery job should not show up in NewRelic errors dashboard

mitchellhenke · 2024-06-12T14:43:36Z

app/jobs/risc_delivery_job.rb

@@ -53,8 +53,6 @@ def perform(
      user:,
    )
  rescue *NETWORK_ERRORS => err
-    raise err if self.executions < 2 && !inline?


Does removing this affect the retry logic? I think it may depend on the job failing.

I was wondering about that as well.

I also have a question here about whether we would also like to reduce the events logged.

The best approach seems to be:
Provide a block to retry_on to log the event after the last attempt fails. Don't do any rescue within the perform.

These 2 statements in the Rails ActiveJob guide do seem to indicate that @mitchellhenke is right:

If an exception from a job is not rescued, then the job is referred to as "failed".

A failed job will not be retried, unless configured otherwise.

Logging is fine to keep for every try imo

Looking at this, and looking at the ActiveJob source code for retry_on, I think the way to stop reporting exceptions to NewRelic would be to add a block to our retry_on up top and make it a no-op, because the block is only called on the last/final time, so providing a block would be tell it to not keep raising

From the GoodJob documentation:

When using retry_on with a limited number of retries, the final exception will not be rescued and will raise to GoodJob's error handler. To avoid this, pass a block to retry_on to handle the final exception instead of raising it to GoodJob:

class ApplicationJob < ActiveJob::Base retry_on StandardError, attempts: 5 do |_job, _exception| # Log error, do nothing, etc. end # ... end

Perhaps the right thing to do is to leave the existing rescue handler in, and use the block argument to retry_on to do nothing. This would work if the errors we are seeing on NewRelic is only after the retries have failed. I'll verify that this is true.

zachmargolis · 2024-06-14T16:15:35Z

app/jobs/risc_delivery_job.rb

+  ) do |_job, _exception|
+  end


It's a little unclear from this why we have a blank block, I think this would be a good candidate for a comment, or since we have 2, extracing a proc into a variable with a clear name that indicates what's going on

FINAL_EXCEPTION_HANDLER_NO_OP = proc { |_job, _exception| }.freeze retry_on( *NETWORK_ERRORS, wait: :polynomially_longer, attempts: 2, &FINAL_EXCEPTION_HANDLER_NO_OP )

To me, a comment would be clear and concise, but if a named Proc would be idiomatic, I'll go with that.

It's 6 of one, half a dozen of the other. I don't think one is more idiomatic

zachmargolis · 2024-06-14T16:18:12Z

spec/jobs/risc_delivery_job_spec.rb

@@ -96,6 +96,7 @@

          it 'logs an event' do
            expect { perform }.to_not raise_error
+            expect(NewRelic::Agent).not_to receive(:notice_error)


So the code in this test never triggers a call to NewRelic (that happens as part of the GoodJob.on_thread_error that we set elsewhere)

So since this wouldn't catch a regression, I would remove it

See https://gitlab.login.gov/lg-people/lg-people-appdev/Melba/backlog-fy24/-/issues/37 **Why**: Network errors encountered when posting RISC events result in a lot of noise in NewRelic. They are not of much value in NewRelic as they are also logged as risc_security_event_pushed events. **How**: Don't bubble up the network errors. Do nothing when GoodJob exhausts its reties. changelog: Internal, Error handling, Remove RISC network errors from NewRelic

vrajmohan requested a review from zachmargolis June 12, 2024 14:38

mitchellhenke reviewed Jun 12, 2024

View reviewed changes

vrajmohan force-pushed the dont-bubble-up-risc-network-errors branch from 26499bf to 8b8592a Compare June 14, 2024 01:19

vrajmohan marked this pull request as draft June 14, 2024 01:21

zachmargolis reviewed Jun 14, 2024

View reviewed changes

vrajmohan force-pushed the dont-bubble-up-risc-network-errors branch from 8b8592a to 6fdbb5a Compare June 14, 2024 23:49

vrajmohan force-pushed the dont-bubble-up-risc-network-errors branch from 6fdbb5a to 2bf8513 Compare June 14, 2024 23:49

vrajmohan marked this pull request as ready for review June 15, 2024 11:36

zachmargolis approved these changes Jun 17, 2024

View reviewed changes

vrajmohan merged commit f790276 into main Jun 17, 2024
2 checks passed

vrajmohan deleted the dont-bubble-up-risc-network-errors branch June 17, 2024 19:51

This was referenced Jun 20, 2024

Deploy RC 391 to Production #10840

Closed

Deploy RC 391 to Production #10845

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove RISC network errors from NewRelic #10806

Remove RISC network errors from NewRelic #10806

vrajmohan commented Jun 12, 2024

mitchellhenke Jun 12, 2024

vrajmohan Jun 12, 2024 •

edited

Loading

vrajmohan Jun 12, 2024

mitchellhenke Jun 12, 2024

zachmargolis Jun 12, 2024

vrajmohan Jun 12, 2024

vrajmohan Jun 12, 2024

zachmargolis Jun 14, 2024 •

edited

Loading

vrajmohan Jun 14, 2024

zachmargolis Jun 14, 2024

zachmargolis Jun 14, 2024

Remove RISC network errors from NewRelic #10806

Remove RISC network errors from NewRelic #10806

Conversation

vrajmohan commented Jun 12, 2024

🎫 Ticket

🛠 Summary of changes

📜 Testing Plan

Choose a reason for hiding this comment

vrajmohan Jun 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zachmargolis Jun 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vrajmohan Jun 12, 2024 •

edited

Loading

zachmargolis Jun 14, 2024 •

edited

Loading