BCFR-1087 Detect Finality Violation on Backfill #15825
Conversation
AER Report: CI Core ran successfully ✅
AER Report: Operator UI CI — Breaking Changes GQL Check
1. Workflow-triggered downstream job failed: breaking-changes-gql-check
Why: the error indicates that the downstream workflow triggered by this job did not complete.
Suggested fix: investigate the logs of the downstream workflow at the provided URL to identify the specific cause of the failure, then address the issue in the downstream workflow so it completes successfully.
    if err != nil {
        lp.lggr.Errorw("Failed to poll and save logs, retrying later", "err", err)
    }
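The "more observable" approach floated below could look roughly like this. A minimal sketch, assuming a stand-in: `pollAndSaveFailures` here is a stdlib atomic counter used in place of a real `prometheus.Counter`, and `pollAndSave`/`run` are hypothetical simplifications of LogPoller's poll loop, not the actual Chainlink code.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// pollAndSaveFailures is a hypothetical failure counter; in the actual
// suggestion this would be a prometheus.Counter registered per chain.
var pollAndSaveFailures atomic.Int64

// pollAndSave is a stand-in for LogPoller's poll-and-save step; the real
// method talks to an RPC node and writes logs to the DB.
func pollAndSave(fail bool) error {
	if fail {
		return errors.New("rpc error")
	}
	return nil
}

// run mirrors the error branch from the diff above: log the failure and,
// per the suggestion, bump a counter alongside the log line.
func run(fail bool) {
	if err := pollAndSave(fail); err != nil {
		pollAndSaveFailures.Add(1)
		fmt.Println("Failed to poll and save logs, retrying later:", err)
	}
}

func main() {
	run(true)
	run(false)
	fmt.Println("failures:", pollAndSaveFailures.Load()) // failures: 1
}
```

A counter like this is cheap to maintain, but as the discussion notes, its usefulness on dashboards is debatable because individual failures are expected.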
Food for thought: should we track these errors using a more "observable" approach, e.g. by incrementing a Prometheus counter whenever pollAndSave or backup fails?
I'm not sure it's really helpful to track this as a metric. Failures of pollAndSave and backup are normal (e.g. the RPC failed for one of the requests), so we wouldn't use it on overview dashboards. We might want to introduce something higher-level, like taking the latest block successfully processed by LogPoller and comparing it to the one observed by HT or another component; a high delta between these two values would signal that pollAndSave is failing too often.
In any case, it seems to be out of scope for this PR.
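The higher-level signal described above can be sketched in a few lines. This is a hedged illustration only: `laggingTooFar`, its parameters, and the threshold are all hypothetical names, and a real implementation would read the two block heights from LogPoller and the head tracker and export the delta as a gauge.

```go
package main

import "fmt"

// laggingTooFar compares the latest block processed by LogPoller with the
// latest block observed by another component (e.g. the head tracker).
// A delta above maxDelta suggests pollAndSave has been failing repeatedly.
func laggingTooFar(lpLatest, htLatest, maxDelta int64) bool {
	return htLatest-lpLatest > maxDelta
}

func main() {
	fmt.Println(laggingTooFar(100, 105, 10)) // small lag, within threshold: false
	fmt.Println(laggingTooFar(100, 150, 10)) // large lag, alert-worthy: true
}
```

Unlike a raw failure counter, this tolerates transient errors by design: the signal only fires when failures persist long enough for the poller to fall measurably behind.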
Yeah, I think the "failed to poll and save logs due to finality violation" case is worth tracking, but as I understand it we're already tracking that indirectly, because Healthy() will start returning false as soon as it happens. The other errors are fairly normal; they'll happen under any sort of temporary network instability, which presumably is also tracked elsewhere.
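The indirect tracking mentioned above could work along these lines. A minimal sketch under stated assumptions: the `logPoller` struct, `recordFinalityViolation`, and a `Healthy() error` signature are simplified stand-ins for the real service's health-report machinery, not the actual Chainlink API.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// logPoller sketches a service that latches an unhealthy state once a
// finality violation is detected during backfill.
type logPoller struct {
	finalityViolated atomic.Bool
}

// recordFinalityViolation is called when backfill observes a finalized
// block being re-orged; the flag stays set until the node is restarted.
func (lp *logPoller) recordFinalityViolation() {
	lp.finalityViolated.Store(true)
}

// Healthy returns a non-nil error after a violation, so any health-check
// endpoint or alerting built on top of it picks the condition up.
func (lp *logPoller) Healthy() error {
	if lp.finalityViolated.Load() {
		return errors.New("finality violated")
	}
	return nil
}

func main() {
	lp := &logPoller{}
	fmt.Println(lp.Healthy()) // <nil>
	lp.recordFinalityViolation()
	fmt.Println(lp.Healthy()) // finality violated
}
```

Because the flag latches rather than resets, this surfaces the one failure mode that is never "normal", while routine RPC hiccups leave the health status untouched.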
Improves detection of finality violations during the backfill operation. DD
Test namespace load-ccip-d6901