Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use cache instead of map in the plugin promwrapper #14800

Merged
merged 9 commits into from
Nov 4, 2024

Conversation

0xnogo
Copy link
Collaborator

@0xnogo 0xnogo commented Oct 16, 2024

Context:
A memory leak was detected in CCIP nodes, traced to the reportEndTimes map in promwrapper/plugin.go. This map stores timestamps during the Report method but relies on ShouldAcceptFinalizedReport to remove entries. However, in OCR2, if Report returns false on some nodes, ShouldAcceptFinalizedReport may not always run, leaving stale entries that accumulate over time.

Solution:
We replaced the reportEndTimes map with a cache that supports expiration. As the data is only used for logging, this change avoids memory leaks without affecting performance or critical paths. Entries are now automatically cleaned up, preventing unbounded memory growth.

Testing:
This fix was tested in the beta-testnet environment with the following results:
With fix: ~5 days, memory usage stabilized between 217 to 242 MB.
Without fix: ~4 days, memory usage increased from 200 to 347 MB.

Copy link
Contributor

github-actions bot commented Oct 16, 2024

Below is an analysis created by an LLM. Be mindful of hallucinations and verify accuracy.

WF: CI Core#88cc1ec

1. HTTP 503 Service Temporarily Unavailable error during re-run of flakey tests:[Flakey Test Detection]

Source of Error:
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8368228Z 2024/10/16 12:39:47 Error re-running flakey tests: push request failed: status=503, body=<html>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8370065Z <head><title>503 Service Temporarily Unavailable</title></head>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8370997Z <body>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8371712Z <center><h1>503 Service Temporarily Unavailable</h1></center>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8372504Z <hr><center>nginx</center>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8373011Z </body>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8373372Z </html>
Flakey Test Detection	Re-run tests	2024-10-16T12:39:47.8397293Z ##[error]Process completed with exit code 1.

Why: The error occurred because the service used to re-run flakey tests was temporarily unavailable, likely due to server-side issues such as maintenance or overload.

Suggested fix: Implement retry logic with exponential backoff in the test runner script to handle temporary service unavailability gracefully. Alternatively, check the service status or contact the service provider for more details.

@0xnogo 0xnogo changed the base branch from develop to dk/rmn-integ-test October 22, 2024 12:11
@0xnogo 0xnogo changed the base branch from dk/rmn-integ-test to develop October 23, 2024 08:23
Copy link
Contributor

github-actions bot commented Oct 23, 2024

AER Report: CI Core ran successfully ✅

aer_workflow , commit

AER Report: Operator UI CI

aer_workflow , commit , Breaking Changes GQL Check

1. Workflow triggered but failed to complete successfully: [Breaking Changes GQL Check]

Source of Error:
2024-11-04T13:05:41.9889244Z Checking conclusion [failure]
2024-11-04T13:05:41.9890486Z Checking status [completed]
2024-11-04T13:05:41.9892067Z Conclusion is not success, it's [failure].
2024-11-04T13:05:41.9892999Z Propagating failure to upstream job

Why: The triggered workflow did not complete successfully. The status check returned a conclusion of "failure" instead of "success," causing the upstream job to propagate the failure.

Suggested fix: Investigate the logs of the downstream workflow (ID: 11664801213) to identify the specific cause of failure and address the underlying issue. Ensure all dependencies and configurations are correct.

@0xnogo 0xnogo marked this pull request as ready for review October 23, 2024 12:54
@0xnogo 0xnogo requested review from a team as code owners October 23, 2024 12:54
@mateusz-sekara mateusz-sekara self-requested a review October 23, 2024 13:10
mateusz-sekara
mateusz-sekara previously approved these changes Oct 23, 2024
winder
winder previously approved these changes Oct 23, 2024
Copy link
Contributor

@winder winder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one question about the approach.

core/services/ocr2/plugins/promwrapper/plugin.go Outdated Show resolved Hide resolved
core/services/ocr2/plugins/promwrapper/plugin.go Outdated Show resolved Hide resolved
@0xnogo 0xnogo dismissed stale reviews from winder and mateusz-sekara via de3993f October 23, 2024 14:10
@0xnogo 0xnogo requested a review from a team as a code owner October 30, 2024 15:09
@cl-sonarqube-production
Copy link

@0xnogo 0xnogo added this pull request to the merge queue Nov 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 4, 2024
@0xnogo 0xnogo added this pull request to the merge queue Nov 4, 2024
Merged via the queue into develop with commit 7bac85f Nov 4, 2024
135 of 136 checks passed
@0xnogo 0xnogo deleted the ng/fix-memory-promwrapper branch November 4, 2024 14:00
0xnogo added a commit to smartcontractkit/ccip that referenced this pull request Nov 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants