Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Graceful Recovery Baseline Test #1111

Merged
merged 16 commits into from
Oct 11, 2023

Conversation

bjee19
Copy link
Contributor

@bjee19 bjee19 commented Oct 5, 2023

Proposed changes

Problem: We want to have a test that checks for how well NGF recovers from container failures.

Solution: Added a manual test that checks for when the nginx-gateway container fails, when the nginx container fails, when the NGF Pod restarts due to a node restart with cleaning up of resources, and when the NGF Pod restarts due to a node restart without cleaning up of resources.

Closes #951

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 5, 2023
@bjee19
Copy link
Contributor Author

bjee19 commented Oct 5, 2023

Would like specific feedback on the last two test cases.

  • Can't think of a good name for them
  • The goal was to test the behavior of the NGF Pod on node restart and the easiest way I saw to do that was to do it on a Kind cluster and restart the docker container which is running the cluster. However, I did some digging around and it looks like docker restart sends a SIGTERM signal but if the processes don't cleanup in 10ish seconds it will forcibly terminate. This forced termination is what I'm testing when I don't specifically drain and delete a node (3rd test case) and is the test that usually fails. Is it alright that I included both cases?

Copy link
Contributor

@kate-osborn kate-osborn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Very thorough instructions 👍

Can you create a directory under tests for this test? I know there's only one file, but I think it will be good to match the structure of the other testing PRs.

See my review for a few comments and suggestions.

tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/results/graceful-recover-results.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
Copy link
Contributor

@pleshakov pleshakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bjee19 Please see a few major and minor requests and comments.
Big one - let's use SIGKILL because SIGTERM kicks in graceful, not abrupt (failure), termination.

tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/results/graceful-recover-results.md Outdated Show resolved Hide resolved
tests/results/graceful-recover-results.md Outdated Show resolved Hide resolved
tests/graceful-recovery.md Outdated Show resolved Hide resolved
tests/results/graceful-recover-results.md Outdated Show resolved Hide resolved
@bjee19 bjee19 marked this pull request as ready for review October 10, 2023 23:41
@bjee19 bjee19 requested a review from a team as a code owner October 10, 2023 23:41
Copy link
Contributor

@kate-osborn kate-osborn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Contributor

@pleshakov pleshakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good stuff 👍

@bjee19 bjee19 force-pushed the baseline-test-graceful-recovery branch from aa6255b to 95f8157 Compare October 11, 2023 16:48
@bjee19 bjee19 merged commit d501d59 into nginx:main Oct 11, 2023
@ciarams87 ciarams87 added the tests Pull requests that update tests label Oct 17, 2023
@bjee19 bjee19 deleted the baseline-test-graceful-recovery branch November 20, 2023 18:41
miledxz added a commit to miledxz/nginx-gateway-fabric that referenced this pull request Jan 14, 2025
Add Graceful Recovery Baseline Test

Problem: We want to have a test that checks for how well NGF recovers from container failures.

Solution: Added manual tests that checked for when the nginx-gateway container restarted, when the NGINX container restarted, when we drained a node then restarted it, and when we didn't drain a node then restarted it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation tests Pull requests that update tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Baseline Testing: Graceful recovery from restarts
4 participants