Stability: Support Running Tests in Parallel #8423
Conversation
hitting docker daemon issues after ~15 successful tests
…loo into stability/parallel-tests
Issues linked to changelog:
This reverts commit dad4783.
Broadly looks good, but I'm not clear what the value is in merging this PR if we can't use it in CI and don't recommend using it locally
Is the idea that we aren't pursuing either at the moment and want to commit what we have before returning to those pieces later? If so, can we open follow-up issues and link them to this PR?
Before the commit that disables parallel tests in CI, you can see that running in parallel in CI was successful.
This will be a bit hard to find if/when we're ready to enable parallel tests in CI
Perhaps these values can be committed somewhere, or at least written directly in the PR description
There are some tests that use services (consul, vault) whose ports are currently hard-coded to a single value.
Why can't we assign those dynamically to available ports as we do elsewhere in this PR? I guess because those services are set up at cluster creation time? Would there be any benefit, presumably in a follow-up PR, to exposing these services on multiple ports so those tests can run in parallel as well?
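For reference, the dynamic assignment discussed here is typically done by binding to port 0 and letting the kernel choose. A minimal Go sketch (`getFreePort` is a hypothetical helper, not Gloo's actual implementation):

```go
package main

import (
	"fmt"
	"net"
)

// getFreePort asks the kernel for an available TCP port by binding to port 0.
// The listener is closed immediately, so the port is free for the test to use
// (with a small race window, which is usually acceptable in test setups).
func getFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := getFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println(port > 0 && port <= 65535)
}
```

Hard-coded service ports (consul, vault) can't use this trick when the services are provisioned once at cluster creation time, which is why those tests stay serial.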
test/ginkgo/parallel/ports.go
// It relies on a hard-coded denylist of ports that we know are hard-coded in Gloo
// and will cause tests to fail if they attempt to bind on the same port.
// If you need a more advanced port selection mechanism, use AdvancePortSafe
func AdvancePortSafeDenylist(p *uint32) uint32 {
It seems like this isn't actually used
What is the justification for keeping this implementation in code?
This was the first iteration of my work, and I actually found it easier to write a unit test verifying that the retry logic (to find a new port) was working as expected. It pointed out a bug in the AdvancePortSafe logic, so I found this useful. I agree that this isn't super useful on its own, but it is nice to have a test case for it. One option would be to use this only in the unit tests (and define it there directly). That way it's not exported in this file, but can still be useful. What do you think about that?
Seems reasonable
@@ -12,7 +12,7 @@ import (
 	"github.com/solo-io/gloo/test/testutils"
 )

-var _ = Describe("Add", func() {
+var _ = Describe("Add", Serial, func() {
Can we add comments explaining the use of Serial everywhere we use it? This is done inconsistently in the PR (for example here).
Good call! I reached out in Slack to ask whether you think the Serial decorator is useful, given that we don't turn on parallel tests. If we move forward with it, I'll be sure to add comments documenting each usage.
Yeah it's a good point. I think that running non-e2e tests in parallel can be useful locally, and may speed up local development. For example, I change a helm value and want to run all the helm unit tests. My hope was that the Serial labels will show which tests cannot be run in parallel, so that if Ginkgo fixes the reporting issue with parallel processes, we can turn it on more broadly.
Yeah good call. What I was trying to get at was less the specific concurrency values I used, and more the demonstration that the suite passed when run in CI with concurrency. I will update the PR description to call that out better.
We could explore that, but I decided against fixing all the cases of code not handling port rotation properly. I figured that would be something we could explore if/when we needed to.
Co-authored-by: Bernie Birnbaum <bewebi@earthlink.net>
…loo into stability/parallel-tests
🚀
Description
Explore test parallelization to identify other potential sources of flakes
Context
Tests that are subtly dependent on other tests or environment conditions can lead to test pollution and flakes. To expose this, a while ago we turned on randomize flags in CI (#7824), which solved a bunch of flakes. In this PR, I explore running tests in parallel. The hope is to illustrate which tests may be sharing resources when they shouldn't. The original goal was to ensure that tests don't share resources and, except for our kube tests, can be run in parallel. This would establish healthy test patterns in the codebase that other developers would follow, and we would avoid more flakes in the future. Below I outline the conclusion I came to.
Outcome: We decided not to run tests in parallel, but track the effort in https://github.com/solo-io/solo-projects/issues/5147
Remaining Serial Tests
Most tests were able to be run in parallel. However, some tests use services (consul, vault) whose ports are currently hard-coded to a single value, which means those tests cannot run in parallel. As a result, we mark these tests with the Serial decorator to notify Ginkgo that they should not be run in parallel.
Potential Concern
Every now and then I see all tests pass, but Ginkgo still reports a failure. Example:
I've googled this and searched open Ginkgo issues and haven't found any leads around this.
Why Not to Enable Parallel Tests (yet)
This PR updates the test capabilities to enable tests to run in parallel. However, for the reasons outlined below, we opt not to enable this (yet):
Concurrency Values
In the cloudbuild (run-tests.yaml) definition, we use varying levels of concurrency. These values were chosen so that each of the 3 steps completes in roughly the same amount of time. We can increase the concurrency for the e2e tests to speed them up further, but if the unit tests are the slowest step, build-bot will still wait for them to finish.
I ultimately chose the following p values:
unit tests: p=4
e2e tests: p=3
These were just based on running with varying values to get these two groups of tests to take the same amount of time.
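As a sketch, the cloudbuild steps described above correspond roughly to invocations like the following. The package paths are illustrative; the actual values live in run-tests.yaml:

```shell
# Run unit tests with 4 parallel Ginkgo processes (illustrative paths)
ginkgo --procs=4 ./projects/...

# Run e2e tests with 3 parallel processes; Serial-decorated specs
# are still executed alone by Ginkgo even when --procs > 1
ginkgo --procs=3 ./test/e2e/...
```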
Performance Improvement
Build-bot now executes in ~12 minutes (latest logs), using varying levels of concurrency so that unit tests and e2e tests take around the same time to complete (~10 minutes).
The last PR that merged to the main branch required 16 minutes for build-bot to complete.
This is a 25% gain.
Enterprise Support
https://github.com/solo-io/solo-projects/pull/5144
Other Changes
docker.go: We had this file in Enterprise, so I moved it to Open Source to follow the pattern of defining shared resources in open source, instead of having unique enterprise test logic.

Checklist:
Run make -B install-go-tools generated-code to ensure there will be no code diff.

BOT NOTES:
resolves https://github.com/solo-io/solo-projects/issues/5143