
Stability: Support Running Tests in Parallel #8423

Merged 49 commits into main on Jun 30, 2023

Conversation

@sam-heilbron (Contributor) commented Jun 27, 2023

Description

Explore test parallelization to identify other potential sources of flakes

Context

Tests that are subtly dependent on other tests or on environment conditions can lead to test pollution and flakes. To expose this, a while ago we turned on the randomize flags in CI (#7824), which resolved a number of flakes. In this PR, I explore running tests in parallel, in the hope of illustrating which tests share resources when they shouldn't. The original goal was to ensure that tests don't share resources (except for our kube tests) and can therefore be run in parallel. This would establish healthy test patterns in the codebase for other developers to follow, and help us avoid more flakes in the future. Below I outline the conclusion I came to.

Outcome: We decided not to run tests in parallel, but track the effort in https://github.com/solo-io/solo-projects/issues/5147

Remaining Serial Tests

Most tests can be run in parallel. However, some tests use services (consul, vault) whose ports are currently hard-coded to a single value, so those tests cannot run in parallel. As a result, we mark these tests with Ginkgo's Serial decorator so that Ginkgo does not run them in parallel.
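For reference, this is roughly what marking a suite with the Serial decorator looks like (the suite and spec names below are illustrative, not taken from this PR):

package consul_test

import (
	. "github.com/onsi/ginkgo/v2"
)

// Serial tells Ginkgo to run these specs in a single process, after all
// parallel specs have finished, because they bind a fixed, shared port.
var _ = Describe("Consul watcher", Serial, func() {
	It("connects to the hard-coded consul port", func() {
		// ... test body that uses the shared, fixed port ...
	})
})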

Potential Concern

Every now and then I see all tests pass, but Ginkgo still reports a failure. Example:

Step #1 - "run-tests": Step #8 - "run-e2e-tests": �[38;5;10m�[1mRan 250 of 250 Specs in 414.734 seconds�[0m
Step #1 - "run-tests": Step #8 - "run-e2e-tests": �[38;5;10m�[1mSUCCESS!�[0m -- �[38;5;10m�[1m250 Passed�[0m | �[38;5;9m�[1m0 Failed�[0m | �[38;5;11m�[1m0 Pending�[0m | �[38;5;14m�[1m0 Skipped�[0m
Step #1 - "run-tests": Step #8 - "run-e2e-tests":
Step #1 - "run-tests": Step #8 - "run-e2e-tests": 
Step #1 - "run-tests": Step #8 - "run-e2e-tests": Ginkgo ran 1 suite in 8m55.783719337s
Step #1 - "run-tests": Step #8 - "run-e2e-tests": 
Step #1 - "run-tests": Step #8 - "run-e2e-tests": Test Suite Failed
Step #1 - "run-tests": Step #8 - "run-e2e-tests": make: *** [Makefile:163: test] Error 1

I've googled this and searched open Ginkgo issues, but haven't found any leads.

Why Not to Enable Parallel Tests (yet)

This PR updates the test capabilities to enable them to run in parallel. However, for the reasons outlined below, we are opting not to enable this (yet):

  • Sometimes all tests pass, but Ginkgo still reports a failure
  • Build-bot tests run faster with these changes, but the Kube tests are still 10 minutes slower, so optimizing for time at the risk of consistency isn't worth it

Concurrency Values

In the cloudbuild (run-tests.yaml) definition, we use varying levels of concurrency. These values were chosen so that each of the 3 steps completes in roughly the same amount of time. We could increase the concurrency of the e2e tests to speed them up further, but if the unit tests then become the slowest step, build-bot will still wait for the unit tests to finish.

I ultimately chose the following p values:

  • unit tests: p=4
  • e2e tests: p=3

These values were chosen empirically, by running with varying p values until the two groups of tests took roughly the same amount of time.
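For anyone revisiting these numbers later: the -procs value that the ginkgo CLI is invoked with is visible inside a suite via Ginkgo's configuration. A generic sketch (not the actual suite bootstrap from this repo):

package e2e_test

import (
	"fmt"
	"testing"

	. "github.com/onsi/ginkgo/v2"
)

func TestE2e(t *testing.T) {
	// ParallelTotal reflects the -procs value ginkgo was invoked with
	// (e.g. 3 for the e2e step); ParallelProcess is this process's 1-based index.
	suiteConfig, _ := GinkgoConfiguration()
	fmt.Printf("ginkgo process %d of %d\n", suiteConfig.ParallelProcess, suiteConfig.ParallelTotal)

	RunSpecs(t, "E2e Suite")
}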

Performance Improvement

Build-bot now executes in ~12 minutes (latest logs), using varying levels of concurrency so that unit tests and e2e tests take around the same amount of time to complete (~10 minutes).

The last PR that merged to the main branch required 16 minutes for build-bot to complete.

This is roughly a 25% improvement.

Enterprise Support

https://github.com/solo-io/solo-projects/pull/5144

Other Changes

docker.go: We previously had this file in Enterprise, so I moved it to Open Source to follow the pattern of defining shared test resources in open source rather than keeping unique enterprise test logic.

Checklist:

  • I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/main/changelogutils) which references the issue that is resolved.
  • If I updated APIs (our protos) or helm values, I ran make -B install-go-tools generated-code to ensure there will be no code diff
  • I followed guidelines laid out in the Gloo Edge contribution guide
  • I opened a draft PR or added the work in progress label if my PR is not ready for review
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

BOT NOTES:
resolves https://github.com/solo-io/solo-projects/issues/5143

@github-actions bot added the "keep pr updated" label (signals bulldozer to keep pr up to date with base branch) on Jun 27, 2023
@solo-changelog-bot

Issues linked to changelog:
https://github.com/solo-io/solo-projects/issues/5143

@sam-heilbron changed the title from "Stability: Run Tests in Parallel" to "Stability: Prepare Tests to Run in Parallel" on Jun 29, 2023
@sam-heilbron changed the title from "Stability: Prepare Tests to Run in Parallel" to "Stability: Support Running Tests in Parallel" on Jun 29, 2023
@sam-heilbron marked this pull request as ready for review on June 29, 2023 16:44
@bewebi (Contributor) left a comment:

Broadly looks good, but I'm not clear what the value is in merging this PR if we can't use it in CI and don't recommend using it locally
Is the idea that we aren't pursuing either at the moment and want to commit what we have before returning to those pieces later? If so, can we open follow-up issues and link them to this PR?

Before the commit "disable parallel tests in CI" you can see that running in parallel in CI was successful.

This will be a bit hard to find if/when we're ready to enable parallel tests in CI
Perhaps these values can be committed somewhere, or at least written directly in the PR description

There are some tests that use services (consul, vault) whose ports are currently hard-coded to a single value.

Why can't we assign those dynamically to available ports as we do elsewhere in this PR? I guess it's because those services are set up at cluster creation time? Would there be any benefit, presumably in a follow-up PR, to exposing these services on multiple ports so those tests can be run in parallel as well?

Review threads were opened on test/ginkgo/parallel/ports.go and test/services/docker.go. One thread concerns the following snippet:
// It relies on a hard-coded denylist of ports that we know are hard-coded in Gloo
// and will cause tests to fail if they attempt to bind on the same port.
// If you need a more advanced port selection mechanism, use AdvancePortSafe.
func AdvancePortSafeDenylist(p *uint32) uint32 {
Contributor:
It seems like this isn't actually used
What is the justification for keeping this implementation in code?

Contributor Author (@sam-heilbron):

This was the first iteration of my work, and I actually found it easier to write a unit test verifying that the retry logic (to find a new port) was working as expected. It pointed out a bug in the AdvancePortSafe logic, so I found it useful. I agree that this isn't super useful on its own, but it is nice to have a test case for it. One option would be to use this in the unit tests (and define it there directly); that way it's not exported in this file but can still be useful. What do you think about that?

Contributor:
Seems reasonable
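For readers finding this thread later, here is a minimal, self-contained sketch of the denylist idea discussed above (the helper name and port values are illustrative, not the PR's exact implementation):

package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical denylist; the real list of ports hard-coded in Gloo is
// maintained alongside AdvancePortSafeDenylist, presumably in test/ginkgo/parallel/ports.go.
var denylistedPorts = map[uint32]struct{}{
	8080: {},
	8443: {},
}

// advancePortSafeDenylist bumps a shared counter atomically and skips any
// result that lands on a denylisted port.
func advancePortSafeDenylist(p *uint32) uint32 {
	for {
		port := atomic.AddUint32(p, 1)
		if _, denied := denylistedPorts[port]; !denied {
			return port
		}
	}
}

func main() {
	base := uint32(8079)
	fmt.Println(advancePortSafeDenylist(&base)) // skips 8080, returns 8081
}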

@@ -12,7 +12,7 @@ import (
"github.com/solo-io/gloo/test/testutils"
)

-var _ = Describe("Add", func() {
+var _ = Describe("Add", Serial, func() {
Contributor:
Can we add comments explaining the use of Serial everywhere we use it?

This is done inconsistently in the PR (for example here)

Contributor Author (@sam-heilbron):
Good call! I reached out in Slack to ask whether you think Serial is useful, given that we don't turn on parallel tests. If we move forward with it, I'll be sure to add comments documenting each usage of it.

@sam-heilbron (Contributor Author) commented:

Broadly looks good, but I'm not clear what the value is in merging this PR if we can't use it in CI and don't recommend using it locally Is the idea that we aren't pursuing either at the moment and want to commit what we have before returning to those pieces later? If so, can we open follow-up issues and link them to this PR?

Yeah, it's a good point. I think that running non-e2e tests in parallel can be useful locally and may speed up local development. For example, I change a helm value and want to run all the helm unit tests. My hope is that the Serial labels will show which tests cannot be run in parallel, so if Ginkgo fixes the reporting issue with parallel processes, we can turn it on more broadly.

Before the commit disable parallel tests in CI you can see that running in parallel in CI was successful.

This will be a bit hard to find if/when we're ready to enable parallel tests in CI Perhaps these values can be committed somewhere, or at least written directly in the PR description

Yeah, good call. What I was trying to get at was less the values that I used for concurrency, and more the demonstration that the tests passed when run in CI with concurrency enabled. I will update the PR description to call that out more clearly.

There are some tests that use services (consul, vault) whose ports are currently hard-coded to a single value.

Why can't we assign those dynamically to available ports as we do elsewhere in this PR? I guess it's because those services are set up at cluster creation time? Would there be any benefit, presumably in a follow-up PR, to exposing these services on multiple ports so those tests can be run in parallel as well?

We could explore that, but I decided against fixing all the cases of code not handling port rotation properly. I figured that would be something we could explore if/when we need to.
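For context, "assigning dynamically to an available port" in Go test code typically means something like the following generic sketch (not code from this PR): bind to port 0 so the kernel picks a free port, then hand that port to the test.

package main

import (
	"fmt"
	"net"
)

// getFreePort asks the OS for any available TCP port by listening on :0,
// then closes the listener so the test can bind the port itself.
func getFreePort() (uint32, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return uint32(l.Addr().(*net.TCPAddr).Port), nil
}

func main() {
	port, err := getFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println("free port:", port)
}

Note this approach has an inherent race (the port can be taken between closing the listener and the test binding it), which is one reason counter-based helpers like AdvancePort with a denylist are used instead.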

@sam-heilbron requested a review from bewebi on June 30, 2023 18:49
@inFocus7 (Contributor) left a comment:
🚀

@soloio-bulldozer bot merged commit f9ee93d into main on Jun 30, 2023
11 checks passed
@soloio-bulldozer bot deleted the stability/parallel-tests branch on June 30, 2023 19:44