
Stability: Support Running Tests in Parallel #8423

Merged 49 commits into main on Jun 30, 2023

Conversation

@sam-heilbron (Contributor) commented Jun 27, 2023

Description

Explore test parallelization to identify other potential sources of flakes

Context

Tests that are subtly dependent on other tests or on environment conditions can lead to test pollution and flakes. To expose this, a while ago we turned on the randomize flags in CI (#7824), which resolved a number of flakes. In this PR, I explore running tests in parallel, in the hope of illustrating which tests share resources when they shouldn't. The original goal was to ensure that tests don't share resources (except for our kube tests) and can therefore be run in parallel. This would establish healthy test patterns in the codebase for other developers to follow, and help us avoid more flakes in the future. Below I outline the conclusion I came to.

Outcome: We decided not to run tests in parallel, but track the effort in https://github.com/solo-io/solo-projects/issues/5147

Remaining Serial Tests

Most tests can be run in parallel. However, some tests use services (consul, vault) whose ports are currently hard-coded to a single value, so those tests cannot run in parallel. As a result, we mark these tests with Ginkgo's Serial decorator so that Ginkgo does not run them in parallel.
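For reference, this is roughly what marking a suite with the Serial decorator looks like (the suite and spec names below are illustrative, not taken from this PR):

package consul_test

import (
	. "github.com/onsi/ginkgo/v2"
)

// Serial tells Ginkgo to run these specs in a single process, after all
// parallel specs have finished, because they bind a fixed, shared port.
var _ = Describe("Consul watcher", Serial, func() {
	It("connects to the hard-coded consul port", func() {
		// ... test body that uses the shared, fixed port ...
	})
})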

Potential Concern

Every now and then I see all tests pass, but Ginkgo still reports a failure. Example:

Step #1 - "run-tests": Step #8 - "run-e2e-tests": �[38;5;10m�[1mRan 250 of 250 Specs in 414.734 seconds�[0m
Step #1 - "run-tests": Step #8 - "run-e2e-tests": �[38;5;10m�[1mSUCCESS!�[0m -- �[38;5;10m�[1m250 Passed�[0m | �[38;5;9m�[1m0 Failed�[0m | �[38;5;11m�[1m0 Pending�[0m | �[38;5;14m�[1m0 Skipped�[0m
Step #1 - "run-tests": Step #8 - "run-e2e-tests":
Step #1 - "run-tests": Step #8 - "run-e2e-tests": 
Step #1 - "run-tests": Step #8 - "run-e2e-tests": Ginkgo ran 1 suite in 8m55.783719337s
Step #1 - "run-tests": Step #8 - "run-e2e-tests": 
Step #1 - "run-tests": Step #8 - "run-e2e-tests": Test Suite Failed
Step #1 - "run-tests": Step #8 - "run-e2e-tests": make: *** [Makefile:163: test] Error 1

I've googled this and searched open Ginkgo issues, but haven't found any leads.

Why Not to Enable Parallel Tests (yet)

This PR updates the test capabilities to enable them to run in parallel. However, for the reasons outlined below, we are opting not to enable this (yet):

  • Sometimes all tests pass, but Ginkgo still reports a failure
  • Build-bot tests run faster with these changes, but the Kube tests are still 10 minutes slower, so optimizing for time at the risk of consistency isn't worth it

Concurrency Values

In the cloudbuild (run-tests.yaml) definition, we use varying levels of concurrency. These values were chosen so that each of the 3 steps completes in roughly the same amount of time. We could increase the concurrency of the e2e tests to speed them up further, but if the unit tests then become the slowest step, build-bot will still wait for the unit tests to finish.

I ultimately chose the following p values:

  • unit tests: p=4
  • e2e tests: p=3

These values were chosen empirically, by running with varying p values until the two groups of tests took roughly the same amount of time.
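For anyone revisiting these numbers later: the -procs value that the ginkgo CLI is invoked with is visible inside a suite via Ginkgo's configuration. A generic sketch (not the actual suite bootstrap from this repo):

package e2e_test

import (
	"fmt"
	"testing"

	. "github.com/onsi/ginkgo/v2"
)

func TestE2e(t *testing.T) {
	// ParallelTotal reflects the -procs value ginkgo was invoked with
	// (e.g. 3 for the e2e step); ParallelProcess is this process's 1-based index.
	suiteConfig, _ := GinkgoConfiguration()
	fmt.Printf("ginkgo process %d of %d\n", suiteConfig.ParallelProcess, suiteConfig.ParallelTotal)

	RunSpecs(t, "E2e Suite")
}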

Performance Improvement

Build-bot now executes in ~12 minutes (latest logs), using varying levels of concurrency so that unit tests and e2e tests take around the same amount of time to complete (~10 minutes).

The last PR that merged to the main branch required 16 minutes for build-bot to complete.

This is roughly a 25% improvement.

Enterprise Support

https://github.com/solo-io/solo-projects/pull/5144

Other Changes

docker.go: We previously had this file in Enterprise, so I moved it to Open Source to follow the pattern of defining shared test resources in open source rather than keeping unique enterprise test logic.

Checklist:

  • I included a concise, user-facing changelog (for details, see https://github.com/solo-io/go-utils/tree/main/changelogutils) which references the issue that is resolved.
  • If I updated APIs (our protos) or helm values, I ran make -B install-go-tools generated-code to ensure there will be no code diff
  • I followed guidelines laid out in the Gloo Edge contribution guide
  • I opened a draft PR or added the work in progress label if my PR is not ready for review
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works

BOT NOTES:
resolves https://github.com/solo-io/solo-projects/issues/5143

@github-actions bot added the "keep pr updated" label (signals bulldozer to keep pr up to date with base branch) on Jun 27, 2023
@solo-changelog-bot

Issues linked to changelog:
https://github.com/solo-io/solo-projects/issues/5143

@sam-heilbron changed the title from "Stability: Run Tests in Parallel" to "Stability: Prepare Tests to Run in Parallel" on Jun 29, 2023
@sam-heilbron changed the title from "Stability: Prepare Tests to Run in Parallel" to "Stability: Support Running Tests in Parallel" on Jun 29, 2023
@sam-heilbron marked this pull request as ready for review on June 29, 2023 16:44
@bewebi (Contributor) left a comment:

Broadly looks good, but I'm not clear what the value is in merging this PR if we can't use it in CI and don't recommend using it locally
Is the idea that we aren't pursuing either at the moment and want to commit what we have before returning to those pieces later? If so, can we open follow-up issues and link them to this PR?

Before the commit "disable parallel tests in CI" you can see that running in parallel in CI was successful.

This will be a bit hard to find if/when we're ready to enable parallel tests in CI
Perhaps these values can be committed somewhere, or at least written directly in the PR description

There are some tests that use services (consul, vault) whose ports are currently hard-coded to a single value.

Why can't we assign those dynamically to available ports as we do elsewhere in this PR? I guess it's because those services are set up at cluster creation time? Would there be any benefit, presumably in a follow-up PR, to exposing these services on multiple ports so those tests can be run in parallel as well?

Review threads were opened on test/ginkgo/parallel/ports.go and test/services/docker.go. One thread concerns the following snippet:
// It relies on a hard-coded denylist of ports that we know are hard-coded in Gloo
// and will cause tests to fail if they attempt to bind on the same port.
// If you need a more advanced port selection mechanism, use AdvancePortSafe.
func AdvancePortSafeDenylist(p *uint32) uint32 {
Contributor:
It seems like this isn't actually used
What is the justification for keeping this implementation in code?

Contributor Author (@sam-heilbron):

This was the first iteration of my work, and I actually found it easier to write a unit test verifying that the retry logic (to find a new port) was working as expected. It pointed out a bug in the AdvancePortSafe logic, so I found it useful. I agree that this isn't super useful on its own, but it is nice to have a test case for it. One option would be to use this in the unit tests (and define it there directly); that way it's not exported in this file but can still be useful. What do you think about that?

Contributor:
Seems reasonable
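For readers finding this thread later, here is a minimal, self-contained sketch of the denylist idea discussed above (the helper name and port values are illustrative, not the PR's exact implementation):

package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical denylist; the real list of ports hard-coded in Gloo is
// maintained alongside AdvancePortSafeDenylist, presumably in test/ginkgo/parallel/ports.go.
var denylistedPorts = map[uint32]struct{}{
	8080: {},
	8443: {},
}

// advancePortSafeDenylist bumps a shared counter atomically and skips any
// result that lands on a denylisted port.
func advancePortSafeDenylist(p *uint32) uint32 {
	for {
		port := atomic.AddUint32(p, 1)
		if _, denied := denylistedPorts[port]; !denied {
			return port
		}
	}
}

func main() {
	base := uint32(8079)
	fmt.Println(advancePortSafeDenylist(&base)) // skips 8080, returns 8081
}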

@@ -12,7 +12,7 @@ import (
"github.com/solo-io/gloo/test/testutils"
)

-var _ = Describe("Add", func() {
+var _ = Describe("Add", Serial, func() {
Contributor:
Can we add comments explaining the use of Serial everywhere we use it?

This is done inconsistently in the PR (for example here)

Contributor Author (@sam-heilbron):
Good call! I reached out in Slack to ask whether you think Serial is useful, given that we don't turn on parallel tests. If we move forward with it, I'll be sure to add comments documenting each usage of it.

@sam-heilbron (Contributor Author) commented:

Broadly looks good, but I'm not clear what the value is in merging this PR if we can't use it in CI and don't recommend using it locally Is the idea that we aren't pursuing either at the moment and want to commit what we have before returning to those pieces later? If so, can we open follow-up issues and link them to this PR?

Yeah, it's a good point. I think that running non-e2e tests in parallel can be useful locally and may speed up local development. For example, I change a helm value and want to run all the helm unit tests. My hope is that the Serial labels will show which tests cannot be run in parallel, so if Ginkgo fixes the reporting issue with parallel processes, we can turn it on more broadly.

Before the commit disable parallel tests in CI you can see that running in parallel in CI was successful.

This will be a bit hard to find if/when we're ready to enable parallel tests in CI Perhaps these values can be committed somewhere, or at least written directly in the PR description

Yeah, good call. What I was trying to get at was less the values that I used for concurrency, and more the demonstration that the tests passed when run in CI with concurrency enabled. I will update the PR description to call that out more clearly.

There are some tests that use services (consul, vault) whose ports are currently hard-coded to a single value.

Why can't we assign those dynamically to available ports as we do elsewhere in this PR? I guess it's because those services are set up at cluster creation time? Would there be any benefit, presumably in a follow-up PR, to exposing these services on multiple ports so those tests can be run in parallel as well?

We could explore that, but I decided against fixing all the cases of code not handling port rotation properly. I figured that would be something we could explore if/when we need to.
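For context, "assigning dynamically to an available port" in Go test code typically means something like the following generic sketch (not code from this PR): bind to port 0 so the kernel picks a free port, then hand that port to the test.

package main

import (
	"fmt"
	"net"
)

// getFreePort asks the OS for any available TCP port by listening on :0,
// then closes the listener so the test can bind the port itself.
func getFreePort() (uint32, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return uint32(l.Addr().(*net.TCPAddr).Port), nil
}

func main() {
	port, err := getFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println("free port:", port)
}

Note this approach has an inherent race (the port can be taken between closing the listener and the test binding it), which is one reason counter-based helpers like AdvancePort with a denylist are used instead.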

@sam-heilbron requested a review from bewebi on June 30, 2023 18:49
@inFocus7 (Contributor) left a comment:
🚀

@soloio-bulldozer bot merged commit f9ee93d into main on Jun 30, 2023
11 checks passed
@soloio-bulldozer bot deleted the stability/parallel-tests branch on June 30, 2023 19:44