Fix race condition when executing parallel tests, passing context down and back #292

phisco · 2023-07-01T12:26:49Z

This is another approach to solve the same problem as #265 and was developed with the help of @maruina.

Supersedes #291 as I think is a better approach w.r.t. the one proposed there.

This patch reduces the number of accesses to the context in the testEnv to only the ones that are strictly necessary and that should be synchronous. So, testEnv still contains a ctx, but is only to be used as a starting point and won't be updated except after the setup and finish phase.

According to the following logic:

If Test is used, the same logic of now applies. So the context is passed to all features being run sequentially and the most updated version is then returned to the caller.
If TestInParallel is used (and parallel tests are enabled), the context is first populated by the BeforeEachTest functions and then a dedicated child of the parent context is provided to each test running in parallel, which is then discarded and only the original parent context is passed to the AfterEachTest functions and back into the environment. This is actually fixing a race condition in the previous implementation of TestInParallel.
If t.Parallel is used to run one or more features either using Test or TestInParallel, the same logic above applies, given that each caller will get its own context back.

The only caveat is that the context is that the Finish phase will only be able to see the context from the Setup phase and not the one from the features themselves.

k8s-ci-robot · 2023-07-01T12:26:58Z

Hi @phisco. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vladimirvivien

@phisco, this looks promising. Thanks for jumping in.
I left couple of comments about returning context from env.Test() method.

Also, I would suggest to focus on the following location where the test is actually ran. I would do the followings

Change the name of the function param from t to new:

t.Run(s.Name, func(newT *testing.T) {

This is to doubly-ensure that the right t variable is captured in the closure. I would also do the same here.

vladimirvivien · 2023-07-02T17:38:05Z

pkg/env/env.go

@@ -313,8 +321,8 @@ func (e *testEnv) TestInParallel(t *testing.T, testFeatures ...types.Feature) {
 //
 // BeforeTest and AfterTest operations are executed before and after
 // the feature is tested respectively.
-func (e *testEnv) Test(t *testing.T, testFeatures ...types.Feature) {
-	e.processTests(t, false, testFeatures...)
+func (e *testEnv) Test(t *testing.T, testFeatures ...types.Feature) context.Context {


This is a breaking change isn't it? It looks additive so, it may not be. But we should check.

Also, the context coming out of Test is never used is it ? Is this change to allow multiple Test calls to chain together ?

You are right, we could actually omit it, I added it because I felt it could be useful for users to inspect the content of the context after calling it, but now that I think about it, it's more an implementation detail of testenv rather than of the Environment

@phisco funny thing is while researching the race condition issue, i ran into the need to inspect the context. However, I would lean toward not leaving the signature as is to avoid breakage.

Maybe a separate PR would be useful to update the signature of the Test and TestInParallel methods. That way, it's effect can be studied (or roll back if needed) with minimal disruption.

@vladimirvivien, I did some checks, adding a returned context to the signatures would not be breaking change, maybe some linter could complain, but it will build without any issue. However, not returning it, would mean there is no way to inspect the content of the ctx after a run as we were saying, which would mean that we would have no way to check the following test cases:

https://github.com/kubernetes-sigs/e2e-framework/pull/292/files#diff-6da391e4e882fa4bb01f054410d115230af7831328b51bcf92939c9bfae905aaL318-R320 already existing

https://github.com/kubernetes-sigs/e2e-framework/pull/292/files#diff-6da391e4e882fa4bb01f054410d115230af7831328b51bcf92939c9bfae905aaR374-R375 which I added, similar to the existing one

https://github.com/kubernetes-sigs/e2e-framework/pull/292/files#diff-6da391e4e882fa4bb01f054410d115230af7831328b51bcf92939c9bfae905aaR424-R425 which I added, similar to the existing one

So, I tried adding two internal testEnv.test/testInParallel functions that are returning the ctx, which are then called directly by their public versions Test/TestInParalle, explicitly discarding the context returned, so that we can use those in tests if we need to check the ctx afterward, but they won't be accessible to users.

I would probably suggest adding the return ctx to the interface, but I totally understand not wanting to change the public API at least as part of this PR, so this could be a good intermediate step while we take a decision around that.

harshanarayana · 2023-07-04T17:07:49Z

Thanks @phisco and @maruina for working through this fix. I will take a look at the before EOD tomorrow.

phisco · 2023-07-04T17:13:47Z

Thanks @harshanarayana, I'm on vacation, so I'm going to implement the changes suggested by @vladimirvivien by end of the week: avoid returning the context from Test and TesInParallel, and rename the two testing.T parameters. If you have any other request, let me know 🙏

phisco · 2023-07-09T09:27:14Z

@vladimirvivien @harshanarayana, I implemented the changes suggested, let me know what you think 🙏

This patch reduces the number of accesses to the context in the testEnv to only the ones that are strictly necessary and that should be synchronous. So, testEnv still contains a ctx, but is only to be used as a starting point and won't be updated except after the setup and finish phase. According to the following logic: - If `Test` is used, the same logic of now applies. So the context is passed to all features being run sequentially and the most updated version is then returned to the caller. - If `TestInParallel` is used (and parallel tests are enabled), the context is first populated by the `BeforeEachTest` functions and then a dedicated child of the parent context is provided to each test running in parallel, which is then discarded and only the original parent context is passed to the `AfterEachTest` functions and back into the environment. This is actually fixing a race condition in the previous implementation of `TestInParallel`. - If `t.Parallel` is used to run one or more features either using `Test` or `TestInParallel`, the same logic above applies, given that each caller will get its own context back. The only caveat is that the context is that the Finish phase will only be able to see the context from the Setup phase and not the one from the features themselves. Co-authored-by: Matteo Ruina <matteo.ruina@datadoghq.com> Signed-off-by: Philippe Scorsolini <p.scorsolini@gmail.com>

Signed-off-by: Philippe Scorsolini <p.scorsolini@gmail.com>

vladimirvivien · 2023-07-09T23:05:30Z

@phisco Apologies, I hadn't look at this all week.
I am Ok with returning the context for Test() and TestInParallel, and thank you for the insightful research.
Additive changes to an API is usually Ok and this one won't break previous code. Plus I think it will be useful to be able to get to the Context which means the result of a previous env.Test() can be chained with another one.

ShwethaKumbla · 2023-07-10T05:42:03Z

/ok-to-test

phisco · 2023-07-10T06:24:08Z

@vladimirvivien, no problem, I agree it's better to return the context, I dropped a57988a 🙏

maruina · 2023-07-11T11:08:37Z

@vladimirvivien what do you think it's missing for merging this PR?

harshanarayana · 2023-07-15T14:54:38Z

@phisco @maruina @vladimirvivien This is a wonderful change. I don't really have any change request for the changes pushed into this PR.

Thanks for adding some great tests to avoid future regressions of race conditions @phisco

vladimirvivien · 2023-07-23T12:24:03Z

Apologies for the delay (busy).
I am good with it.

/lgtm
/approve

k8s-ci-robot · 2023-07-23T12:24:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: phisco, vladimirvivien

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vladimirvivien]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 1, 2023

k8s-ci-robot requested review from cpanato and harshanarayana July 1, 2023 12:26

k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 1, 2023

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 1, 2023

phisco mentioned this pull request Jul 1, 2023

Fix race condition when executing parallel tests, keeping context #291

Closed

maruina mentioned this pull request Jul 1, 2023

Fix race condition when executing parallel tests #265

Closed

vladimirvivien reviewed Jul 2, 2023

View reviewed changes

phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from e12794e to bbd1faf Compare July 9, 2023 09:26

phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from bbd1faf to da401a8 Compare July 9, 2023 09:28

phisco and others added 2 commits July 9, 2023 11:28

chore: rename internal testing.T

25cc833

Signed-off-by: Philippe Scorsolini <p.scorsolini@gmail.com>

phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from da401a8 to a57988a Compare July 9, 2023 09:28

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2023

phisco force-pushed the fix/tParallel-passing-almost-everywhere branch from a57988a to 25cc833 Compare July 10, 2023 06:23

k8s-ci-robot assigned vladimirvivien Jul 23, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 23, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 23, 2023

k8s-ci-robot merged commit 5fa0a64 into kubernetes-sigs:main Jul 23, 2023

phisco mentioned this pull request Jul 26, 2023

Race when using Before/AfterEachTest and t.Parallel() #258

Closed

This was referenced Aug 14, 2023

Release Ticket: v0.3.0 of e2e-framework #307

Closed

Data races around testing.T and test context #216

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race condition when executing parallel tests, passing context down and back #292

Fix race condition when executing parallel tests, passing context down and back #292

phisco commented Jul 1, 2023

k8s-ci-robot commented Jul 1, 2023

vladimirvivien left a comment

vladimirvivien Jul 2, 2023

vladimirvivien Jul 2, 2023

phisco Jul 2, 2023

vladimirvivien Jul 3, 2023

phisco Jul 9, 2023

harshanarayana commented Jul 4, 2023

phisco commented Jul 4, 2023

phisco commented Jul 9, 2023

vladimirvivien commented Jul 9, 2023

ShwethaKumbla commented Jul 10, 2023

phisco commented Jul 10, 2023

maruina commented Jul 11, 2023

harshanarayana commented Jul 15, 2023

vladimirvivien commented Jul 23, 2023

k8s-ci-robot commented Jul 23, 2023

Fix race condition when executing parallel tests, passing context down and back #292

Fix race condition when executing parallel tests, passing context down and back #292

Conversation

phisco commented Jul 1, 2023

k8s-ci-robot commented Jul 1, 2023

vladimirvivien left a comment

Choose a reason for hiding this comment

vladimirvivien Jul 2, 2023

Choose a reason for hiding this comment

vladimirvivien Jul 2, 2023

Choose a reason for hiding this comment

phisco Jul 2, 2023

Choose a reason for hiding this comment

vladimirvivien Jul 3, 2023

Choose a reason for hiding this comment

phisco Jul 9, 2023

Choose a reason for hiding this comment

harshanarayana commented Jul 4, 2023

phisco commented Jul 4, 2023

phisco commented Jul 9, 2023

vladimirvivien commented Jul 9, 2023

ShwethaKumbla commented Jul 10, 2023

phisco commented Jul 10, 2023

maruina commented Jul 11, 2023

harshanarayana commented Jul 15, 2023

vladimirvivien commented Jul 23, 2023

k8s-ci-robot commented Jul 23, 2023