Move cert loading failure assertion on client public key write #2655

albertteoh · 2020-11-26T12:26:07Z

Signed-off-by: albertteoh albert.teoh@logz.io

Which problem is this PR solving?

Fixes Flaky test pkg/config/tlscfg/TestReload (cert_watcher_test.go) #2644

Short description of the changes

Local testing revealed that "Failed to load certificate" is logged every time; however some tests in CI are not logging this failure state, perhaps due to disk I/O variability.
Given the failure signal is unreliable, this PR removes the failure assertion and only performs an assertion identical to the waitUntil condition which should make this test more reliable.

codecov · 2020-11-26T12:39:22Z

Codecov Report

Merging #2655 (8758552) into master (adfdd6a) will increase coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2655      +/-   ##
==========================================
+ Coverage   95.09%   95.10%   +0.01%     
==========================================
  Files         213      213              
  Lines        9491     9491              
==========================================
+ Hits         9025     9026       +1     
  Misses        388      388              
+ Partials       78       77       -1

Impacted Files	Coverage Δ
plugin/storage/integration/integration.go	`77.34% <0.00%> (-0.56%)`	⬇️
cmd/query/app/server.go	`90.16% <0.00%> (+1.63%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update adfdd6a...8758552. Read the comment docs.

objectiser

According to the comment on

jaeger/pkg/config/tlscfg/cert_watcher_test.go

Line 97 in 1a24f20

    
           // If mismatched keys are present in the cert, the "Failed to load certificate" error will be logged instead.

the "Failed to load certificate" shouldn't be generated anyway - but given the desired state (e.g. "Loaded modified certificate") is asserted, looks ok to remove this check.

yurishkuro · 2020-11-26T16:56:27Z

pkg/config/tlscfg/cert_watcher_test.go

@@ -99,15 +99,7 @@ func TestReload(t *testing.T) {
 			FilterField(zap.String("certificate", keyFile.Name())).Len() > 0
 	}, 100, time.Millisecond*200)

-	// Logged when the cert is modified with the client's public key due to
-	// a mismatch with the existing server private key.


It sounds to me that if this doesn't always happen then we have a flaky test conditions that would result in some code branches not executed reliably. The assertion was meant to ensure that they are always executed. If we remove the assertion, it could make the test results more stable, but the code coverage less stable, which is just as bad.

For the record, the usual reason for flaky test behavior is heavier CPU preemption in the shared CI environments. I see that the wait above is only for 200ms, it's quite easy for the run to hiccup for longer than that, I usually make sure the wait is at least a second. The ideal solution is to make the test conditions deterministic, i.e. if we certain that the condition will happen, then we can safely use longer waits, like 10sec, since it's unlikely the CI run will be CPU starved for that long while at the same time the long waits won't make the test run longer under normal circumstances (especially locally).

If the logged error is expected, then the comment I referenced is incorrect. However there is another test that covers the mismatch case that does check for that logged error.

However agree that increasing the wait may have been a good first step.

The wait above is 200ms x 100 iterations = 20sec, which is more than enough time I think.

@objectiser is right that there is existing line coverage for the public/private key mismatch case under the TestReload_err_cert_update test.

The problem as it stands is that we're making two file writes in succession then asserting on both states. If we do, however, want to have a stronger assertion on the behaviour of cert reloads, why don't we assert each state after each file write?

That is, move the failure assertion up, after writing the client's public key to the certFile; which should give enough time for the certWatcher to pick up this change in state.

That is:

// Write the client's public key certData, err = ioutil.ReadFile(clientCert) require.NoError(t, err) err = syncWrite(certFile.Name(), certData, 0644) require.NoError(t, err) // Wait for the cert watcher to log an error due to mismatching private/public keys waitUntil(func() bool { // Logged when the cert is modified with the client's public key due to // a mismatch with the existing server private key. return logObserver.FilterMessage("Failed to load certificate"). FilterField(zap.String("certificate", certFile.Name())).Len() > 0 }, 100, time.Millisecond*200) assert.True(t, logObserver. FilterMessage("Failed to load certificate"). FilterField(zap.String("certificate", certFile.Name())).Len() > 0, "Failed to find wanted logs. All logs: "+fmt.Sprint(logObserver.All())) // Write the client's private key keyData, err = ioutil.ReadFile(clientKey) require.NoError(t, err) err = syncWrite(keyFile.Name(), keyData, 0644) require.NoError(t, err) ... // wait for "Loaded modified certificate" as below

What do you think @yurishkuro @objectiser?

Although the main state this test is interested in is the "loaded modified certificate", so checking for the failure is not strictly necessary, doing it this way may help users examining the test logs that find the error, as they would see the test was anticipating it. So +1 from me.

Thanks for the feedback @objectiser & @yurishkuro, I've moved the failure assertion to after writing the client's public key as suggested.

Pull request has been modified.

Signed-off-by: albertteoh <albert.teoh@logz.io>

objectiser · 2020-11-30T10:13:01Z

pkg/config/tlscfg/cert_watcher_test.go

+	assert.True(t, logObserver.
+		FilterMessage("Failed to load certificate").
+		FilterField(zap.String("certificate", certFile.Name())).Len() > 0,
+		"Failed to find wanted logs. All logs: "+fmt.Sprint(logObserver.All()))


One minor point - could you differentiate this log message from the one at the end of this function - e.g. "Unable to locate 'Failed to load certificate' error in log"?

Yup, good pickup, @objectiser.

Signed-off-by: albertteoh <albert.teoh@logz.io>

Pull request has been modified.

yurishkuro · 2020-11-30T18:36:33Z

pkg/config/tlscfg/cert_watcher_test.go

+		// Logged when the cert is reloaded with mismatching client public key and existing server private key.
+		return logObserver.FilterMessage("Failed to load certificate").
+			FilterField(zap.String("certificate", certFile.Name())).Len() > 0
+	}, 100, time.Millisecond*200)


nit: instead of 100 * 200ms, I would do 2000 * 10ms - minimize the guaranteed delay, keep total delay the same (absolute # is not particularly important, 20s is probably overkill, but 10s would be fine, enough to outlast any scheduling hiccup).

yurishkuro · 2020-11-30T18:38:08Z

pkg/config/tlscfg/cert_watcher_test.go

+	assert.True(t, logObserver.
+		FilterMessage("Failed to load certificate").
+		FilterField(zap.String("certificate", certFile.Name())).Len() > 0,
+		"Unable to locate 'Failed to load certificate' in log. All logs: "+fmt.Sprint(logObserver.All()))


nit: assert uses Printf syntax, you could do

"Unable to locate 'Failed to load certificate' in log. All logs: %v", logObserver.All())

Signed-off-by: albertteoh <albert.teoh@logz.io>

albertteoh requested a review from a team as a code owner November 26, 2020 12:26

albertteoh requested a review from objectiser November 26, 2020 12:26

objectiser previously approved these changes Nov 26, 2020

View reviewed changes

yurishkuro reviewed Nov 26, 2020

View reviewed changes

albertteoh force-pushed the fix-flaky-testreload-take-3 branch from 1a24f20 to 21abf3d Compare November 27, 2020 04:23

albertteoh added 2 commits November 30, 2020 16:15

Remove assertion on unreliable cert loading failure signal

b479612

Signed-off-by: albertteoh <albert.teoh@logz.io>

Assert after each key write to cert

719c74b

Signed-off-by: albertteoh <albert.teoh@logz.io>

albertteoh force-pushed the fix-flaky-testreload-take-3 branch from 5a465f8 to 719c74b Compare November 30, 2020 05:15

albertteoh changed the title ~~Remove assertion on unreliable cert loading failure signal~~ Move cert loading failure assertion on client public key write Nov 30, 2020

objectiser reviewed Nov 30, 2020

View reviewed changes

Differentiate test failure messages

58bfc88

Signed-off-by: albertteoh <albert.teoh@logz.io>

albertteoh force-pushed the fix-flaky-testreload-take-3 branch from fdc4e3e to 58bfc88 Compare November 30, 2020 10:24

objectiser previously approved these changes Nov 30, 2020

View reviewed changes

Remove 'error' from test failure message

b114ab2

Signed-off-by: albertteoh <albert.teoh@logz.io>

albertteoh force-pushed the fix-flaky-testreload-take-3 branch from 796731a to b114ab2 Compare November 30, 2020 10:39

Merge branch 'master' into fix-flaky-testreload-take-3

8758552

objectiser approved these changes Nov 30, 2020

View reviewed changes

objectiser merged commit 221ce20 into jaegertracing:master Nov 30, 2020

yurishkuro reviewed Nov 30, 2020

View reviewed changes

albertteoh pushed a commit to albertteoh/jaeger that referenced this pull request Dec 1, 2020

Address review comments from jaegertracing#2655

d33beaf

Signed-off-by: albertteoh <albert.teoh@logz.io>

albertteoh mentioned this pull request Dec 1, 2020

Address review comments from #2655 #2673

Merged

yurishkuro pushed a commit that referenced this pull request Dec 1, 2020

Address review comments from #2655 (#2673)

761facc

Signed-off-by: albertteoh <albert.teoh@logz.io>

albertteoh deleted the fix-flaky-testreload-take-3 branch May 26, 2021 10:37

albertteoh mentioned this pull request May 26, 2021

Flaky test pkg/config/tlscfg/TestReload (cert_watcher_test.go) #2644

Closed

jpkrohling added this to the Release 1.23.0 milestone Jun 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move cert loading failure assertion on client public key write #2655

Move cert loading failure assertion on client public key write #2655

albertteoh commented Nov 26, 2020 •

edited

Loading

codecov bot commented Nov 26, 2020 •

edited

Loading

objectiser left a comment

yurishkuro Nov 26, 2020

yurishkuro Nov 26, 2020

objectiser Nov 26, 2020

albertteoh Nov 27, 2020 •

edited

Loading

objectiser Nov 27, 2020

albertteoh Nov 30, 2020

objectiser Nov 30, 2020

albertteoh Nov 30, 2020

yurishkuro Nov 30, 2020

yurishkuro Nov 30, 2020

Move cert loading failure assertion on client public key write #2655

Move cert loading failure assertion on client public key write #2655

Conversation

albertteoh commented Nov 26, 2020 • edited Loading

Which problem is this PR solving?

Short description of the changes

codecov bot commented Nov 26, 2020 • edited Loading

Codecov Report

objectiser left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

albertteoh Nov 27, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

albertteoh commented Nov 26, 2020 •

edited

Loading

codecov bot commented Nov 26, 2020 •

edited

Loading

albertteoh Nov 27, 2020 •

edited

Loading