Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move cert loading failure assertion on client public key write #2655

Merged

Conversation

albertteoh
Copy link
Contributor

@albertteoh albertteoh commented Nov 26, 2020

Signed-off-by: albertteoh albert.teoh@logz.io

Which problem is this PR solving?

Short description of the changes

  • Local testing revealed that "Failed to load certificate" is logged every time; however some tests in CI are not logging this failure state, perhaps due to disk I/O variability.
  • Given the failure signal is unreliable, this PR removes the failure assertion and only performs an assertion identical to the waitUntil condition which should make this test more reliable.

@albertteoh albertteoh requested a review from a team as a code owner November 26, 2020 12:26
@codecov
Copy link

codecov bot commented Nov 26, 2020

Codecov Report

Merging #2655 (8758552) into master (adfdd6a) will increase coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2655      +/-   ##
==========================================
+ Coverage   95.09%   95.10%   +0.01%     
==========================================
  Files         213      213              
  Lines        9491     9491              
==========================================
+ Hits         9025     9026       +1     
  Misses        388      388              
+ Partials       78       77       -1     
Impacted Files Coverage Δ
plugin/storage/integration/integration.go 77.34% <0.00%> (-0.56%) ⬇️
cmd/query/app/server.go 90.16% <0.00%> (+1.63%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update adfdd6a...8758552. Read the comment docs.

objectiser
objectiser previously approved these changes Nov 26, 2020
Copy link
Contributor

@objectiser objectiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the comment on

// If mismatched keys are present in the cert, the "Failed to load certificate" error will be logged instead.
the "Failed to load certificate" shouldn't be generated anyway - but given the desired state (e.g. "Loaded modified certificate") is asserted, looks ok to remove this check.

@@ -99,15 +99,7 @@ func TestReload(t *testing.T) {
FilterField(zap.String("certificate", keyFile.Name())).Len() > 0
}, 100, time.Millisecond*200)

// Logged when the cert is modified with the client's public key due to
// a mismatch with the existing server private key.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds to me that if this doesn't always happen then we have a flaky test conditions that would result in some code branches not executed reliably. The assertion was meant to ensure that they are always executed. If we remove the assertion, it could make the test results more stable, but the code coverage less stable, which is just as bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, the usual reason for flaky test behavior is heavier CPU preemption in the shared CI environments. I see that the wait above is only for 200ms, it's quite easy for the run to hiccup for longer than that, I usually make sure the wait is at least a second. The ideal solution is to make the test conditions deterministic, i.e. if we certain that the condition will happen, then we can safely use longer waits, like 10sec, since it's unlikely the CI run will be CPU starved for that long while at the same time the long waits won't make the test run longer under normal circumstances (especially locally).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the logged error is expected, then the comment I referenced is incorrect. However there is another test that covers the mismatch case that does check for that logged error.

However agree that increasing the wait may have been a good first step.

Copy link
Contributor Author

@albertteoh albertteoh Nov 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wait above is 200ms x 100 iterations = 20sec, which is more than enough time I think.

@objectiser is right that there is existing line coverage for the public/private key mismatch case under the TestReload_err_cert_update test.

The problem as it stands is that we're making two file writes in succession then asserting on both states. If we do, however, want to have a stronger assertion on the behaviour of cert reloads, why don't we assert each state after each file write?

That is, move the failure assertion up, after writing the client's public key to the certFile; which should give enough time for the certWatcher to pick up this change in state.

That is:

	// Write the client's public key
	certData, err = ioutil.ReadFile(clientCert)
	require.NoError(t, err)
	err = syncWrite(certFile.Name(), certData, 0644)
	require.NoError(t, err)

        // Wait for the cert watcher to log an error due to mismatching private/public keys
	waitUntil(func() bool {
		// Logged when the cert is modified with the client's public key due to
		// a mismatch with the existing server private key.
		return logObserver.FilterMessage("Failed to load certificate").
			FilterField(zap.String("certificate", certFile.Name())).Len() > 0
	}, 100, time.Millisecond*200)

	assert.True(t, logObserver.
		FilterMessage("Failed to load certificate").
		FilterField(zap.String("certificate", certFile.Name())).Len() > 0,
		"Failed to find wanted logs. All logs: "+fmt.Sprint(logObserver.All()))

        // Write the client's private key
	keyData, err = ioutil.ReadFile(clientKey)
	require.NoError(t, err)
	err = syncWrite(keyFile.Name(), keyData, 0644)
	require.NoError(t, err)
        ... // wait for "Loaded modified certificate" as below

What do you think @yurishkuro @objectiser?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the main state this test is interested in is the "loaded modified certificate", so checking for the failure is not strictly necessary, doing it this way may help users examining the test logs that find the error, as they would see the test was anticipating it. So +1 from me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @objectiser & @yurishkuro, I've moved the failure assertion to after writing the client's public key as suggested.

@albertteoh albertteoh force-pushed the fix-flaky-testreload-take-3 branch from 1a24f20 to 21abf3d Compare November 27, 2020 04:23
@mergify mergify bot dismissed objectiser’s stale review November 27, 2020 04:24

Pull request has been modified.

albertteoh added 2 commits November 30, 2020 16:15
Signed-off-by: albertteoh <albert.teoh@logz.io>
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh force-pushed the fix-flaky-testreload-take-3 branch from 5a465f8 to 719c74b Compare November 30, 2020 05:15
@albertteoh albertteoh changed the title Remove assertion on unreliable cert loading failure signal Move cert loading failure assertion on client public key write Nov 30, 2020
assert.True(t, logObserver.
FilterMessage("Failed to load certificate").
FilterField(zap.String("certificate", certFile.Name())).Len() > 0,
"Failed to find wanted logs. All logs: "+fmt.Sprint(logObserver.All()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor point - could you differentiate this log message from the one at the end of this function - e.g. "Unable to locate 'Failed to load certificate' error in log"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, good pickup, @objectiser.

Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh force-pushed the fix-flaky-testreload-take-3 branch from fdc4e3e to 58bfc88 Compare November 30, 2020 10:24
objectiser
objectiser previously approved these changes Nov 30, 2020
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh force-pushed the fix-flaky-testreload-take-3 branch from 796731a to b114ab2 Compare November 30, 2020 10:39
@mergify mergify bot dismissed objectiser’s stale review November 30, 2020 10:39

Pull request has been modified.

@objectiser objectiser merged commit 221ce20 into jaegertracing:master Nov 30, 2020
// Logged when the cert is reloaded with mismatching client public key and existing server private key.
return logObserver.FilterMessage("Failed to load certificate").
FilterField(zap.String("certificate", certFile.Name())).Len() > 0
}, 100, time.Millisecond*200)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of 100 * 200ms, I would do 2000 * 10ms - minimize the guaranteed delay, keep total delay the same (absolute # is not particularly important, 20s is probably overkill, but 10s would be fine, enough to outlast any scheduling hiccup).

assert.True(t, logObserver.
FilterMessage("Failed to load certificate").
FilterField(zap.String("certificate", certFile.Name())).Len() > 0,
"Unable to locate 'Failed to load certificate' in log. All logs: "+fmt.Sprint(logObserver.All()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: assert uses Printf syntax, you could do

"Unable to locate 'Failed to load certificate' in log. All logs: %v", logObserver.All())

albertteoh pushed a commit to albertteoh/jaeger that referenced this pull request Dec 1, 2020
Signed-off-by: albertteoh <albert.teoh@logz.io>
yurishkuro pushed a commit that referenced this pull request Dec 1, 2020
Signed-off-by: albertteoh <albert.teoh@logz.io>
@albertteoh albertteoh deleted the fix-flaky-testreload-take-3 branch May 26, 2021 10:37
@jpkrohling jpkrohling added this to the Release 1.23.0 milestone Jun 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test pkg/config/tlscfg/TestReload (cert_watcher_test.go)
4 participants