profiler: TestStopLatency fails on nightly race detector tests #1294
The test needs to be re-evaluated as it should never have passed in the first place. See issue #1294.
I'm actually having a hard time reproducing this test failure locally:

    $ go test -run 'TestStopLatency' -count 100 -race ./profiler
    ok  	gopkg.in/DataDog/dd-trace-go.v1/profiler	224.970s

Anyway, what about just increasing the timeout to e.g. 500ms? This test is mostly trying to prove that stopping a profiler that was just started takes (much) less time than the configured profiling period.
I see a few more failures locally.
The timeout for the test was not realistic given the constraints of the Go runtime CPU profiler. Increase the timeout and add a second test case meant to exercise the latency of both stopping the profiling as well as stopping the profile upload. Fixes #1294
Is there any reasonable bound we can place on how long stopping the upload should take? It surprises me that stopping takes up to a second in those failures.
Maybe the upload test could be changed to not depend on timing at all? We could just test that the
Me too. I'm not even sure what the root cause is for CircleCI to behave so extremely. Are we running the tests very differently? Maybe every package gets tested concurrently, causing big CPU spikes? Or are we getting hit by noisy neighbors? Or both?
Right, we could remove the time check for uploading and just rely on the channel that keeps the HTTP handler from returning until the end of the test. Then we know that if Stop returns, the upload was stopped. As for the slowness, it could be inconsistency from the CircleCI environment. I'm mainly hoping this isn't a regression caused by making the profile collection happen concurrently in #1282.
I do have a theory on the
PR #1297 attempted to fix the flakiness noted in issue #1294 by creating two separate tests: one which runs a long profile to test the latency of stopping the profile, and another which runs short profiles and makes uploading hang indefinitely. However, the upload test had such a short profiling period that the inner select statement in (*profiler).collect could take several iterations to actually cancel, because the Go runtime chooses randomly among select cases when multiple cases are ready. In addition, the "stop-profiler" case didn't actually test what it was intended to test, since profiling doesn't run until one full profiling period has elapsed; with the period set to an hour, Stop was called before profiling had actually started. Merge the two tests back into one. This brings us full circle back to the original test, but with a more generous window on how long stopping should take and without relying on modifying internal functions to make the test work.
Great investigation. The channel selection theory makes sense to me.
#1307 should have fixed this, closing.
Example failure:
The test timeout was bumped from 20ms to 200ms by #1282 after noticing this same failure, but even a 200ms timeout isn't enough. CPU profiling can take slightly over 200ms to stop for the following reason:
This can take 200ms + whatever time to serialize the CPU profile. I don't think #1282 broke the behavior so much as the test should have never passed in the first place.