Flaky Test - load-test (TestLog10kDPS) #9094
Comments
@open-telemetry/collector-contrib-approvers, @open-telemetry/collector-approvers, the load tests have always been a source of test failures for us. All we ever do is raise the limits once we start hitting them more frequently. I don't remember ever seeing a true positive out of those tests. Are those tests providing value to us, or are they just a source of noise?
I agree that the current tests don't add a lot of value and are a frequent source of flakes. While working on something unrelated, I recently looked at the Rust compiler's process for this, which may be interesting for this discussion. The process is as follows:
Reproducing this in our setup would need additional infra to run benchmarks and store historical data, so I don't think this is something we can do without significant effort, but I still think it's interesting to see what other projects do for inspiration.
I remember they once caught a performance regression when the Protobuf dependency was updated to a new version and caused a significant slowdown. CNCF provided us with dedicated EC2 instances where we could run the perf tests in a more stable environment, but we never had time to work on this.
Responding narrowly to the immediate issue - this specific test expectation was "fixed" in #9023, with the standard approach of raising the limit.
@djaglowski it looks like that PR "fixed" the test named "filelog" within the Should I submit a PR to up the OTLP test limit and continue the discussion around a more proper solution in a different issue?
@TylerHelmuth, you're right, I was too broad in my assertion there. I think what you've suggested makes sense. A larger discussion on these tests deserves its own issue, but we can loosen these test constraints immediately.
Ok, I will do that. Should I go ahead and up all the values (memory and CPU) for all the tests in this suite? That would probably go a long way toward making PRs have green checks instead of red Xs.
The rule of thumb that has been followed so far is that when establishing the limits, set them at roughly 15% above observed values. We should probably stick to that until there's consensus on the larger question.
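To make the 15% rule of thumb concrete, here is a minimal sketch of the arithmetic in Go. The helper name `limitFromObserved`, the constant `headroomPercent`, and the observed values are all hypothetical and only for illustration; this is not the testbed's API, and the real limits are set by hand in the test definitions.

```go
package main

import "fmt"

// headroomPercent is the margin added on top of the observed value.
// 15 reflects the "roughly 15%" rule of thumb from this thread (an assumption,
// not a constant that exists in the testbed).
const headroomPercent = 15

// limitFromObserved is a hypothetical helper that returns a limit roughly 15%
// above an observed peak, rounded up to the next whole unit.
// Integer math is used so the result is deterministic.
func limitFromObserved(observedPeak uint32) uint32 {
	return (observedPeak*(100+headroomPercent) + 99) / 100
}

func main() {
	// Made-up observed peaks, for illustration only.
	observedCPUPercent := uint32(20)
	observedRAMMiB := uint32(80)

	fmt.Println("CPU limit (%):", limitFromObserved(observedCPUPercent))
	fmt.Println("RAM limit (MiB):", limitFromObserved(observedRAMMiB))
}
```

Rounding up rather than down keeps the limit from ever landing below the 15% margin when the observed value is small.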
Describe the bug
The TestLog10kDPS test within the load-test workflow is failing intermittently. Here are some examples:
Steps to reproduce
Run the load-test workflow
What did you expect to see?
The TestLog10kDPS tests passing unless code had been modified in a way that would degrade performance.
What did you see instead?
The TestLog10kDPS tests failing intermittently.
Additional context
There are other tests in the load-test workflow that are failing, but I noticed TestLog10kDPS failing the most.