Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky TestAsyncRuleEvaluation #13547

Closed
bboreham opened this issue Feb 6, 2024 · 10 comments · Fixed by #14100
Closed

Flaky TestAsyncRuleEvaluation #13547

bboreham opened this issue Feb 6, 2024 · 10 comments · Fixed by #14100

Comments

@bboreham
Copy link
Member

bboreham commented Feb 6, 2024

Failed here: https://github.com/prometheus/prometheus/actions/runs/7797756253/job/21265037848?pr=13545

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
    manager_test.go:1961: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1961
        	Error:      	"0.0655291" is not less than "0.06"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
--- FAIL: TestAsyncRuleEvaluation (0.18s)
    --- FAIL: TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency (0.07s)

Passed on re-run: https://github.com/prometheus/prometheus/actions/runs/7797756253/job/21265977651?pr=13545

Also failed here: https://github.com/prometheus/prometheus/actions/runs/7725797136/job/21060788152?pr=13197

@darshanime
Copy link
Contributor

I think creating more rules for evaluation in fixtures/rules_multiple_independent.yaml might fix the flakeyness, since the time saved with concurrent evaluations would become more apparent. wdyt?

@bboreham
Copy link
Member Author

bboreham commented Feb 6, 2024

I haven't looked at the test at all.
Since it's only flaking on Windows, could it be the timer resolution again?

@darshanime
Copy link
Contributor

TIL about windows timer resolution being ~15ms!

My hypothesis is that the concurrency does not get to "kick in" with just 6 rules, and so if we increase the number of rules to evaluate, the time savings due to concurrency will get more marked.

check #13553

@bboreham
Copy link
Member Author

bboreham commented Feb 7, 2024

I just got this on my desktop Linux VM:

$ go test -tags stringlabels ./rules/
    --- FAIL: TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules (0.06s)
        manager_test.go:2391: 
                Error Trace:    /home/vagrant/src/github.com/grafana/mimir-prometheus/rules/manager_test.go:2391
                Error:          "0.056689165" is not less than "0.04"
                Test:           TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules

@darshanime
Copy link
Contributor

Huh, that test is flakey too? I think the same technique will help there as well. Updated

@zenador
Copy link
Contributor

zenador commented Feb 28, 2024

Flaked on https://github.com/prometheus/prometheus/actions/runs/8080900355/job/22078272296?pr=13662 (windows)

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules
    manager_test.go:1926: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1926
        	Error:      	"0.0411435" is not less than "0.04"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_with_independent_and_dependent_rules

and

https://github.com/prometheus/prometheus/actions/runs/8153560332/job/22285206086?pr=13706 (windows)

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_sufficient_concurrency
    manager_test.go:2000: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:2000
        	Error:      	"0.06309" is not less than "0.06"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_sufficient_concurrency

@bboreham
Copy link
Member Author

Looking at the code again, it seems to rely on a 10ms Sleep being reasonably close to 10ms.

time.Sleep(artificialDelay)

require.Less(t, time.Since(start).Seconds(), (time.Duration(ruleCount) * artificialDelay).Seconds())

This assumption is not safe: golang/go#44343 (but will be safer in Go 1.23).

The timings in flake reports above don't match neatly to the 15ms value on Windows, so I guess something else is going wrong.

@zenador
Copy link
Contributor

zenador commented May 14, 2024

Another example: https://github.com/prometheus/prometheus/actions/runs/9076113775/job/24938133905?pr=14098#step:4:13802

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
    manager_test.go:1962: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1962
        	Error:      	"0.07915" is not less than "0.06"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency

@machine424
Copy link
Collaborator

Unfortunately we're still seeing this on main

=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
    manager_test.go:1963: 
        	Error Trace:	D:/a/prometheus/prometheus/rules/manager_test.go:1963
        	Error:      	"0.8259941" is not less than "0.09"
        	Test:       	TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_insufficient_concurrency
=== RUN   TestAsyncRuleEvaluation/asynchronous_evaluation_of_all_independent_rules,_sufficient_concurrency
--- FAIL: TestAsyncRuleEvaluation (1.01s)

https://github.com/prometheus/prometheus/actions/runs/9178564530/job/25238754426#step:4:14073

@machine424 machine424 reopened this May 21, 2024
@machine424
Copy link
Collaborator

#14300 was merged for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants