The issue concerns adding benchmarking for the performance optimization that was introduced in release 3.0.1. See especially #30 for a thorough discussion. The short summary is that using a naive linear error propagation algorithm will run the following snippet
sum(ufloat(1, 0.1) for _ in range(n)).std_dev
in O(N^2) time whereas the lazy weight evaluation algorithm innovated for 3.0.1 runs it in O(N) time. I think this is one of the main technical innovations from this package that separates it from anything anyone would try to "whip up".
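To make the quadratic behavior concrete, here is a minimal toy sketch (not the uncertainties implementation; NaiveUFloat and its attributes are hypothetical names) of eager derivative propagation, where every addition copies the derivative dict of all variables accumulated so far:

# A toy, NOT the uncertainties implementation: it only illustrates why
# eagerly propagating the full derivative dict makes sum() quadratic in
# the number of terms.
import math


class NaiveUFloat:
    def __init__(self, nominal, std_dev=0.0, derivatives=None):
        self.nominal = nominal
        self.std_dev_if_independent = std_dev
        # Map each independent variable to the derivative of this value
        # with respect to it.  A fresh NaiveUFloat is its own variable.
        self.derivatives = {self: 1.0} if derivatives is None else derivatives

    def __add__(self, other):
        if not isinstance(other, NaiveUFloat):  # e.g. the 0 that sum() starts with
            return NaiveUFloat(self.nominal + other, derivatives=dict(self.derivatives))
        # The eager copy/merge below is the O(N^2) culprit: the i-th
        # addition touches all i variables accumulated so far.
        merged = dict(self.derivatives)
        for var, deriv in other.derivatives.items():
            merged[var] = merged.get(var, 0.0) + deriv
        return NaiveUFloat(self.nominal + other.nominal, derivatives=merged)

    __radd__ = __add__

    @property
    def std_dev(self):
        return math.sqrt(
            sum(
                (deriv * var.std_dev_if_independent) ** 2
                for var, deriv in self.derivatives.items()
            )
        )


# Quadratic: each of the N additions copies an ever-growing derivative dict.
print(sum(NaiveUFloat(1, 0.1) for _ in range(1000)).std_dev)

The lazy approach introduced in 3.0.1, by contrast, defers combining these derivative terms until the standard deviation is actually requested, which is what keeps the sum linear.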
PR #262 refactors the code that implements this lazy weight evaluation algorithm. For this reason I think it is necessary that we set up benchmarking tests that we confirm pass before and after merging #262. In fact, I've re-implemented the error propagation multiple times while drafting #262 and have repeatedly failed to realize the O(N) execution time. This demonstrates the importance of benchmarking tests to help us ensure we don't accidentally introduce a change that ruins the performance (as I already did accidentally during my drafts, with just two lines in the wrong place).
I have the following simple benchmarking code:

import platform
import psutil
import timeit

from uncertainties import ufloat


def get_system_info():
    return {
        'platform': platform.system(),
        'platform-release': platform.release(),
        'platform-version': platform.version(),
        'architecture': platform.machine(),
        'processor': platform.processor(),
        'ram': str(round(psutil.virtual_memory().total / (1024.0 ** 3))) + " GB",
    }


def ufloat_sum_benchmark(num):
    str(sum(ufloat(1, 1) for _ in range(num)))


if __name__ == "__main__":
    for key, value in get_system_info().items():
        print(f'{key:17}: {value}')
    for n in (10, 100, 1000, 10000, 100000):
        print(f'### {n=} ###')
        reps = int(100000 / n)
        t = timeit.timeit(lambda: ufloat_sum_benchmark(n), number=reps)
        print(f'Test duration: {t:.2f} s, Repetitions: {reps}')
        print(f'Average execution time: {t/reps:.4f} s')
On my system, on the master branch, I get:
platform : Windows
platform-release : 10
platform-version : 10.0.19045
architecture : AMD64
processor : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
ram : 16 GB
### n=10 ###
Test duration: 0.90 s, Repetitions: 10000
Average execution time: 0.0001 s
### n=100 ###
Test duration: 0.79 s, Repetitions: 1000
Average execution time: 0.0008 s
### n=1000 ###
Test duration: 0.95 s, Repetitions: 100
Average execution time: 0.0095 s
### n=10000 ###
Test duration: 0.99 s, Repetitions: 10
Average execution time: 0.0995 s
### n=100000 ###
Test duration: 0.71 s, Repetitions: 1
Average execution time: 0.7065 s
On the feature/linear_combo_refactor branch I get:
platform : Windows
platform-release : 10
platform-version : 10.0.19045
architecture : AMD64
processor : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
ram : 16 GB
### n=10 ###
Test duration: 2.40 s, Repetitions: 10000
Average execution time: 0.0002 s
### n=100 ###
Test duration: 2.12 s, Repetitions: 1000
Average execution time: 0.0021 s
### n=1000 ###
Test duration: 1.98 s, Repetitions: 100
Average execution time: 0.0198 s
### n=10000 ###
Test duration: 1.86 s, Repetitions: 10
Average execution time: 0.1860 s
### n=100000 ###
Test duration: 1.77 s, Repetitions: 1
Average execution time: 1.7692 s
So we see that the new code is still linear in time, but between 2x and 3x slower than master. My guess is that the new code could win back that factor of 2x-3x with some careful profiling, but I haven't done that yet.
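For reference, a quick way to look for those hot spots (just a sketch; ufloat_sum_benchmark is the same function as in the script above) would be something like:

import cProfile
import pstats

from uncertainties import ufloat


def ufloat_sum_benchmark(num):
    str(sum(ufloat(1, 1) for _ in range(num)))


# Profile the n=100000 case and print the 20 most expensive call sites.
profiler = cProfile.Profile()
profiler.runcall(ufloat_sum_benchmark, 100000)
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)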
Anyway, the point of this issue is to help me answer the question: what should a benchmarking regression test look like? Perhaps I could run this code at these five or so sizes and simply make sure the runtimes don't exceed thresholds that are, say, 3x more than the current master branch performance? I could also check that the runtime scales as O(N) to within a factor of, say, 0.5-2 or 0.25-4? I'm not sure of the right way to set thresholds so that (1) we catch regressions but (2) good code doesn't unluckily fail the benchmark. Maybe there's a way to make the benchmarking tests "information only", so they don't cause CI to fail but do alert code reviewers? Thoughts? I've also asked this question elsewhere to get even more help with this, since it's new to me.
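For the record, here is a sketch of one possible shape for such a test (hypothetical, not something already in the package; the sizes and the 2.5x-40x band are placeholder thresholds that would need tuning against CI noise):

import timeit

from uncertainties import ufloat


def ufloat_sum_time(n, repeats=3):
    # Best of several runs reduces the influence of other processes.
    return min(
        timeit.repeat(
            lambda: sum(ufloat(1, 1) for _ in range(n)).std_dev,
            number=1,
            repeat=repeats,
        )
    )


def test_sum_scales_roughly_linearly():
    t_small = ufloat_sum_time(10_000)
    t_large = ufloat_sum_time(100_000)
    ratio = t_large / t_small
    # Linear scaling gives a ratio near 10, quadratic near 100.  The wide
    # band is meant to catch an O(N^2) regression without flaky failures.
    assert 2.5 < ratio < 40, f"unexpected scaling ratio: {ratio:.1f}"

One way to get the "information only" behavior mentioned above might be to put tests like this behind a dedicated pytest marker and run them in a separate, non-required CI job.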
- [x] Closes #274
- [x] Executed `pre-commit run --all-files` with no errors
- [x] The change is fully covered by automated unit tests
- [x] Documented in docs/ as appropriate
- [x] Added an entry to the CHANGES file
Add a performance benchmark test. This test is especially important to
ensure #262 doesn't introduce a performance regression.
---------
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>