The issue concerns adding benchmarking for the performance optimization that was introduced in release 3.0.1. See especially #30 for a thorough discussion. The short summary is that using a naive linear error propagation algorithm will run the following snippet
sum(ufloat(1, 0.1) for _ in range(n)).std_dev
in O(N^2) time whereas the lazy weight evaluation algorithm innovated for 3.0.1 runs it in O(N) time. I think this is one of the main technical innovations from this package that separates it from anything anyone would try to "whip up".
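To make the quadratic behavior concrete, here is a minimal toy sketch (not the uncertainties implementation; NaiveUFloat and its attributes are hypothetical names) of eager derivative propagation, where every addition copies the derivative dict of all variables accumulated so far:

# A toy, NOT the uncertainties implementation: it only illustrates why
# eagerly propagating the full derivative dict makes sum() quadratic in
# the number of terms.
import math


class NaiveUFloat:
    def __init__(self, nominal, std_dev=0.0, derivatives=None):
        self.nominal = nominal
        self.std_dev_if_independent = std_dev
        # Map each independent variable to the derivative of this value
        # with respect to it.  A fresh NaiveUFloat is its own variable.
        self.derivatives = {self: 1.0} if derivatives is None else derivatives

    def __add__(self, other):
        if not isinstance(other, NaiveUFloat):  # e.g. the 0 that sum() starts with
            return NaiveUFloat(self.nominal + other, derivatives=dict(self.derivatives))
        # The eager copy/merge below is the O(N^2) culprit: the i-th
        # addition touches all i variables accumulated so far.
        merged = dict(self.derivatives)
        for var, deriv in other.derivatives.items():
            merged[var] = merged.get(var, 0.0) + deriv
        return NaiveUFloat(self.nominal + other.nominal, derivatives=merged)

    __radd__ = __add__

    @property
    def std_dev(self):
        return math.sqrt(
            sum(
                (deriv * var.std_dev_if_independent) ** 2
                for var, deriv in self.derivatives.items()
            )
        )


# Quadratic: each of the N additions copies an ever-growing derivative dict.
print(sum(NaiveUFloat(1, 0.1) for _ in range(1000)).std_dev)

The lazy approach introduced in 3.0.1, by contrast, defers combining these derivative terms until the standard deviation is actually requested, which is what keeps the sum linear.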
PR #262 refactors the code that implements this lazy weight evaluation algorithm. For this reason I think it is necessary that we set up benchmarking tests that we confirm pass before and after merging #262. In fact, I've re-implemented the error propagation multiple times while drafting #262 and have repeatedly failed to realize the O(N) execution time. This demonstrates the importance of benchmarking tests to help us ensure we don't accidentally introduce a change that ruins the performance (as I already did accidentally during my drafts, with just two lines in the wrong place).
I have the following simple benchmarking code:

import platform
import psutil
import timeit

from uncertainties import ufloat


def get_system_info():
    return {
        'platform': platform.system(),
        'platform-release': platform.release(),
        'platform-version': platform.version(),
        'architecture': platform.machine(),
        'processor': platform.processor(),
        'ram': str(round(psutil.virtual_memory().total / (1024.0 ** 3))) + " GB",
    }


def ufloat_sum_benchmark(num):
    str(sum(ufloat(1, 1) for _ in range(num)))


if __name__ == "__main__":
    for key, value in get_system_info().items():
        print(f'{key:17}: {value}')
    for n in (10, 100, 1000, 10000, 100000):
        print(f'### {n=} ###')
        reps = int(100000 / n)
        t = timeit.timeit(lambda: ufloat_sum_benchmark(n), number=reps)
        print(f'Test duration: {t:.2f} s, Repetitions: {reps}')
        print(f'Average execution time: {t/reps:.4f} s')
On my system, on the master branch, I get:
platform : Windows
platform-release : 10
platform-version : 10.0.19045
architecture : AMD64
processor : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
ram : 16 GB
### n=10 ###
Test duration: 0.90 s, Repetitions: 10000
Average execution time: 0.0001 s
### n=100 ###
Test duration: 0.79 s, Repetitions: 1000
Average execution time: 0.0008 s
### n=1000 ###
Test duration: 0.95 s, Repetitions: 100
Average execution time: 0.0095 s
### n=10000 ###
Test duration: 0.99 s, Repetitions: 10
Average execution time: 0.0995 s
### n=100000 ###
Test duration: 0.71 s, Repetitions: 1
Average execution time: 0.7065 s
On the feature/linear_combo_refactor branch I get:
platform : Windows
platform-release : 10
platform-version : 10.0.19045
architecture : AMD64
processor : Intel64 Family 6 Model 154 Stepping 4, GenuineIntel
ram : 16 GB
### n=10 ###
Test duration: 2.40 s, Repetitions: 10000
Average execution time: 0.0002 s
### n=100 ###
Test duration: 2.12 s, Repetitions: 1000
Average execution time: 0.0021 s
### n=1000 ###
Test duration: 1.98 s, Repetitions: 100
Average execution time: 0.0198 s
### n=10000 ###
Test duration: 1.86 s, Repetitions: 10
Average execution time: 0.1860 s
### n=100000 ###
Test duration: 1.77 s, Repetitions: 1
Average execution time: 1.7692 s
So we see that the new code is still linear in time, but between 2x and 3x slower than master. My guess is that the new code could win back that factor of 2x-3x with some careful profiling, but I haven't done that yet.
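For reference, a quick way to look for those hot spots (just a sketch; ufloat_sum_benchmark is the same function as in the script above) would be something like:

import cProfile
import pstats

from uncertainties import ufloat


def ufloat_sum_benchmark(num):
    str(sum(ufloat(1, 1) for _ in range(num)))


# Profile the n=100000 case and print the 20 most expensive call sites.
profiler = cProfile.Profile()
profiler.runcall(ufloat_sum_benchmark, 100000)
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)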
Anyway, the point of this issue is to help me answer the question: what should a benchmarking regression test look like? Perhaps I could run this code at these five or so sizes and simply make sure the runtimes don't exceed thresholds that are, say, 3x more than the current master branch performance? I could also check that the runtime scales as O(N) to within a factor of, say, 0.5-2 or 0.25-4? I'm not sure of the right way to set thresholds so that (1) we catch regressions but (2) good code doesn't unluckily fail the benchmark. Maybe there's a way to make the benchmarking tests "information only", so they don't cause CI to fail but do alert code reviewers? Thoughts? I've also asked this question elsewhere to get even more help with this, since it's new to me.
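For the record, here is a sketch of one possible shape for such a test (hypothetical, not something already in the package; the sizes and the 2.5x-40x band are placeholder thresholds that would need tuning against CI noise):

import timeit

from uncertainties import ufloat


def ufloat_sum_time(n, repeats=3):
    # Best of several runs reduces the influence of other processes.
    return min(
        timeit.repeat(
            lambda: sum(ufloat(1, 1) for _ in range(n)).std_dev,
            number=1,
            repeat=repeats,
        )
    )


def test_sum_scales_roughly_linearly():
    t_small = ufloat_sum_time(10_000)
    t_large = ufloat_sum_time(100_000)
    ratio = t_large / t_small
    # Linear scaling gives a ratio near 10, quadratic near 100.  The wide
    # band is meant to catch an O(N^2) regression without flaky failures.
    assert 2.5 < ratio < 40, f"unexpected scaling ratio: {ratio:.1f}"

One way to get the "information only" behavior mentioned above might be to put tests like this behind a dedicated pytest marker and run them in a separate, non-required CI job.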
- [x] Closes #274
- [x] Executed `pre-commit run --all-files` with no errors
- [x] The change is fully covered by automated unit tests
- [x] Documented in docs/ as appropriate
- [x] Added an entry to the CHANGES file
Add a performance benchmark test. This test is especially important to
ensure #262 doesn't introduce a performance regression.
---------
Co-authored-by: andrewgsavage <andrewgsavage@gmail.com>