Benchmarking #136
Replies: 8 comments
-
(I posted the following to #11 before I saw this. 🙂) Benchmarking is critical for decision-making, such as weeding out less fruitful ideas early, as well as for monitoring progress and communicating the merits of our work. So we will be running benchmarks frequently, and we want the workflow to be as low-overhead as possible. Relevant topics
Profiling is a different question.
-
Good thoughts!
-
Another benchmark suite to consider is the Pyston suite: https://github.com/pyston/python-macrobenchmarks/
-
The impact of many of the optimizations we are pursuing (especially in the eval loop) is tied to various specific workloads, sometimes significantly. So it is important that we choose our target workloads conscientiously, and even document the rationale for the choices. In some cases it will also require that we add to our benchmark suite. That said, I do not think we need to focus much at first on the best target workloads, other than to let the idea simmer. We'll be fine for the moment with just the available suites, microbenchmarks and all. I'm sure that it won't take long before we build a stronger intuition for targeting specific workloads with our optimizations, at which point we can apply increasing discipline to our selection (of both benchmarks and optimization ideas). An iterative process like that will allow us to ramp up our effectiveness on this project.
-
FWIW, @zooba pointed me at https://github.com/Azure/azure-sdk-for-python/blob/master/doc/dev/perfstress_tests.md. This is a tool and framework for stress-testing the Azure SDK. It isn't something we would use, but it does offer some insight into a different sort of benchmarking. There may be a lesson or two in there for us, if we don't have other things to look into. 🙂
-
Ooh, cool. Maybe we could contact the author and ask them what they have learned.
-
Emery Berger has done some work on randomized benchmarking to remove a lot of systematic errors.
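A toy sketch of one idea in that spirit (this is not Berger's actual tooling, and the two workload variants are placeholders): interleave and shuffle the runs of the variants being compared, so that slow environmental drift (thermal throttling, background load) spreads across all of them instead of biasing whichever one happens to run last.

```python
import random
import statistics
import time

# Two placeholder variants of the same workload that we want to compare.
def variant_genexpr():
    return sum(i * i for i in range(50_000))

def variant_listcomp():
    return sum([i * i for i in range(50_000)])

VARIANTS = {"genexpr": variant_genexpr, "listcomp": variant_listcomp}
REPEATS = 30

# Build a schedule containing every (variant, repetition) pair, then shuffle it
# so that systematic effects are spread evenly over both variants.
schedule = [name for name in VARIANTS for _ in range(REPEATS)]
random.shuffle(schedule)

samples = {name: [] for name in VARIANTS}
for name in schedule:
    start = time.perf_counter()
    VARIANTS[name]()
    samples[name].append(time.perf_counter() - start)

for name, times in samples.items():
    print(f"{name}: median={statistics.median(times):.6f}s "
          f"stdev={statistics.stdev(times):.6f}s")
```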
-
An interesting article on getting reliable benchmark results from a CI system (e.g. GitHub Actions): https://labs.quansight.org/blog/2021/08/github-actions-benchmarks/
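For what it's worth, pyperf (the runner underneath PyPerformance) already does much of the statistical work that kind of setup relies on: multiple worker processes, loop calibration, and warnings about unstable results. A minimal sketch of a pyperf microbenchmark, with a placeholder workload:

```python
# Minimal pyperf microbenchmark sketch; run the script directly and pyperf
# spawns worker processes and reports mean +- stdev for the workload.
import pyperf

def build_squares():
    return [i * i for i in range(1_000)]

if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_func("build_squares", build_squares)
```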
-
I'd like to have a benchmark so we have something concrete to target.
There are many benchmarks in PyPerformance, and the full suite takes a long time to run. Some of the benchmarks are ancient (from the FORTRAN days) and focus on numeric array operations. I'm not interested in those (the users who have numeric arrays are all using numpy or a tensor package).
I like benchmarks that represent a more OO style of coding. (Note that even the "float" benchmark, which is supposed to measure float operations like sqrt() and sin()/cos(), was sped up by an improvement to the LOAD_ATTR opcode that speeds up slots. :-) In PyPerformance there is a group of benchmarks that represents "apps"; we could use that group, or pick one of them.
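To make that concrete, here's a toy example (the class and workload are invented for illustration) of the slot-heavy, attribute-access-heavy OO code that a specialized LOAD_ATTR helps:

```python
class Point:
    # __slots__ turns x/y into slot descriptors, the case the
    # LOAD_ATTR improvement targets.
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

def total_norm_squared(points):
    total = 0.0
    for p in points:
        # Four slot attribute loads per iteration.
        total += p.x * p.x + p.y * p.y
    return total

if __name__ == "__main__":
    pts = [Point(i * 0.5, i * 0.25) for i in range(100_000)]
    print(total_norm_squared(pts))
```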
There are also the benchmarks that the Pyston v2 project created: https://github.com/pyston/python-macrobenchmarks/ -- these would be interesting to try, since Pyston has a somewhat similar goal to ours (keep the C API unchanged) and is farther along (claiming to be 20% faster), but Pyston itself is closed source (for now).
For me, an important requirement is that a benchmark runs fairly quickly. If I had a benchmark that ran for a minute, I'd probably run it a lot to validate the various tweaks I'm experimenting with, even knowing that the results were pretty noisy. OTOH, if I only had a benchmark that ran for 15 minutes, I'd probably run it only once or twice a day; if it ran for an hour, I'd probably only run it overnight. We should still run all of PyPerformance occasionally, since it is used by the core dev team to validate whether a proposed speedup (a) does anything good for at least some of the benchmarks, and (b) doesn't slow anything down.
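As an illustration of the "runs in seconds" end of that spectrum, something as crude as a best-of-N wall-clock check is often enough while iterating on a tweak (the workload here is a placeholder; the full PyPerformance run remains the real arbiter):

```python
# Quick-and-dirty best-of-N timing for fast iteration; noisy, but enough to
# tell whether a tweak is in the right ballpark before a full suite run.
import time

def workload():
    # Placeholder for one representative benchmark workload.
    data = {str(i): i for i in range(20_000)}
    return sum(data[k] for k in data)

def best_of(n=5):
    best = float("inf")
    for _ in range(n):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    print(f"best of 5: {best_of():.4f}s")
```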