Add optimized SSE2 routines for bottleneck functions #30
base: master
Conversation
That's nice! Would the ARM platform (SIMD, NEON) benefit from this?
No, ARM won't benefit from this patch directly. The functions I added use x86 SSE2 intrinsics, and those are not portable. Platforms that don't support SSE2 will keep using the current "plain" implementation. Of course the vectorized algorithm could be ported to NEON intrinsics, and someone could add an equivalent ARM code path later. Here's the compile-time dispatcher for one of the functions.
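(Sketched here in outline; the exact file names under `alg/` are assumptions.)

```c
/* In alg.c: pick the SSE2 or plain implementation at compile time.
 * __SSE2__ is defined by GCC/Clang when SSE2 code generation is
 * enabled. The file names below are illustrative. */
#if defined(__SSE2__)
#  include "alg/alg_update_reference_frame_sse2.c"
#else
#  include "alg/alg_update_reference_frame_plain.c"
#endif
```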
Taking the tests out of the branch condition results in a 50% speedup with GCC when compiled with -O3.
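For illustration, this is the kind of transformation meant, shown on a hypothetical condition rather than the actual code under review:

```c
/* Before: the || operator short-circuits, so the compiler may emit
 * a branch per test. */
static int classify_branchy(int diff, int noise, int ref)
{
    if (diff > noise || ref == 0)
        return 1;
    return 0;
}

/* After: the tests are taken out of the branch condition and
 * combined with bitwise |; GCC at -O3 can then compute them with
 * flag-setting instructions and no branch at all. */
static int classify_hoisted(int diff, int noise, int ref)
{
    int is_noisy = diff > noise;
    int is_unset = ref == 0;
    return is_noisy | is_unset;
}
```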
This file implements the masking algorithm we will use in the SSE2 code. It demonstrates identical output when compared to the plain function. It's a lot slower than the regular routine when used on individual pixels, but speedier when vectorized. Instead of branching, we use masks to composite the output values. The masks were found by breaking the branch conditions down into boolean operations and repeatedly applying De Morgan's laws to simplify them. Since SSE has an 'andnot' operator, the expressions are optimized for that form.
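A minimal sketch of the technique (helper names are illustrative, not the patch's actual functions):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar version: select between a and b with a mask instead of
 * an if/else. mask is 0xFF to pick a, 0x00 to pick b. */
static uint8_t select_scalar(uint8_t mask, uint8_t a, uint8_t b)
{
    return (mask & a) | (~mask & b);
}

/* The same compositing in SSE2. _mm_andnot_si128(mask, b) computes
 * ~mask & b in a single instruction, which is why the boolean
 * expressions were simplified toward the andnot form. */
static __m128i select_sse2(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}
```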
To convert the algorithms to SSE2, we need to know exactly which width and type of int we're dealing with. Make this a uint16_t; the type looks large enough for a counter that updates every frame. At 10 Hz, it will take almost two hours for the counter to saturate (65535 / 10 ≈ 6554 seconds): enough time to finally accept a static object.
Directly run two functions on the same input and check whether they give the same output; don't just rely on printing a few numbers to the screen and eyeballing the results.
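A minimal sketch of that kind of check (function names and signatures are assumptions for illustration; the real functions take Motion's own context and image structures):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed signatures for illustration. */
extern void alg_noise_tune_plain(uint8_t *buf, size_t n);
extern void alg_noise_tune_sse2(uint8_t *buf, size_t n);

int main(void)
{
    enum { N = 4096 };
    static uint8_t a[N], b[N];

    /* Feed both implementations identical random input. */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] = (uint8_t)rand();

    alg_noise_tune_plain(a, N);
    alg_noise_tune_sse2(b, N);

    /* Compare byte for byte instead of eyeballing printed values. */
    if (memcmp(a, b, N) != 0) {
        fprintf(stderr, "outputs differ\n");
        return EXIT_FAILURE;
    }
    puts("outputs identical");
    return EXIT_SUCCESS;
}
```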
This pull request provides SSE2 vectorized implementations of `alg_update_reference_frame()` and `alg_noise_tune()`. Profiling with Callgrind on my Atom server showed that the first was the most expensive single function call. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for `alg_update_reference_frame()`, and around 4× for `alg_noise_tune()`. Results differ across hardware and compilers, but always show a significant speedup.

The plain functions have been lifted out of `alg.c` and placed in the new `alg/` subdirectory, along with their SSE2 versions. In `alg.c`, the preprocessor chooses between including the plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but it made it possible to build a test harness around the functions that checks their correctness and runs performance benchmarks. This harness can be found in `alg/tests`.
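For the benchmarking side, a minimal sketch of the approach (the timing method and function signature are assumptions, not the actual harness in `alg/tests`):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Assumed signature for illustration. */
extern void alg_noise_tune_sse2(uint8_t *buf, size_t n);

/* POSIX monotonic clock as a simple timer. */
static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    enum { N = 4096, ITERS = 100000 };
    static uint8_t buf[N];

    double t0 = seconds_now();
    for (int i = 0; i < ITERS; i++)
        alg_noise_tune_sse2(buf, N);
    double t1 = seconds_now();

    printf("%.1f ns per call\n", (t1 - t0) / ITERS * 1e9);
    return 0;
}
```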
My website has a writeup of how I converted `alg_update_reference_frame()` to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.

This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!