
Add optimized SSE2 routines for bottleneck functions #30

Open · wants to merge 11 commits into master
Conversation

aklomp

@aklomp aklomp commented Sep 10, 2014

This pull request provides SSE2 vectorized implementations of alg_update_reference_frame() and alg_noise_tune(). Profiling with Callgrind on my Atom server showed that the first was the most expensive single function call. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for alg_update_reference_frame(), and around 4× for alg_noise_tune(). Results differ across hardware and compilers, but always show significant speedup.

The plain functions have been lifted out of alg.c and placed in the new alg/ subdirectory, along with their SSE2 versions. In alg.c, the preprocessor chooses between including the plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but it made it possible to build a test harness around the functions that checks their correctness and runs performance benchmarks. This harness can be found in alg/tests.

My website has a writeup of how I converted alg_update_reference_frame() to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.

This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!

@tosiara

tosiara commented Sep 11, 2014

That's nice! Would the ARM platform (SIMD/NEON) benefit from this?

@aklomp
Author

aklomp commented Sep 11, 2014

No, ARM won't benefit from this patch directly. The functions I added use x86 SSE2 intrinsics, and those are not portable. Platforms that don't support SSE2 will keep using the current "plain" implementation. Here's the compile-time dispatcher for one of the functions.

Of course the vectorized algorithm could be ported to NEON intrinsics, and someone could add an alg/alg_noise_tune.neon.c to the dispatcher.

aklomp added 8 commits October 5, 2014 22:41
Taking the tests out of the branch condition results in a 50% speedup
with GCC when compiled with -O3.
This file implements the masking algorithm we will use in the SSE2 code.
It demonstrates identical output when compared to the plain function.
It's a lot slower than the regular routine when used on individual
pixels, but is speedier when vectorized.

Instead of branching, we use masks to composite the output values. The
masks were found by breaking down the branch conditions into boolean
operations, and repeatedly applying De Morgan's laws to simplify them.
Since SSE has the 'andnot' operator, optimize for that form.
To convert the algorithms to SSE2, we need to know exactly which width and type of int we're dealing with. Make this a uint16_t; the type looks large enough for a counter that updates every frame. At 10 Hz, it will take almost 2 hours for the counter to saturate; enough time to finally accept a static object.
Directly run two functions on the same input and check whether they give
the same output; don't just rely on printing a few numbers to the screen
and eyeballing the results.