Add optimized SSE2 routines for bottleneck functions #30
base: master
Conversation
That's nice! Would the ARM platform (SIMD, NEON) benefit from this?
No, ARM won't benefit from this patch directly. The functions I added use x86 SSE2 intrinsics, and those are not portable. Platforms that don't support SSE2 will keep using the current "plain" implementation. Of course the vectorized algorithm could be ported to NEON intrinsics, and someone could add an equivalent ARM code path later. Here's the compile-time dispatcher for one of the functions.
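(Sketched here in outline; the exact file names under `alg/` are assumptions.)

```c
/* In alg.c: pick the SSE2 or plain implementation at compile time.
 * __SSE2__ is defined by GCC/Clang when SSE2 code generation is
 * enabled. The file names below are illustrative. */
#if defined(__SSE2__)
#  include "alg/alg_update_reference_frame_sse2.c"
#else
#  include "alg/alg_update_reference_frame_plain.c"
#endif
```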
Taking the tests out of the branch condition results in a 50% speedup with GCC when compiled with -O3.
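For illustration, this is the kind of transformation meant, shown on a hypothetical condition rather than the actual code under review:

```c
/* Before: the || operator short-circuits, so the compiler may emit
 * a branch per test. */
static int classify_branchy(int diff, int noise, int ref)
{
    if (diff > noise || ref == 0)
        return 1;
    return 0;
}

/* After: the tests are taken out of the branch condition and
 * combined with bitwise |; GCC at -O3 can then compute them with
 * flag-setting instructions and no branch at all. */
static int classify_hoisted(int diff, int noise, int ref)
{
    int is_noisy = diff > noise;
    int is_unset = ref == 0;
    return is_noisy | is_unset;
}
```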
This file implements the masking algorithm we will use in the SSE2 code. It demonstrates identical output when compared to the plain function. It's a lot slower than the regular routine when used on individual pixels, but speedier when vectorized. Instead of branching, we use masks to composite the output values. The masks were found by breaking the branch conditions down into boolean operations and repeatedly applying De Morgan's laws to simplify them. Since SSE has an 'andnot' operator, the expressions are optimized for that form.
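A minimal sketch of the technique (helper names are illustrative, not the patch's actual functions):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar version: select between a and b with a mask instead of
 * an if/else. mask is 0xFF to pick a, 0x00 to pick b. */
static uint8_t select_scalar(uint8_t mask, uint8_t a, uint8_t b)
{
    return (mask & a) | (~mask & b);
}

/* The same compositing in SSE2. _mm_andnot_si128(mask, b) computes
 * ~mask & b in a single instruction, which is why the boolean
 * expressions were simplified toward the andnot form. */
static __m128i select_sse2(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}
```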
To convert the algorithms to SSE2, we need to know exactly which width and type of int we're dealing with. Make this a uint16_t; the type looks large enough for a counter that updates every frame. At 10 Hz, it will take almost two hours for the counter to saturate (65535 / 10 ≈ 6554 seconds): enough time to finally accept a static object.
Directly run two functions on the same input and check whether they give the same output; don't just rely on printing a few numbers to the screen and eyeballing the results.
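A minimal sketch of that kind of check (function names and signatures are assumptions for illustration; the real functions take Motion's own context and image structures):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed signatures for illustration. */
extern void alg_noise_tune_plain(uint8_t *buf, size_t n);
extern void alg_noise_tune_sse2(uint8_t *buf, size_t n);

int main(void)
{
    enum { N = 4096 };
    static uint8_t a[N], b[N];

    /* Feed both implementations identical random input. */
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] = (uint8_t)rand();

    alg_noise_tune_plain(a, N);
    alg_noise_tune_sse2(b, N);

    /* Compare byte for byte instead of eyeballing printed values. */
    if (memcmp(a, b, N) != 0) {
        fprintf(stderr, "outputs differ\n");
        return EXIT_FAILURE;
    }
    puts("outputs identical");
    return EXIT_SUCCESS;
}
```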
This pull request provides SSE2 vectorized implementations of `alg_update_reference_frame()` and `alg_noise_tune()`. Profiling with Callgrind on my Atom server showed that the first was the most expensive single function call. Rewriting it in branchless SSE2 code cuts Motion's load average roughly in half for me. Per-function benchmarking shows a speedup of around 2× for `alg_update_reference_frame()`, and around 4× for `alg_noise_tune()`. Results differ across hardware and compilers, but always show a significant speedup.

The plain functions have been lifted out of `alg.c` and placed in the new `alg/` subdirectory, along with their SSE2 versions. In `alg.c`, the preprocessor chooses between including the plain or SSE2 functions at compile time. This is perhaps not in line with the rest of the codebase, but it made it possible to build a test harness around the functions that checks their correctness and runs performance benchmarks. This harness can be found in `alg/tests`.
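For the benchmarking side, a minimal sketch of the approach (the timing method and function signature are assumptions, not the actual harness in `alg/tests`):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Assumed signature for illustration. */
extern void alg_noise_tune_sse2(uint8_t *buf, size_t n);

/* POSIX monotonic clock as a simple timer. */
static double seconds_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    enum { N = 4096, ITERS = 100000 };
    static uint8_t buf[N];

    double t0 = seconds_now();
    for (int i = 0; i < ITERS; i++)
        alg_noise_tune_sse2(buf, N);
    double t1 = seconds_now();

    printf("%.1f ns per call\n", (t1 - t0) / ITERS * 1e9);
    return 0;
}
```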
My website has a writeup of how I converted `alg_update_reference_frame()` to branchless code. It demonstrates how to derive the logic step by step, and is a kind of giant comment on the code. Hopefully it will help in reviewing the code for correctness.

This code took quite some time to write, but probably deserves more real-world testing than it's had so far. All comments welcome!