Easy C/C++ SIMD in a single header file, supporting AArch64 NEON, x86_64 SSE2, and an emulated implementation
- Supports C99, and C++11 with extensive operator overloading, type-safe branchless conditionals, type traits, templated conversions etc.
- Full test coverage.
- Easy and type-safe comparison, and branchless conditional selection of results.
- The C++ classes are written in terms of the C implementation.
- The C++ classes are (mostly) immutable, apart from the `+=`, `-=`, `*=`, `/=`, `&=`, `|=`, `^=` operators. Apart from these operators, all non-static methods are marked `const` and return a new value.
- Templated methods and type traits make it easy to write templated C++ SIMD code that works on several different types.
- The C implementation mostly uses macros (but only where arguments are evaluated no more than once), for faster debug builds. C code using the library should take almost the same amount of time to compile as if you had used the intrinsics directly.
- The emulated implementation, using built-in scalar types and standard library functions, allows for compilation on any target. (This implementation can be forced by defining `SIMD_GRANODI_FORCE_GENERIC`.) It also serves as documentation for those not familiar with SIMD intrinsics.
- Avoids undefined behaviour.
- Tested on GCC NEON/SSE2, Clang NEON/SSE2, and MSVC++ x64. Currently untested on MSVC++ NEON.
- `simd_granodi.h` is the only file you need.
- Designed to be easy enough to use for someone who has never done any SIMD programming before, but who is very familiar with C++.
This library was written so that the author could easily write cross-platform audio DSP code that runs on all x86_64 machines and on newer AArch64 machines (i.e. the latest hardware from Apple). It targets 128-bit SIMD types (32x4 or 64x2). However, the emulated implementation should run on any hardware.
The x86_64 implementation limits itself to SSE2 for three reasons: many otherwise-capable modern low-end CPUs do not support AVX; SSE2 is guaranteed to be implemented on all x86_64 machines; and later SSE extensions (e.g. SSE4.2) provide only a marginal speed improvement to the functionality of this library. Also, NEON only supports 128-bit vectors. However, it would be possible to add AVX support in future, and emulate this on NEON by wrapping 2 or more 128-bit registers.
If the header cannot detect that you are on x64 or AArch64 using Clang or GCC, or using MSVC++ (x64 only), it will revert to using a "generic" implementation which emulates SIMD using standard C/C++ built-in scalar types and standard library functions. This guarantees that you can compile for any target, even if performance is not as good.
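The fallback described above can be pictured as a preprocessor cascade. The following is only a sketch under assumptions: the `SKETCH_BACKEND_*` macros and `sketch_backend_name()` are hypothetical names invented for illustration, and the checks in `simd_granodi.h` itself may differ. Only the compiler-predefined macros (`__x86_64__`, `_M_X64`, `__aarch64__`) and `SIMD_GRANODI_FORCE_GENERIC` are real.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical backend-selection cascade. The SKETCH_* names are
// illustrative only and are not taken from simd_granodi.h.
#if defined(SIMD_GRANODI_FORCE_GENERIC)
  #define SKETCH_BACKEND_GENERIC   // user explicitly asked for emulation
#elif defined(__x86_64__) || defined(_M_X64)
  #define SKETCH_BACKEND_SSE2      // SSE2 is always available on x86_64
#elif defined(__aarch64__)
  #define SKETCH_BACKEND_NEON
#else
  #define SKETCH_BACKEND_GENERIC   // unknown target: emulate with scalars
#endif

// Reports which branch of the cascade was taken at compile time.
inline const char* sketch_backend_name() {
#if defined(SKETCH_BACKEND_SSE2)
    return "sse2";
#elif defined(SKETCH_BACKEND_NEON)
    return "neon";
#else
    return "generic";
#endif
}

inline void sketch_backend_demo() {
    // Exactly one backend is always selected, so the name is never empty.
    assert(std::strlen(sketch_backend_name()) > 0);
}
```

Because the final `#else` always selects the generic path, a translation unit using this pattern compiles on any target.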
Some platforms do not have intrinsic functions for some SIMD operations, and so they are emulated using standard library functions and may be slower. A list of these functions/macros, per-platform, is contained in a comment at the start of the `simd_granodi.h` file. If you are using the C++ classes, you may wish to search for those names in the file to see which methods they correspond to. In future this documentation will be updated with more details.
All of the C++ code is inside the namespace `simd_granodi`. All of the code examples below assume you are `using namespace simd_granodi;`, but you may choose to do something like `namespace sg = simd_granodi;`.
The C functions / macros are not inside a namespace (because of the use of macros), but typically have the prefix `sg_`.
All of the following vector types are 128-bit in size:

- `Vec_pi32` - Vector of 4 packed 32-bit signed integers. AKA `Vec_s32x4`
- `Vec_pi64` - Vector of 2 packed 64-bit signed integers. AKA `Vec_s64x2`
- `Vec_ps` - Vector of 4 packed single-precision floating point values. AKA `Vec_f32x4`
- `Vec_pd` - Vector of 2 packed double-precision floating point values. AKA `Vec_f64x2`

The following vector types are 64-bit in size:

- `Vec_s32x2` - Vector of 2 32-bit signed integers
- `Vec_f32x2` - Vector of 2 32-bit floating point values

Note: On SSE2, `Vec_s32x2` and `Vec_f32x2` are emulated. See below for explanation.
The following are "scalar wrapper" types, which allow you to write templated code that operates either on built-in C++ types or SIMD vectors:

- `Vec_s32x1` - Wrapper for `int32_t`
- `Vec_s64x1` - Wrapper for `int64_t`
- `Vec_f32x1` - Wrapper for `float`. AKA `Vec_ss`, vector of a single single-precision floating point value.
- `Vec_f64x1` - Wrapper for `double`. AKA `Vec_sd`, vector of a single double-precision floating point value.

The following are type-safe comparison types (implemented as bit-masks) that arise as the result of comparing two vectors:

- `Compare_pi32` - Result of comparing two `Vec_pi32`. AKA `Compare_s32x4`
- `Compare_pi64` - Result of comparing two `Vec_pi64`. AKA `Compare_s64x2`
- `Compare_ps` - Result of comparing two `Vec_ps`. AKA `Compare_f32x4`
- `Compare_pd` - Result of comparing two `Vec_pd`. AKA `Compare_f64x2`
- `Compare_s32x2` - Result of comparing two `Vec_s32x2`
- `Compare_f32x2` - Result of comparing two `Vec_f32x2`

The following are type-safe comparison types for the equivalent "scalar wrapper" types, allowing you to write templated code that operates either on vectors or C++ built-in types. They are a simple wrapper for `bool`, and should get optimized out when used:

- `Compare_s32x1` - Result of comparing two `Vec_s32x1`
- `Compare_s64x1` - Result of comparing two `Vec_s64x1`
- `Compare_f32x1` - Result of comparing two `Vec_f32x1`. AKA `Compare_ss`
- `Compare_f64x1` - Result of comparing two `Vec_f64x1`. AKA `Compare_sd`

On NEON, `Vec_s32x2` and `Vec_f32x2` are both native types. But on SSE2, they are emulated via a struct containing two `int32_t` or two `float` respectively. These types are useful as they take up less space than `Vec_pi32` and `Vec_ps` (i.e. if you hold a large array or other data structure containing them). But for long-running calculations, you can use the following type alias to convert to the fastest in-register type with a size of at least 2 elements:

- `Vec_s32x2::fast_register_t` - defined as `Vec_pi32` on SSE2, and `Vec_s32x2` on all other platforms.
- `Vec_f32x2::fast_register_t` - defined as `Vec_ps` on SSE2, and `Vec_f32x2` on all other platforms.

These are type aliases, so in order to use them you must use the `.to<NewType>()` templated method to convert. For example, `Vec_f32x2{5.0f, 4.0f}.to<typename Vec_f32x2::fast_register_t>()` will give you the fastest in-register vector for your platform containing values `{5.0f, 4.0f}`.
In order to obtain improved performance when compiling x64 code using the SIMD C++ classes, it is recommended to take the following steps:
Use the `sg_vectorcall(f)` macro to define your own functions which take `float`, `double`, or any SIMD type or SIMD class wrapper type as an argument, to avoid unnecessary loads and stores. On MSVC++ under x64, this macro is defined as:
#define sg_vectorcall(f) __vectorcall f
and on other platforms, this macro is defined as the identity macro:
#define sg_vectorcall(f) f
and so has no effect on your function declaration / definition.
Example of using the `sg_vectorcall()` macro:
float sg_vectorcall(my_func)(const float x) {
    return x + 12.0f;
}
Vec_ps sg_vectorcall(my_func_simd)(const Vec_ps x) {
    return x + 12.0f;
}
On MSVC++, passing SIMD class types by `const` reference can introduce unnecessary loads and stores.
Functions which take an argument whose type is one of the C++ SIMD classes cause MSVC++ to place a security cookie on the stack before that function is called, and check that cookie again when the function returns. (This only happens if the function is not inlined.) This is a sensible way to check for stack corruption, but can add overhead if you repeatedly call a function which (for example) takes a `Vec_ps` as an argument, but is large enough to not get inlined.
- All vector types are default-constructed to hold a value of zero. This is for safety and convenience, and this assignment typically gets optimized out.
- All comparison types are default-constructed to hold a value of `false`. With vector comparisons, this is a bit-mask composed of all zeros. With "scalar wrapper" comparisons, this is a `bool` with value `false`.
- All vector types have a "broadcast" constructor that takes a single value and "broadcasts" it to all elements of the vector. For example, `Vec_ps{3.0f}` will result in the vector `Vec_ps{3.0f, 3.0f, 3.0f, 3.0f}`. This also allows for convenient arithmetic with constants, for example `Vec_ps{3.0f} + 1.0f` will give the vector `{4.0f, 4.0f, 4.0f, 4.0f}` as the `1.0f` is implicitly constructed into a `Vec_ps`.
- Note that vector types do not have a broadcast constructor that takes an equivalent "scalar wrapper" type as an argument, due to constructor overload ambiguity when using literals/constants in code. I.e. you cannot do `Vec_ps{Vec_f32x1{1.0f}}`. However, there are easy workarounds for this: construct from the `.data()` element of the scalar wrapper type, or convert the scalar wrapper type using the `.to<NewType>()` method.
- All comparison types also have a "broadcast" constructor which accepts a `bool`. For vector comparisons, all of the bits of the bit-mask are set to `0` if this is `false`, or `1` if this is `true`. For the "scalar wrapper" types, this sets the value of the `bool` member.

The rationale behind the broadcast constructors is that they allow constants to be easily mixed in with vector code. For example, `Vec_pd{5.0, 4.0} + 2.0` will give the vector `{7.0, 6.0}` as `2.0` is implicitly constructed into a `Vec_pd` of `{2.0, 2.0}`.
There is no `Vec_pd` constructor that accepts a single `Vec_f64x1` scalar wrapper type to broadcast to all elements (using these types as an example). This is because `Vec_pd{5.0, 4.0} + 2.0` would be ambiguous between `Vec_pd{5.0, 4.0} + Vec_pd{Vec_f64x1{2.0}}` and `Vec_pd{5.0, 4.0} + Vec_pd{2.0}`, as both `Vec_pd` and `Vec_f64x1` can be constructed from a `double`.
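The implicit broadcast can be sketched with a tiny scalar stand-in. `SketchVec2d` below is a hypothetical type written only for this illustration (it is not the library's `Vec_pd`); it assumes nothing beyond a non-`explicit` single-argument constructor, which is what lets a bare `double` take part in vector arithmetic.

```cpp
#include <cassert>

// Hypothetical stand-in for a 2-lane double vector; NOT the library's Vec_pd.
struct SketchVec2d {
    double lane[2]; // lane[0] is index 0 (the lowest element)

    // "Broadcast" constructor: not explicit, so a double converts implicitly.
    SketchVec2d(double x) : lane{x, x} {}
    // Element-wise constructor, highest index first.
    SketchVec2d(double e1, double e0) : lane{e0, e1} {}
};

// Lane-wise addition of two vectors.
inline SketchVec2d operator+(const SketchVec2d& a, const SketchVec2d& b) {
    return SketchVec2d{a.lane[1] + b.lane[1], a.lane[0] + b.lane[0]};
}

inline void sketch_broadcast_demo() {
    // 2.0 is implicitly broadcast to {2.0, 2.0} before the addition.
    const SketchVec2d r = SketchVec2d{5.0, 4.0} + 2.0;
    assert(r.lane[1] == 7.0 && r.lane[0] == 6.0);
}
```

Adding a second implicit conversion path from `double` (for example via a scalar wrapper type) would make the `+ 2.0` expression ambiguous, which is the situation the paragraph above describes.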
- All vector types can be constructed by specifying the value of each element. Following the SSE2 convention, the elements are specified in reverse order. For example, `Vec_ps{3.0f, 2.0f, 1.0f, 0.0f}` creates a vector with the value 3 at index 3, value 2 at index 2, value 1 at index 1, and value 0 at index 0. When these values are compile-time constants, this constructor is typically optimized into a single instruction, otherwise it may take several instructions.
- Vector types with 4 elements can be constructed by specifying the value of the lowest 2 or 3 elements, and the remaining upper 2 or 1 elements (respectively) will be zeroed. E.g. `Vec_ps{3.0f, 7.0f}` evaluates as `{0.0f, 0.0f, 3.0f, 7.0f}`.
- All comparison types also have a "vector" constructor that takes a `bool` to specify a bit-mask for each element. `true` will be interpreted as all bits set to 1, and `false` will be interpreted as all bits set to 0. The reverse ordering convention is the same as for vector types.

Every `Vec_` type has `load`, `loadu`, `store`, and `storeu` methods. Warning: The `load` and `store` methods take pointers to data that must be correctly aligned.
The `load` and `loadu` (`u` means unaligned) methods are `static` methods that take a pointer to the vector's element type (i.e. `int32_t*`, `int64_t*`, `float*`, or `double*`) and construct a new vector from the elements pointed to. For example, `auto vec = Vec_pd::loadu(&my_double_array[4])` will result in a vector of value `{my_double_array[5], my_double_array[4]}`. Please note this is a `static` method to construct a new vector, and not a way of loading values into an existing vector.
The `store` and `storeu` methods are similar, except they are not `static` and return `void`. They store the vector at the given element pointer location. E.g. using the `vec` variable from the previous paragraph, you could then do `vec.store(&my_double_array[4])` to store the vector back to where you loaded it from (provided that address is correctly aligned; otherwise use `storeu`).
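The load/store pattern can be sketched in plain C++. `SketchLoadVec` is a hypothetical scalar emulation written for this illustration, not the library's `Vec_pd`; it only shows the shape of the interface (a `static` unaligned load that builds a new value, and a `const` unaligned store), with `std::memcpy` standing in for the real instructions.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical 2-lane double vector emulated with scalars; NOT Vec_pd.
struct SketchLoadVec {
    double lane[2]; // lane[0] is index 0 (the lowest element)

    // Static: constructs a NEW vector from memory; no alignment requirement.
    static SketchLoadVec loadu(const double* p) {
        SketchLoadVec v;
        std::memcpy(v.lane, p, sizeof v.lane);
        return v;
    }

    // Non-static and const: writes this vector's lanes to memory.
    void storeu(double* p) const { std::memcpy(p, lane, sizeof lane); }
};

inline void sketch_load_store_demo() {
    double data[6] = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0};
    // Index 0 of the vector comes from data[4], index 1 from data[5].
    const SketchLoadVec v = SketchLoadVec::loadu(&data[4]);
    assert(v.lane[0] == 4.0 && v.lane[1] == 5.0);
    v.storeu(&data[0]);
    assert(data[0] == 4.0 && data[1] == 5.0);
}
```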
The elements of a vector can be accessed with the templated `.get<int32_t index>()` method. The index is passed as the template argument. An out-of-range index will cause a compile-time error. Example: `Vec_pd{4.0, 3.0}.get<1>()` will return a `double` with value `4.0`.
Every vector and comparison type also has a `.data()` method that allows access to the underlying, built-in representation of that type (e.g. a value of type `float` or `__m128d`).
An element of a vector can be changed with the templated `.set<int32_t index>(new_val)` method. But note that this method is `const`, and returns a new vector, leaving the original vector unchanged.
E.g. `Vec_ps{7.0f, 2.0f, 5.0f, 4.0f}.set<2>(6.0f)` returns a new `Vec_ps` of value `{7.0f, 6.0f, 5.0f, 4.0f}`. Note that this is efficiently implemented in-register, and on most good compilers will not result in any loads or stores.
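A compile-time-indexed `get`/`set` pair with these semantics can be sketched as follows. `SketchVec4f` is a hypothetical scalar emulation invented for this example, not the library's `Vec_ps`; the point is only that the index is a template argument checked with `static_assert`, and that `set` is `const` and returns a modified copy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-lane float vector emulated with scalars; NOT Vec_ps.
struct SketchVec4f {
    float lane[4]; // lane[i] holds the element at index i

    // Elements are given highest index first, as in the text above.
    SketchVec4f(float e3, float e2, float e1, float e0) : lane{e0, e1, e2, e3} {}

    template <std::int32_t I>
    float get() const {
        static_assert(I >= 0 && I < 4, "index out of range");
        return lane[I];
    }

    // const: the original is untouched and a modified copy is returned.
    template <std::int32_t I>
    SketchVec4f set(float x) const {
        static_assert(I >= 0 && I < 4, "index out of range");
        SketchVec4f r = *this;
        r.lane[I] = x;
        return r;
    }
};

inline void sketch_get_set_demo() {
    const SketchVec4f v{7.0f, 2.0f, 5.0f, 4.0f};
    const SketchVec4f w = v.set<2>(6.0f);
    assert(w.get<2>() == 6.0f); // the copy has the new value
    assert(v.get<2>() == 2.0f); // the original is unchanged
}
```

Writing `v.get<4>()` against this sketch fails to compile, mirroring the compile-time range check described above.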
All `Vec_` types implement the following standard arithmetic operators: `+=`, `+`, `-=`, `-`, `*=`, `*`, `/=`, `/`. Also, integer types support both the pre- and postfix `++` and `--` operators.
All `Vec_` types, including floating-point types, implement the following bitwise operators: `&=`, `&`, `|=`, `|`, `^=`, `^`, `~`.
Note: You might assume that, since `Compare_` types use bitwise operations internally to mask or "select" a result, they also implement these bitwise operators. However, for reasons of type safety, they do not implement these operators at all. Instead, they only implement the logical operators `&&`, `||`, `!`, as well as `==` and `!=`.
When a comparison operator is used with two `Vec_` types, it returns a result of the corresponding `Compare_` type. All vector types implement the following comparison operators: `<`, `<=`, `==`, `!=`, `>=`, `>`.
All `Compare_` types support the following logical operators: `&&`, `||`, `!`.
All `Compare_` types support the following comparison operators: `==`, `!=`.
All `Compare_` types have two important methods: `.choose(vec_true, vec_false)` and `.choose_else_zero(vec_true)`. `vec_true` and `vec_false` must be of the `Vec_` type that corresponds to the `Compare_` type. These methods return a `Vec_` by selection.
This is best explained by example: `(Vec_ps{3.0f} < 2.0f).choose(7.0f, 8.0f)` will return `Vec_ps{8.0f}`, because 3 is not smaller than 2.
The `.choose_else_zero()` methods are a common optimization of `.choose()`. Whereas `.choose()` typically takes four CPU instructions, `.choose_else_zero()` only takes one. Using the example above, `(Vec_ps{3.0f} < 2.0f).choose_else_zero(7.0f)` would return `Vec_ps{0.0f}`.
For the 128-bit vector types, the `.choose()` and `.choose_else_zero()` methods simply use bit-masking. The advantage is that this is completely branch-less, but the disadvantage is that both "branches" or possibilities are calculated: the unneeded result is then discarded.
For the "scalar wrapper" types, the `.choose()` and `.choose_else_zero()` methods may also appear to calculate both "branches", and in fact they will do so in unoptimized builds. But often these methods will get inlined and the compiler will generate a conditional jump, so usually only one "branch" is calculated.
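The bit-masking selection can be sketched on a single 32-bit lane. The helpers below (`sketch_mask`, `sketch_choose`, `sketch_choose_else_zero`) are hypothetical names for this illustration and assume a 32-bit IEEE-754 `float`; they show the general mask-and-merge technique, not the library's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// One lane of a hypothetical comparison mask: all ones or all zeros.
inline std::uint32_t sketch_mask(bool b) { return b ? 0xFFFFFFFFu : 0u; }

// Branchless select: keep if_true's bits where the mask is set,
// and if_false's bits where it is clear.
inline float sketch_choose(std::uint32_t mask, float if_true, float if_false) {
    std::uint32_t t, f;
    std::memcpy(&t, &if_true, sizeof t);
    std::memcpy(&f, &if_false, sizeof f);
    const std::uint32_t bits = (mask & t) | (~mask & f);
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

// Cheaper variant: a single AND, since all-zero bits represent 0.0f.
inline float sketch_choose_else_zero(std::uint32_t mask, float if_true) {
    std::uint32_t t;
    std::memcpy(&t, &if_true, sizeof t);
    const std::uint32_t bits = mask & t;
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

inline void sketch_choose_demo() {
    const std::uint32_t m = sketch_mask(3.0f < 2.0f); // false: all zeros
    assert(sketch_choose(m, 7.0f, 8.0f) == 8.0f);
    assert(sketch_choose_else_zero(m, 7.0f) == 0.0f);
}
```

Both inputs are always fully evaluated before the merge, which is the trade-off the text above describes.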
All signed integer types implement bit-shifting methods. These are not implemented for floating point types. If you wish to shift by an immediate value (compile-time constant), you can use one of the following methods where `amount` must be a compile-time constant of type `int32_t`:

- `.shift_l_imm<amount>()` - Return a new vector with each element shifted left by `amount`.
- `.shift_rl_imm<amount>()` - Return a new vector with each element shifted right logically by `amount`.
- `.shift_ra_imm<amount>()` - Return a new vector with each element shifted right arithmetically by `amount`.

For shifting by an amount determined at run-time, by another `Vec_` of the same type, these are:

- `.shift_l(const Vec_ amount)` - Return a new vector with each element shifted left by the corresponding element in `amount`.
- `.shift_rl(const Vec_ amount)` - Return a new vector with each element shifted right logically by the corresponding element in `amount`.
- `.shift_ra(const Vec_ amount)` - Return a new vector with each element shifted right arithmetically by the corresponding element in `amount`.

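The difference between the three kinds of shift can be sketched on a single `int32_t` lane. The functions below are hypothetical helpers for illustration, not the library's code; they assume a two's complement `int32_t` and a shift amount in the range 0 to 31, and they route through `uint32_t` so that no step relies on undefined or implementation-defined signed shifts.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret unsigned bits as signed without implementation-defined casts.
inline std::int32_t sketch_to_signed(std::uint32_t u) {
    std::int32_t s;
    std::memcpy(&s, &u, sizeof s);
    return s;
}

// Left shift: bits move up, zeros enter at the bottom.
inline std::int32_t sketch_shift_l(std::int32_t x, int n) {
    return sketch_to_signed(static_cast<std::uint32_t>(x) << n);
}

// Logical right shift: zeros enter at the top, so the sign is not preserved.
inline std::int32_t sketch_shift_rl(std::int32_t x, int n) {
    return sketch_to_signed(static_cast<std::uint32_t>(x) >> n);
}

// Arithmetic right shift: copies of the sign bit enter at the top.
inline std::int32_t sketch_shift_ra(std::int32_t x, int n) {
    const std::uint32_t ux = static_cast<std::uint32_t>(x);
    // For negatives, shift the complement and complement back to fill with ones.
    const std::uint32_t r = (x < 0) ? ~(~ux >> n) : (ux >> n);
    return sketch_to_signed(r);
}

inline void sketch_shift_demo() {
    assert(sketch_shift_l(3, 2) == 12);
    assert(sketch_shift_rl(-8, 1) == 2147483644); // 0x7FFFFFFC
    assert(sketch_shift_ra(-8, 1) == -4);
}
```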
All `Vec_` types with more than one element have a templated `.shuffle<>()` method that takes either 2 or 4 template parameters of type `int32_t` and returns a new vector with its elements rearranged. The template parameters represent the source indexes for the new vector, and this method will fail to compile if they are out of range.
This is best explained via example:

- `Vec_ps{7.0f, 6.0f, 5.0f, 4.0f}.shuffle<3, 2, 1, 0>()` returns `Vec_ps{7.0f, 6.0f, 5.0f, 4.0f}` - this is the "identity" shuffle as nothing changes.
- `Vec_ps{7.0f, 6.0f, 5.0f, 4.0f}.shuffle<0, 1, 2, 3>()` returns `Vec_ps{4.0f, 5.0f, 6.0f, 7.0f}`. We have reversed the elements.
- `Vec_pd{7.0, 6.0}.shuffle<1, 1>()` returns `Vec_pd{7.0, 7.0}`, as we choose the highest (1) index as the source for both elements of our new vector.

On SSE2, shuffles take 1 CPU instruction. On NEON, they take between 1 and 3 CPU instructions, depending on the shuffle.
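The source-index convention can be sketched with a `std::array` standing in for a 4-element vector. `sketch_shuffle` is a hypothetical free function for illustration only; it assumes array position `i` holds the element at index `i`, and that the template parameters are listed highest destination index first, matching the examples above.

```cpp
#include <array>
#include <cassert>

// Hypothetical shuffle over 4 floats. SrcN names the source index whose
// value ends up at destination index N.
template <int Src3, int Src2, int Src1, int Src0>
std::array<float, 4> sketch_shuffle(const std::array<float, 4>& v) {
    static_assert(Src3 >= 0 && Src3 < 4 && Src2 >= 0 && Src2 < 4 &&
                  Src1 >= 0 && Src1 < 4 && Src0 >= 0 && Src0 < 4,
                  "shuffle index out of range");
    return {{v[Src0], v[Src1], v[Src2], v[Src3]}};
}

inline void sketch_shuffle_demo() {
    // Index order: v[0] = 4, v[1] = 5, v[2] = 6, v[3] = 7.
    const std::array<float, 4> v = {{4.0f, 5.0f, 6.0f, 7.0f}};

    const std::array<float, 4> same = sketch_shuffle<3, 2, 1, 0>(v);
    assert(same == v); // identity shuffle

    const std::array<float, 4> rev = sketch_shuffle<0, 1, 2, 3>(v);
    assert(rev[3] == 4.0f && rev[0] == 7.0f); // elements reversed
}
```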
Any `Vec_` type can be bitcast to any other `Vec_` type of the same total size. (The elements do not need to be the same size, but the total size of the two vectors must be the same.) To do this, you use the `.bitcast<typename To>()` method. E.g. `Vec_ps{4.0f}.bitcast<Vec_pi64>()` will re-interpret 4 packed 32-bit floating point values as 2 packed 64-bit signed integers. This particular bitcast is allowed because they are both the same size of 128 bits.
For 128-bit vectors, bitcasting is usually a no-op (compiles to zero CPU instructions), but the subsequent switch to a different "pipeline" may or may not incur a small performance penalty depending on the hardware. But for the "scalar wrapper" types, bitcasting is achieved via `memcpy()`, which usually gets optimized into a single register move instruction.
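The `memcpy()`-based approach mentioned for the scalar wrapper types can be sketched generically. `sketch_bitcast` is a hypothetical helper, not the library's method, and the expected bit pattern in the demo assumes an IEEE-754 `float`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reinterpret the bytes of one trivially copyable type as another of the
// same size. memcpy avoids the undefined behaviour of pointer type-punning.
template <typename To, typename From>
To sketch_bitcast(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value &&
                  std::is_trivially_copyable<From>::value,
                  "types must be trivially copyable");
    To to;
    std::memcpy(&to, &from, sizeof to);
    return to;
}

inline void sketch_bitcast_demo() {
    // 1.0f is 0x3F800000 in IEEE-754 single precision.
    const std::uint32_t bits = sketch_bitcast<std::uint32_t>(1.0f);
    assert(bits == 0x3F800000u);
    // Round-tripping restores the original value exactly.
    assert(sketch_bitcast<float>(bits) == 1.0f);
}
```

A mismatched size, such as casting a `float` to a `std::uint64_t`, is rejected at compile time by the first `static_assert`.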
You cannot bitcast a `Compare_` type to any other type (including another `Compare_` type), but you can convert between `Compare_` types (see below).
You can convert to and from any `Vec_` type. In general, this is achieved using the templated `.to<typename To>()` method. However, this method is not implemented for converting from a floating point type to an integer type. This is because a rounding method needs to be specified, using one of the following templated methods: `.truncate<typename To>()`, `.floor<typename To>()`, or `.nearest<typename To>()`.
Also, at the time of writing, you cannot convert from a vector containing more than one element to a scalar wrapper type containing only one element. See the section below for how to get around this.
When you convert from a 32x4 vector (i.e. `Vec_pi32` and `Vec_ps`) to a 64x2 or 32x2 vector (i.e. `Vec_pi64`, `Vec_pd`, `Vec_s32x2`, and `Vec_f32x2`), the lowest two elements from the 32x4 vector (at indexes 0 and 1) will be converted to new values for the 64x2 or 32x2 vector (and placed at indexes 0 and 1), and the highest two elements from the 32x4 vector (at indexes 2 and 3) will be discarded.
When you convert from a 64x2 vector (i.e. `Vec_pi64` and `Vec_pd`) to a 32x4 vector (i.e. `Vec_pi32` and `Vec_ps`), the elements from the 64x2 vector (at indexes 0 and 1) will be converted to new values and placed into indexes 0 and 1 of the 32x4 vector. Indexes 2 and 3 of the 32x4 vector will be set to zero.
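The element placement in these two directions can be sketched with `std::array` standing in for the vector types. The function names are hypothetical and exist only for this illustration; array position `i` represents the element at index `i`.

```cpp
#include <array>
#include <cassert>

// 32x4 -> 64x2: indexes 0 and 1 are converted, indexes 2 and 3 are discarded.
inline std::array<double, 2> sketch_f32x4_to_f64x2(const std::array<float, 4>& v) {
    return {{static_cast<double>(v[0]), static_cast<double>(v[1])}};
}

// 64x2 -> 32x4: both elements land at indexes 0 and 1, indexes 2 and 3 are zero.
inline std::array<float, 4> sketch_f64x2_to_f32x4(const std::array<double, 2>& v) {
    return {{static_cast<float>(v[0]), static_cast<float>(v[1]), 0.0f, 0.0f}};
}

inline void sketch_resize_demo() {
    const std::array<float, 4> wide = {{1.0f, 2.0f, 3.0f, 4.0f}};
    const std::array<double, 2> narrow = sketch_f32x4_to_f64x2(wide);
    assert(narrow[0] == 1.0 && narrow[1] == 2.0);

    const std::array<float, 4> back = sketch_f64x2_to_f32x4(narrow);
    // The upper two elements are not recovered; they come back as zero.
    assert(back[0] == 1.0f && back[1] == 2.0f);
    assert(back[2] == 0.0f && back[3] == 0.0f);
}
```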
- A "scalar wrapper" vector can be converted to any other vector type using the conversion methods.
- A vector with more than one element cannot be converted to a "scalar wrapper" vector type using the conversion methods. Instead, you must use the templated `.get<int32_t index>()` method to choose an element from the vector, then use that element to construct a new "scalar wrapper" vector type.
- When converting a 64-bit float type to a 32-bit float type, the platform's default rounding method will be used. This is usually "round to nearest", with ties rounding to even.
- When converting a 32-bit float vector type to a 64-bit float vector type, there will be no loss of precision as the 64-bit type can represent the 32-bit type exactly.

When converting from a float type to an integer type, you cannot use the templated `.to<typename To>()` method. Instead, you must use one of the following methods:

- `.truncate<typename To>()`: Round towards zero.
- `.floor<typename To>()`: Round towards minus infinity.
- `.nearest<typename To>()`: Round to nearest, with ties rounding to even.

Warning: the templated `.nearest<typename To>()` method on SSE2 assumes that you have not changed the default rounding mode, as this is the mode it will use to convert.
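The three rounding behaviours can be sketched on scalars with the standard library. These helpers are hypothetical and are not the library's implementation; they assume the input is within the range of `int32_t` (an out-of-range conversion is undefined behaviour) and, for the nearest case, that the floating point rounding mode is still the default round-to-nearest-even, which is what `std::nearbyint` follows.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Round towards zero: this is what a plain float-to-int cast does.
inline std::int32_t sketch_truncate(float x) {
    return static_cast<std::int32_t>(x);
}

// Round towards minus infinity.
inline std::int32_t sketch_floor(float x) {
    return static_cast<std::int32_t>(std::floor(x));
}

// Round using the current rounding mode (default: nearest, ties to even).
inline std::int32_t sketch_nearest(float x) {
    return static_cast<std::int32_t>(std::nearbyint(x));
}

inline void sketch_rounding_demo() {
    assert(sketch_truncate(-1.75f) == -1);
    assert(sketch_floor(-1.75f) == -2);
    assert(sketch_nearest(2.5f) == 2); // tie rounds to the even neighbour
    assert(sketch_nearest(3.5f) == 4);
}
```

As with the SSE2 warning above, changing the rounding mode (for example with `std::fesetround`) would change what `sketch_nearest` returns.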
... allows you to construct a new `Vec_` type from a different type, with the exact same behaviour as the `.to<typename To>()` method. As with `.to<typename To>()`, you cannot construct an integer type from a float type, and you cannot convert a vector containing more than one element to a vector containing only one element.
The `.to<typename To>()` method also works for `Compare_` types, to resize bitmasks. (For example, comparing two `Vec_pd` and using the result of that comparison to select `Vec_pi64` results.)
For `Compare_` types whose elements are of different sizes (i.e. 32x4 or 64x2), the conversion behaviour is identical to that described for vectors above: the bitmasks will be resized, and the lowest two elements used.
To aid in templated programming, all `Vec_` types define the following type aliases as members:

- `elem_t`: The built-in type that corresponds to the elements of the vector. E.g. `Vec_pi32::elem_t` is `int32_t`.
- `compare_t`: The `Compare_` type that corresponds to the vector. E.g. `Vec_pi32::compare_t` is `Compare_pi32`.
- `fast_register_t`: The fastest in-register type that has at least as many elements. For example, on SSE2, `Vec_f32x2::fast_register_t` is defined as `Vec_ps`, but on all other platforms it is defined as `Vec_f32x2`.

All `Vec_` types also define the following `static constexpr` members:

- `is_int_t` - a `bool` indicating whether the vector is an integer type or not.
- `is_float_t` - a `bool` indicating whether the vector is a floating point type or not.
- `elem_size` - a `std::size_t` giving the size, in bytes, of each element of the vector. E.g. `Vec_ps::elem_size` is equal to 4, because a `float` takes up 4 bytes.
- `elem_count` - a `std::size_t` giving the number of elements the vector has. E.g. `Vec_pd::elem_count` is 2, because it contains two values of type `double`.

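Members like these let generic code query a vector type instead of hard-coding it. In the sketch below, `SketchF64x2` is a hypothetical stand-in that merely reuses the member names listed above (it is not the library's `Vec_pd`), and `sketch_sum` is an invented helper showing how a template might use them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in exposing the member names described above.
struct SketchF64x2 {
    using elem_t = double;
    static constexpr bool is_int_t = false;
    static constexpr bool is_float_t = true;
    static constexpr std::size_t elem_size = sizeof(double);
    static constexpr std::size_t elem_count = 2;
    elem_t lane[elem_count]; // lane[i] holds the element at index i
};

// Works for any type that provides elem_t, elem_count and a lane array.
template <typename VecType>
typename VecType::elem_t sketch_sum(const VecType& v) {
    typename VecType::elem_t total = 0;
    for (std::size_t i = 0; i < VecType::elem_count; ++i) {
        total += v.lane[i];
    }
    return total;
}

inline void sketch_traits_demo() {
    static_assert(SketchF64x2::is_float_t && !SketchF64x2::is_int_t,
                  "a double vector is a floating point type");
    static_assert(SketchF64x2::elem_size * SketchF64x2::elem_count == 16,
                  "two 8-byte elements make 128 bits");
    const SketchF64x2 v = {{1.5, 2.5}};
    assert(sketch_sum(v) == 4.0);
}
```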
All floating point `Vec_` types define the following type alias:

- `fast_convert_int_t` - An integer type that it is fast to convert to and from. For example, on SSE2, `Vec_pd::fast_convert_int_t` is defined as `Vec_pi32`, but on NEON, it is defined as `Vec_pi64`.

- `SGType<typename ElemType, std::size_t ElemCount>` - This `struct` allows you to query its `value` member to find a vector type with the given element type and number of elements. E.g. `typename SGType<float, 4>::value` gives you `Vec_ps`.
- `SGIntType<std::size_t ElemSize, std::size_t ElemCount>` - allows you to find an integer vector with the given element size (in bytes) and number of elements. E.g. `typename SGIntType<8, 2>::value` gives you `Vec_pi64`.
- `SGFloatType<std::size_t ElemSize, std::size_t ElemCount>` - as with `SGIntType`, but with floating point types. E.g. `typename SGFloatType<4, 2>::value` gives you `Vec_f32x2`.
- `SGEquivIntType<typename VecType>` - allows you to find an integer type whose element size and element count are the same as `VecType`. E.g. `typename SGEquivIntType<Vec_pd>::value` gives you `Vec_pi64`. Note that this does not take conversion speed into account.
- `SGEquivFloatType<typename VecType>` - as with `SGEquivIntType`, but with equivalent floating point types. E.g. `typename SGEquivFloatType<Vec_pi32>::value` gives you `Vec_ps`. Note that this does not take conversion speed into account.

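A type lookup of this kind is commonly built from template specializations. The sketch below is an assumption about the general technique, not a copy of the library's code: `SketchType`, `SketchF32x4` and `SketchF64x2` are hypothetical names, and only the nested `value` alias mirrors the naming described above.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical vector types used only as lookup results.
struct SketchF32x4 {};
struct SketchF64x2 {};

// Primary template is left undefined, so an unsupported combination
// of element type and count fails to compile.
template <typename ElemType, std::size_t ElemCount>
struct SketchType;

// One specialization per supported combination.
template <>
struct SketchType<float, 4> { using value = SketchF32x4; };

template <>
struct SketchType<double, 2> { using value = SketchF64x2; };

// Generic code can now name a vector type from its element type and count.
template <typename ElemType, std::size_t ElemCount>
using sketch_type_t = typename SketchType<ElemType, ElemCount>::value;

inline void sketch_lookup_demo() {
    static_assert(std::is_same<sketch_type_t<float, 4>, SketchF32x4>::value,
                  "float x 4 maps to the 4-lane float type");
    static_assert(std::is_same<sketch_type_t<double, 2>, SketchF64x2>::value,
                  "double x 2 maps to the 2-lane double type");
}
```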
More documentation to follow in a future update.