Add support for faster shuffles #280

velvia · 2020-04-02T21:24:21Z

Currently, shuffle1_dyn on u32x8 is not optimized: the fallback path is used, which results in a whole mess of extract intrinsics. It is not very fast.

Can we please add support for _mm256_permutevar8x32_epi32 and similar variants for the u32x8 (and f32x8, etc.) types? It would be a fairly large speedup.
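
To make the request concrete, here is a minimal sketch of the kind of dispatch being asked for, written against plain arrays and std::arch rather than packed_simd's own types. The function names (shuffle_scalar, shuffle_avx2, shuffle_dyn) are hypothetical and are not part of packed_simd. Masking indices to their low 3 bits is an assumption made here so that the scalar path agrees with what the AVX2 instruction does; it is not a statement about how shuffle1_dyn treats out-of-range indices.

```rust
// Scalar reference: out[i] = data[idx[i] & 7].
// The `& 7` mirrors vpermd, which only looks at the low 3 bits of each index
// (an assumption for this sketch, not packed_simd's documented behavior).
pub fn shuffle_scalar(data: [u32; 8], idx: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for i in 0..8 {
        out[i] = data[(idx[i] & 7) as usize];
    }
    out
}

// AVX2 path: a single cross-lane permute instead of per-lane extracts.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn shuffle_avx2(data: [u32; 8], idx: [u32; 8]) -> [u32; 8] {
    use std::arch::x86_64::*;
    // Unaligned loads, since plain arrays carry no 32-byte alignment guarantee.
    let v = _mm256_loadu_si256(data.as_ptr() as *const __m256i);
    let i = _mm256_loadu_si256(idx.as_ptr() as *const __m256i);
    let r = _mm256_permutevar8x32_epi32(v, i);
    let mut out = [0u32; 8];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, r);
    out
}

// Hypothetical entry point: use AVX2 when the CPU reports it at runtime,
// otherwise fall back to the scalar loop.
pub fn shuffle_dyn(data: [u32; 8], idx: [u32; 8]) -> [u32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 support was just checked.
            return unsafe { shuffle_avx2(data, idx) };
        }
    }
    shuffle_scalar(data, idx)
}
```

Calling shuffle_dyn with the indices [7, 6, 5, 4, 3, 2, 1, 0] reverses the eight lanes. For f32x8 the analogous intrinsic would be _mm256_permutevar8x32_ps.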

Thanks

aldanor · 2020-12-25T00:03:14Z

I'm wondering about this as well (it's 30x slower than it should be, with no warning to the user).

(Should this be posted to the stdsimd repo?)

Lokathor · 2020-12-25T00:13:20Z

Yes, all development has moved there.

Lokathor added the Enhancement (New feature or request) label on Sep 22, 2020
