
Badly optimized small bytes array search #75659

Closed · leonardo-m opened this issue Aug 18, 2020 · 1 comment · Fixed by #76971

leonardo-m commented Aug 18, 2020

type T = u8;

pub fn foo1(x: T, data: &[T; 1]) -> bool {
    data.contains(&x)
}

pub fn foo2(x: T, data: &[T; 2]) -> bool {
    data.contains(&x)
}

pub fn foo3(x: T, data: &[T; 3]) -> bool {
    data.contains(&x)
}

pub fn foo4(x: T, data: &[T; 4]) -> bool {
    data.contains(&x)
}

pub fn foo16(x: T, data: &[T; 16]) -> bool {
    data.contains(&x)
}

When T is u16, u32, u64 or u128, the generated code is reasonable, e.g. for u32 (rustc 1.47.0-nightly 7e6d6e5 2020-08-16, using -C opt-level=3 and similar settings):

example::foo1:
        cmp     dword ptr [rsi], edi
        sete    al
        ret

example::foo2:
        cmp     dword ptr [rsi], edi
        sete    cl
        cmp     dword ptr [rsi + 4], edi
        sete    al
        or      al, cl
        ret

example::foo3:
        cmp     dword ptr [rsi], edi
        jne     .LBB2_2
        mov     al, 1
        ret
.LBB2_2:
        cmp     dword ptr [rsi + 4], edi
        sete    cl
        cmp     dword ptr [rsi + 8], edi
        sete    al
        or      al, cl
        ret

example::foo4:
        cmp     dword ptr [rsi], edi
        je      .LBB3_3
        cmp     dword ptr [rsi + 4], edi
        jne     .LBB3_2
.LBB3_3:
        mov     al, 1
        ret
.LBB3_2:
        cmp     dword ptr [rsi + 8], edi
        sete    cl
        cmp     dword ptr [rsi + 12], edi
        sete    al
        or      al, cl
        ret

example::foo16:
        xor     eax, eax
.LBB4_1:
        cmp     rax, 64
        je      .LBB4_2
        cmp     dword ptr [rsi + rax], edi
        lea     rax, [rax + 4]
        jne     .LBB4_1
        mov     al, 1
        ret
.LBB4_2:
        xor     eax, eax
        ret

But when T is u8 or i8 the result looks suboptimal for small arrays, because for such tiny arrays the overhead of calling another function is significant:

example::foo1:
        push    rax
        mov     byte ptr [rsp + 7], dil
        lea     rdi, [rsp + 7]
        mov     edx, 1
        call    qword ptr [rip + <u8 as core::slice::SliceContains>::slice_contains@GOTPCREL]
        pop     rcx
        ret

example::foo2:
        push    rax
        mov     byte ptr [rsp + 7], dil
        lea     rdi, [rsp + 7]
        mov     edx, 2
        call    qword ptr [rip + <u8 as core::slice::SliceContains>::slice_contains@GOTPCREL]
        pop     rcx
        ret

example::foo3:
        push    rax
        mov     byte ptr [rsp + 7], dil
        lea     rdi, [rsp + 7]
        mov     edx, 3
        call    qword ptr [rip + <u8 as core::slice::SliceContains>::slice_contains@GOTPCREL]
        pop     rcx
        ret

example::foo4:
        push    rax
        mov     byte ptr [rsp + 7], dil
        lea     rdi, [rsp + 7]
        mov     edx, 4
        call    qword ptr [rip + <u8 as core::slice::SliceContains>::slice_contains@GOTPCREL]
        pop     rcx
        ret

example::foo16:
        push    rax
        mov     byte ptr [rsp + 7], dil
        lea     rdi, [rsp + 7]
        mov     edx, 16
        call    qword ptr [rip + <u8 as core::slice::SliceContains>::slice_contains@GOTPCREL]
        pop     rcx
        ret
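
As a point of comparison, here is a hedged workaround sketch (the function name foo4_inline is mine, not from the issue): writing the equality check by hand avoids the u8 SliceContains specialization, so the comparison loop stays in the caller where the optimizer can usually unroll it for a tiny fixed-size array.

// Hypothetical workaround sketch: comparing bytes directly instead of calling
// `contains` should stay on the generic iterator path, avoiding the
// out-of-line `slice_contains`/memchr call; a loop over a `[u8; 4]` is small
// enough that the optimizer can typically turn it into a few inline compares.
pub fn foo4_inline(x: u8, data: &[u8; 4]) -> bool {
    data.iter().any(|&b| b == x)
}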
LeSeulArtichaut added the I-slow label (Issue: Problems and improvements with respect to performance of generated code.) on Aug 18, 2020
tesuji (Contributor) commented Aug 18, 2020

I think that's because we use a Rust-implemented memchr:

trait SliceContains: Sized {
    fn slice_contains(&self, x: &[Self]) -> bool;
}

impl<T> SliceContains for T
where
    T: PartialEq,
{
    default fn slice_contains(&self, x: &[Self]) -> bool {
        x.iter().any(|y| *y == *self)
    }
}

impl SliceContains for u8 {
    fn slice_contains(&self, x: &[Self]) -> bool {
        memchr::memchr(*self, x).is_some()
    }
}

impl SliceContains for i8 {
    fn slice_contains(&self, x: &[Self]) -> bool {
        let byte = *self as u8;
        // SAFETY: `i8` and `u8` have the same memory layout, thus casting `x.as_ptr()`
        // as `*const u8` is safe. The `x.as_ptr()` comes from a reference and is thus guaranteed
        // to be valid for reads for the length of the slice `x.len()`, which cannot be larger
        // than `isize::MAX`. The returned slice is never mutated.
        let bytes: &[u8] = unsafe { from_raw_parts(x.as_ptr() as *const u8, x.len()) };
        memchr::memchr(byte, bytes).is_some()
    }
}

While Clang has a built-in memchr: https://clang.llvm.org/docs/LanguageExtensions.html#string-builtins
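
One possible direction (a minimal standalone sketch, not necessarily what the eventual fix does, and using the external memchr crate as a stand-in for the internal helper) is to scan very short slices with a plain comparison loop that can be inlined, and only fall back to the out-of-line memchr for longer inputs:

// Standalone sketch, assuming the external `memchr` crate in place of the
// internal helper: handle tiny slices with an inlinable comparison loop and
// defer to memchr only when the haystack is long enough to benefit from it.
pub fn contains_byte(needle: u8, haystack: &[u8]) -> bool {
    const SMALL: usize = 16; // hypothetical cutoff; a real value would need benchmarking
    if haystack.len() < SMALL {
        haystack.iter().any(|&b| b == needle)
    } else {
        memchr::memchr(needle, haystack).is_some()
    }
}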
