[Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789

Closed
echesakov opened this issue Oct 23, 2020 · 18 comments · Fixed by #46609

Comments

@echesakov
Contributor

Currently, void CodeGen::genZeroInitFrame(int untrLclHi, int untrLclLo, regNumber initReg, bool* pInitRegZeroed)
inlines a zeroing loop for frames larger than 10 machine words (80 bytes on Arm64). The loop uses a wzr or xzr register and stp or str instructions and can write up to 16 bytes of zeros at once.

Following the ideas in #32538, we can:

  1. zero-init a SIMD register qReg
  2. use the register qReg instead of xzr with stp qReg, qReg, [mem], allowing up to 32 bytes of zeros to be written to memory in one instruction.

We can also consider increasing the upper boundary (i.e. 10 machine words) to some larger number.
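
As a rough illustration of the idea (a hedged sketch only, not what the JIT currently emits; register choice and offsets are arbitrary), zeroing a 128-byte region of the frame could then look like:

        movi    v16.16b, #0                     // zero-init a SIMD register once
        stp     q16, q16, [sp, #96]             // each stp q, q writes 32 bytes of zeros
        stp     q16, q16, [sp, #64]
        stp     q16, q16, [sp, #32]
        stp     q16, q16, [sp]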

It seems that Clang/LLVM uses a similar approach for initializing stack-allocated structs:
https://godbolt.org/z/8rKxvn

For example,

#include <string.h>

struct int32x4 
{
    int _1;
    int _2;
    int _3;
    int _4;
};

struct int32x8 
{
    int _1;
    int _2;
    int _3;
    int _4;
    int _5;
    int _6;
    int _7;
    int _8;
};

struct int32x16
{
    int _1;
    int _2;
    int _3;
    int _4;
    int _5;
    int _6;
    int _7;
    int _8;
    int _9;
    int _10;
    int _11;
    int _12;
    int _13;
    int _14;
    int _15;
    int _16;
};

void ZeroInt32x4(void* pDst, int cnt)
{
    int32x4 src = { };
    memcpy(pDst, &src, cnt);
}

void ZeroInt32x8(void* pDst, int cnt)
{
    int32x8 src = { };
    memcpy(pDst, &src, cnt);
}

void ZeroInt32x16(void* pDst, int cnt)
{
    int32x16 src = { };
    memcpy(pDst, &src, cnt);
}

would be compiled down to

ZeroInt32x4(void*, int):                     // @ZeroInt32x4(void*, int)
        sub     sp, sp, #32                     // =32
        stp     x29, x30, [sp, #16]             // 16-byte Folded Spill
        add     x29, sp, #16                    // =16
        sxtw    x2, w1
        mov     x1, sp
        stp     xzr, xzr, [sp]
        bl      memcpy
        ldp     x29, x30, [sp, #16]             // 16-byte Folded Reload
        add     sp, sp, #32                     // =32
        ret
ZeroInt32x8(void*, int):                     // @ZeroInt32x8(void*, int)
        sub     sp, sp, #48                     // =48
        stp     x29, x30, [sp, #32]             // 16-byte Folded Spill
        add     x29, sp, #32                    // =32
        movi    v0.2d, #0000000000000000
        sxtw    x2, w1
        mov     x1, sp
        stp     q0, q0, [sp]
        bl      memcpy
        ldp     x29, x30, [sp, #32]             // 16-byte Folded Reload
        add     sp, sp, #48                     // =48
        ret
ZeroInt32x16(void*, int):                    // @ZeroInt32x16(void*, int)
        sub     sp, sp, #80                     // =80
        stp     x29, x30, [sp, #64]             // 16-byte Folded Spill
        add     x29, sp, #64                    // =64
        movi    v0.2d, #0000000000000000
        sxtw    x2, w1
        mov     x1, sp
        stp     q0, q0, [sp, #32]
        stp     q0, q0, [sp]
        bl      memcpy
        ldp     x29, x30, [sp, #64]             // 16-byte Folded Reload
        add     sp, sp, #80                     // =80
        ret

@dotnet/jit-contrib @TamarChristinaArm

@echesakov echesakov added the arch-arm64 and area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) labels Oct 23, 2020
@echesakov echesakov added this to the 6.0.0 milestone Oct 23, 2020
@echesakov echesakov self-assigned this Oct 23, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged (New issue has not been triaged by the area owner) label Oct 23, 2020
@echesakov echesakov removed the untriaged (New issue has not been triaged by the area owner) label Oct 24, 2020
@TamarChristinaArm
Contributor

Indeed, this would be a good optimization. You might want to consider treating this as a general inlined memset and memcpy like the C compilers do, where 0 is just a small optimization. This way you can use it for Array.Fill and Span.Fill for small sets as well. In either case, here are some interesting cases for you to look at: https://godbolt.org/z/cff7Ed

In particular, look at how the compiler behaves when the number of bytes to copy is not a power of 2. We then just issue overlapping stores: e.g. to fill 46 bytes, we use one stp of q registers to fill 32 bytes, and fill the remaining 14 bytes with another str of a q register, moving the pointer 2 bytes back into the region we just filled, giving the overlapping set.

#include <string.h>

void ZeroInt32x4(void* pDst, void* pSrc)
{
    memset(pDst, 0, 16);
}

void ZeroInt32x8(void* pDst, void* pSrc)
{
    memset(pDst, 0, 32);
}

void ZeroInt32x16(void* pDst, void* pSrc)
{
    memset(pDst, 0, 64);
}

void ZeroInt32x5(void* pDst, void* pSrc)
{
    memset(pDst, 0, 24);
}

void memsetCharUneven(void* pDst)
{
    memset(pDst, 8, 30);
}

void memsetCharUnevenComplex(void* pDst)
{
    memset(pDst, 'c', 30);
}

void memsetChar(void* pDst)
{
    memset(pDst, 'c', 32);
}

void memsetCharUnevenUnknown(void* pDst, char c)
{
    memset(pDst, c, 30);
}

void memsetCharUnknown(void* pDst, char c)
{
    memset(pDst, c, 30);
}

void memsetUneven(void* pDst, void* pSrc)
{
    memset(pDst, 0, 46);
}

void MemcpyInt32x16(void* pDst, void* pSrc)
{
    memcpy(pDst, pSrc, 64);
}

and its assembly:

ZeroInt32x4(void*, void*):                    // @ZeroInt32x4(void*, void*)
        stp     xzr, xzr, [x0]
        ret
ZeroInt32x8(void*, void*):                    // @ZeroInt32x8(void*, void*)
        movi    v0.2d, #0000000000000000
        stp     q0, q0, [x0]
        ret
ZeroInt32x16(void*, void*):                   // @ZeroInt32x16(void*, void*)
        movi    v0.2d, #0000000000000000
        stp     q0, q0, [x0, #32]
        stp     q0, q0, [x0]
        ret
ZeroInt32x5(void*, void*):                    // @ZeroInt32x5(void*, void*)
        stp     xzr, xzr, [x0]
        str     xzr, [x0, #16]
        ret
memsetCharUneven(void*):                 // @memsetCharUneven(void*)
        mov     x8, #578721382704613384
        stur    x8, [x0, #22]
        stp     x8, x8, [x0, #8]
        str     x8, [x0]
        ret
memsetCharUnevenComplex(void*):          // @memsetCharUnevenComplex(void*)
        mov     x8, #25443
        movk    x8, #25443, lsl #16
        movk    x8, #25443, lsl #32
        movk    x8, #25443, lsl #48
        stur    x8, [x0, #22]
        stp     x8, x8, [x0, #8]
        str     x8, [x0]
        ret
memsetChar(void*):                       // @memsetChar(void*)
        movi    v0.16b, #99
        stp     q0, q0, [x0]
        ret
memsetCharUnevenUnknown(void*, char):         // @memsetCharUnevenUnknown(void*, char)
        and     x8, x1, #0xff
        mov     x9, #72340172838076673
        mul     x8, x8, x9
        stur    x8, [x0, #22]
        stp     x8, x8, [x0, #8]
        str     x8, [x0]
        ret
memsetCharUnknown(void*, char):               // @memsetCharUnknown(void*, char)
        and     x8, x1, #0xff
        mov     x9, #72340172838076673
        mul     x8, x8, x9
        stur    x8, [x0, #22]
        stp     x8, x8, [x0, #8]
        str     x8, [x0]
        ret
memsetUneven(void*, void*):                   // @memsetUneven(void*, void*)
        movi    v0.2d, #0000000000000000
        stur    q0, [x0, #30]
        stp     q0, q0, [x0]
        ret
MemcpyInt32x16(void*, void*):                 // @MemcpyInt32x16(void*, void*)
        ldp     q1, q0, [x1, #32]
        ldp     q3, q2, [x1]
        stp     q1, q0, [x0, #32]
        stp     q3, q2, [x0]
        ret

@TamarChristinaArm
Contributor

As an additional side note, the zeroing loop currently emitted by genZeroInitFrame is very tight. I'd recommend against emitting that and instead unrolling the loop a couple of iterations; see section 4.4 of the optimization guide https://developer.arm.com/documentation/swog309707/a . If you do end up emitting such tight loops, it's best to ensure the loop entry point is aligned to a 32-byte boundary (see section 4.8). This will minimize the number of fetches that have to be done to execute the loop.
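
Purely as an illustrative sketch of that advice (not an actual JIT sequence; here x0 is the address to zero and x1 a byte count assumed to be a non-zero multiple of 64):

        movi    v16.16b, #0
        .p2align 5                              // pad so the loop entry sits on a 32-byte boundary
zero_loop:
        stp     q16, q16, [x0], #32             // unrolled 2x: 64 bytes of zeros per iteration
        stp     q16, q16, [x0], #32
        subs    x1, x1, #64
        b.ne    zero_loop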

@echesakov
Contributor Author

@TamarChristinaArm Thank you for all the pointers! Let me read through the documentation you mentioned, and I will come back with a proposed algorithm before implementing.

@kunalspathak
Member

If you do end up emitting such tight loops it's best to ensure the loop entry point is aligned to a 32byte boundary, see section 4.8.

Thanks for pointing that out @TamarChristinaArm . As part of #43227, we are also working on aligning loop bodies to 32B boundary.

@TamarChristinaArm
Contributor

We can also consider increasing the upper boundary (i.e. 10 machine words) to some larger number.

@echesakovMSFT coming back to this, particularly for zeroing you have an additional option:

On AArch64 you have the data cache zero instruction DC ZVA (it can be found under System registers, not Instructions).

Essentially, using this you can clear large blocks of memory by VA, but it can be disabled, so support for it should be detected at runtime. This can be done by reading the DCZID_EL0 register, which will tell you whether the instruction is enabled and the size of the block it's configured to clear.

The expectation is that most systems will be configured with a usable amount, such as 64 bytes, in one go.
Once you know this you can use dc zva, <addr-reg> to do the clear, which will allow you to inline clearing of much larger blocks before needing a function call.
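
For reference, a minimal sketch (not the sequence from any particular PR) of how the block size could be queried at runtime, based on the DCZID_EL0 layout described above:

        mrs     x0, dczid_el0                   // readable at EL0
        tbnz    x0, #4, no_dc_zva               // bit 4 (DZP) set means DC ZVA is prohibited
        and     x0, x0, #0xf                    // bits [3:0] = log2(block size in 4-byte words)
        mov     x1, #4
        lsl     x0, x1, x0                      // block size in bytes = 4 << BS, e.g. 64
no_dc_zva: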

@echesakov
Contributor Author

On AArch64 you have the data cache zero instruction DC ZVA ... Once you know this you can use dc zva, <addr-reg> to do the clear, which will allow you to inline clearing of much larger blocks before needing a function call.

Thank you for following up, Tamar! I've heard about this instruction but, for some reason, I was under the (wrong) impression that it required EL1 or higher. Is there any alignment requirement for the address in <addr-reg>? My copy of the reference manual doesn't say so.

@TamarChristinaArm
Contributor

Is there any alignment requirement for the address in <addr-reg>? My copy of the reference manual doesn't say so.

There is, but it's a bit sneakily described. The operation works on an entire cache line. If the address is not aligned to the cache line it will silently align it (ignore the lower bits) and clear the entire cache line.

Which means it will clear data you didn't intend to. So there is an alignment constraint for what you probably want, but not for the actual use of the instruction (as in, you won't get an alignment fault).

This requirement makes it impossible to use in static compilers (though we do use it in AoR's memset https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S#L79) but for JITs it may still be useful.
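
To make the alignment handling concrete, here is a hedged sketch of how a JIT could zero x1 bytes starting at x0 with DC ZVA while avoiding the silent rounding described above (assumes a 64-byte ZVA block and x1 >= 64; register choice is arbitrary):

        movi    v16.16b, #0
        stp     q16, q16, [x0]                  // cover the possibly unaligned head with plain stores
        stp     q16, q16, [x0, #32]
        add     x2, x0, x1                      // x2 = end of the region
        add     x3, x0, #63
        and     x3, x3, #0xffffffffffffffc0     // x3 = first 64-byte-aligned address >= x0
        sub     x4, x2, #64                     // last address where a whole ZVA block still fits
        cmp     x3, x4
        b.hi    tail                            // region too small for any full aligned block
zva_loop:
        dc      zva, x3                         // zero one aligned 64-byte block
        add     x3, x3, #64
        cmp     x3, x4
        b.ls    zva_loop
tail:
        stp     q16, q16, [x2, #-64]            // cover the possibly unaligned tail
        stp     q16, q16, [x2, #-32]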

@echesakov echesakov linked a pull request Jan 6, 2021 that will close this issue
@TamarChristinaArm
Contributor

TamarChristinaArm commented Jan 25, 2021

There is, but it's a bit sneakily described. The operation works on an entire cache line. If the address is not aligned to the cache line it will silently align it (ignore the lower bits) and clear the entire cache line.

Small correction here: the ZVA works on a ZVA region, which can be set independently of the cache line size. In practice, on all current Arm-designed cores the region is the same as the cache line size, but they don't need to be. So you have to be aligned to the ZVA region, not the cache line, so the code you wrote in #46609 is still correct, but I wanted to clarify the statement above :)

@echesakov echesakov added the in-pr (There is an active PR which will close this issue when it is merged) label Jan 25, 2021
@echesakov
Contributor Author

Small correction here: the ZVA works on a ZVA region, which can be set independently of the cache line size. In practice, on all current Arm-designed cores the region is the same as the cache line size, but they don't need to be. So you have to be aligned to the ZVA region, not the cache line, so the code you wrote in #46609 is still correct, but I wanted to clarify the statement above :)

This was my understanding as well - that the instruction block size reported by DCZID_EL0<3..0> could be different from the cache line size. Thank you for confirming this!

As you probably noticed in #46609, I assumed that 64 bytes is the most common choice of DC ZVA block size, and the JIT falls back to the "stp q-reg in a loop" implementation when that's not the case. Do you think we might want to extend the implementation in the future for other block sizes?

@TamarChristinaArm
Contributor

Do you think we might want to extend the implementation in the future for other block sizes?

Possibly, but as you noted in the PR, the problem with a larger ZVA region is that as it grows, your alignment and remainder overhead grows, which also pushes your profitability threshold for ZVA usage higher. And as that goes up, you regress your smaller memsets (vs a smaller ZVA region size).

So this will always be a balancing act. I don't have the data to back it up, but my personal opinion is that in the average consumer/server workload you'll find the smaller sets more often than the larger ones (the only exception I know of is the HPC market, but that's pretty specialized). So support for this wouldn't really be a priority in the near future, in my opinion.

@TamarChristinaArm
Contributor

@echesakovMSFT btw, this issue ended up focusing only on zeroing, but the approach outlined in #43789 (comment) should help with Array.Fill and Span.Fill.

Is it worth splitting that out?

@echesakov
Contributor Author

@TamarChristinaArm Agreed, I will try implementing the approach after I finish the stack probing work on arm64.

@echesakov
Contributor Author

btw, this issue ended up focusing only on zeroing, but the approach outlined in #43789 (comment) should help with Array.Fill and Span.Fill.

Is it worth splitting that out?

Hi @TamarChristinaArm, I've given your idea more thought recently. In Span.Fill the length of the span is unknown at JIT compilation time, so what the JIT ends up generating is a STORE_DYN_BLK node that is lowered to a call to CORINFO_HELP_MEMSET. So unless I misunderstood your suggestion, I am not sure how we can generate the sequence as in #43789 (comment) without knowing the length in advance.

However, I think we can optimize CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node) and CodeGen::genCodeForCpBlkUnroll(GenTreeBlk* node) to use stp (SIMD) instead of stp (GpReg).
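
For a fixed 64-byte init block, the difference would be roughly the following (an illustrative sketch, not the JIT's current or proposed output):

        // GPR pairs: 16 bytes of zeros per stp
        stp     xzr, xzr, [x0]
        stp     xzr, xzr, [x0, #16]
        stp     xzr, xzr, [x0, #32]
        stp     xzr, xzr, [x0, #48]

        // SIMD pairs: 32 bytes of zeros per stp
        movi    v16.16b, #0
        stp     q16, q16, [x0]
        stp     q16, q16, [x0, #32]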

cc @sandreenko

@TamarChristinaArm
Contributor

Hi @TamarChristinaArm, I've given your idea more thought recently. In Span.Fill the length of the span is unknown at JIT compilation time, so what the JIT ends up generating is a STORE_DYN_BLK node that is lowered to a call to CORINFO_HELP_MEMSET. So unless I misunderstood your suggestion, I am not sure how we can generate the sequence as in #43789 (comment) without knowing the length in advance.

I see.. If you never know the exact size you can still do somewhat better than a scalar loop though. You can still emit a loop that uses STP of Q registers to set 64 bytes at a time for some cases. Looking at https://docs.microsoft.com/en-us/dotnet/api/system.span-1.fill?view=netcore-2.2 since T must be a value type you can create a vector allowing you to store 128/sizeof(T) Ts at the same time.
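
For example, a rough sketch of such a loop for 4-byte elements (illustrative only; here w2 holds the fill value, x0 the destination, and x1 a byte count assumed to be a non-zero multiple of 64):

        dup     v16.4s, w2                      // replicate the 32-bit element across a Q register
fill_loop:
        stp     q16, q16, [x0], #32             // 64 bytes stored per iteration
        stp     q16, q16, [x0], #32
        subs    x1, x1, #64
        b.ne    fill_loop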

So does CORINFO_HELP_MEMSET already use SIMD registers for the set?

Hmmm Looking at it https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.asm it looks like memset and memcpy can do much better. See https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memset.S , it should probably also special case MEMSET of 0.

Since these are inline assembly, is there any reason you can't just use the AoR implementations? Those would be most optimal and save you a lot of work in this case :)

@TamarChristinaArm
Contributor

@echesakovMSFT Also I'm wondering, unless I misread the doc, doesn't Span.Fill also work for non-byte-sized element fills? What does it do in those cases? CORINFO_HELP_MEMSET looks like a standard byte-oriented memset?

@echesakov
Contributor Author

echesakov commented Mar 4, 2021

I see.. If you never know the exact size you can still do somewhat better than a scalar loop though. You can still emit a loop that uses STP of Q registers to set 64 bytes at a time for some cases. Looking at https://docs.microsoft.com/en-us/dotnet/api/system.span-1.fill?view=netcore-2.2 since T must be a value type you can create a vector allowing you to store 128/sizeof(T) Ts at the same time.

@TamarChristinaArm I believe there is an opportunity to optimize the non-byte-sized element version of Span.Fill here

for (; i < (length & ~(nuint)7); i += 8)

using such an approach. Presumably, this would mean that on Arm64 we would use SIMD stp for writing to memory.
I wonder if any special handling would be needed to support both little-endian and big-endian platforms.

So does CORINFO_HELP_MEMSET already use SIMD registers for the set?

Hmmm Looking at it https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.asm it looks like memset and memcpy can do much better. See https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memset.S , it should probably also special case MEMSET of 0.

Since these are inline assembly, is there any reason you can't just use the AoR implementations? Those would be most optimal and save you a lot of work in this case :)

I don't know if there is any reason that could prevent us from using the AoR implementation.

Note that for Linux, https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.S

we call memset in a C library, so presumably these ones are using SIMD operations.

Also I'm wondering, unless I misread the doc, doesn't Span.Fill also work for non-byte-sized element fills? What does it do in those cases? CORINFO_HELP_MEMSET looks like a standard byte-oriented memset?

In the non-byte-sized element case,

for (; i < (length & ~(nuint)7); i += 8)

we use unrolled loops.

For the byte-sized element case, we call Unsafe.InitBlockUnaligned, which is an intrinsic and is replaced by an IL stub in

else if (tk == CoreLibBinder::GetMethod(METHOD__UNSAFE__BYREF_INIT_BLOCK_UNALIGNED)->GetMemberDef())
{
    static const BYTE ilcode[] = { CEE_LDARG_0, CEE_LDARG_1, CEE_LDARG_2, CEE_PREFIX1, (CEE_UNALIGNED & 0xFF), 0x01, CEE_PREFIX1, (CEE_INITBLK & 0xFF), CEE_RET };
    methInfo->ILCode = const_cast<BYTE*>(ilcode);
    methInfo->ILCodeSize = sizeof(ilcode);
    methInfo->maxStack = 3;
    methInfo->EHcount = 0;
    methInfo->options = (CorInfoOptions)0;
    return true;
}

The stub internally uses initblk, which genCodeForInitBlkHelper compiles into a call to CORINFO_HELP_MEMSET.

@TamarChristinaArm
Contributor

Thanks for the detailed response @echesakovMSFT !

I see.. If you never know the exact size you can still do somewhat better than a scalar loop though. You can still emit a loop that uses STP of Q registers to set 64 bytes at a time for some cases. Looking at https://docs.microsoft.com/en-us/dotnet/api/system.span-1.fill?view=netcore-2.2 since T must be a value type you can create a vector allowing you to store 128/sizeof(T) Ts at the same time.

@TamarChristinaArm I believe there is an opportunity to optimize the non-byte-sized element version of Span.Fill here

for (; i < (length & ~(nuint)7); i += 8)

using such an approach. Presumably, this would mean that on Arm64 we would use SIMD stp for writing to memory.

Yes, but see below.

I wonder if any special handling would be needed to support both little-endian and big-endian platforms.

The lanes themselves are in array order, so the dup of the right size should handle it. Whether the constant itself needs any handling is a good question, but the rest of the runtime should have ensured the value in a single register is already correct.

So does CORINFO_HELP_MEMSET already use SIMD registers for the set?
Hmmm Looking at it https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.asm it looks like memset and memcpy can do much better. See https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memset.S , it should probably also special case MEMSET of 0.
Since these are inline assembly, is there any reason you can't just use the AoR implementations? Those would be most optimal and save you a lot of work in this case :)

I don't know if there is any reason that could prevent us from using the AoR implementation.

Note that for Linux, https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/coreclr/vm/arm64/crthelpers.S

we call memset in a C library, so presumably these ones are using SIMD operations.

Yes, although you need a sufficiently new glibc to take advantage of the optimized routines. That said, it's probably still a good idea to do this, as glibc can ifunc to optimized implementations for various different uarches, at the expense, of course, of older glibcs having a slower implementation. But time should fix that problem, I suppose :)

That said, the AoR guys had the idea that you can use the current memset in AoR for the non-byte case as well by creating a new entry point here https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/memset.S#L29 just below the dup.

To call it you need to do a couple of things:

  1. fill q0 with the dup of the value you're setting
  2. ensure the pointer is aligned to the element size in bytes, i.e. if setting a short, align the pointer to 2 bytes.
  3. ensure the size in bytes being set is aligned to the element size in bytes.
  4. set x1 to 1 or 0 depending on whether the value being set is 0 or not. (This allows you to get ZVA for e.g. setting 0 as a short.)

With these conditions you can remove the unrolled loop in the CLR and get the optimized memset for everything from short to long long.
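
A hypothetical sketch of what calling such an entry point could look like for 2-byte elements (the name __memset_from_dup and the exact meaning of x1 are assumptions for illustration, not part of AoR today; x0 = destination, x2 = size in bytes, w3 = element value, with the pointer and size already element-aligned):

        dup     v0.8h, w3                       // 1. replicate the 16-bit element across q0
        cmp     w3, #0
        cset    x1, eq                          // 4. flag whether the value is zero (exact polarity
                                                //    would be defined by the new entry point)
        bl      __memset_from_dup               // hypothetical entry point just below the dup in AoR memset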

@echesakov
Contributor Author

echesakov commented Mar 15, 2021

Thank you for the follow-up @TamarChristinaArm!
I will take a look at your suggestion and also at how we can incorporate the AoR implementation into win-arm64 coreclr.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 14, 2021