
[Arm64] Use SIMD register to zero init frame #46609

Merged
14 commits merged into dotnet:master on Jan 26, 2021

Conversation

@echesakov (Contributor) commented Jan 6, 2021

Below are the results of running benchmarks as in #32538

  • baseline
  • diff where DC ZVA instruction is prohibited
  • diff where DC ZVA instruction is permitted for memory regions greater than or equal to 256 bytes

[benchmark charts: windows-10, ubuntu-18-04]

@echesakov echesakov added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jan 6, 2021
@echesakov echesakov added this to the 6.0.0 milestone Jan 6, 2021
@echesakov echesakov self-assigned this Jan 6, 2021
@echesakov echesakov linked an issue Jan 6, 2021 that may be closed by this pull request
@echesakov echesakov mentioned this pull request Jan 12, 2021
@echesakov (Contributor Author)

@TamarChristinaArm I was surprised to find that on Windows I see almost no performance improvement from the change. However, you can notice that the distribution becomes unimodal, as opposed to the baseline, where two trends are visible.

Would you say that the results look reasonable? Do you have any suggestions for what else I could try to improve here, other than using the dc zva instruction (which I decided to do as a separate step)?

I figured I don't need to employ all the optimizations as in https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S since the region size is a multiple of 4 (actually, I think it is a multiple of 8, but I need to double-check this), while memset has to handle all possible sizes.
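
For illustration only (not the JIT's emitted code): a C-intrinsic sketch of the stp q-reg zeroing discussed here, assuming for simplicity that the region size is a multiple of 32 bytes (a stronger assumption than the multiple-of-4/8 guarantee above):

```cpp
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Zero a region with pairs of 128-bit stores; a compiler will typically
// fuse the two adjacent vst1q_u8 stores into a single "stp q, q" pair.
// Assumes size % 32 == 0 -- a real implementation handles the tail.
static void ZeroRegionSimd(uint8_t* p, size_t size)
{
    const uint8x16_t zero = vdupq_n_u8(0); // 128-bit register of zeroes
    for (size_t i = 0; i < size; i += 32)
    {
        vst1q_u8(p + i, zero);      // first 16 bytes of the pair
        vst1q_u8(p + i + 16, zero); // second 16 bytes of the pair
    }
}
```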

@TamarChristinaArm (Contributor)

> @TamarChristinaArm I was surprised to find that on Windows I see almost no performance improvement from the change. However, you can notice that the distribution becomes unimodal, as opposed to the baseline, where two trends are visible.

Correct me if I'm wrong, but the SQ2 is a Cortex-A76-based big.LITTLE core, right? In which case the bimodal behavior is most likely because the core didn't enter write streaming mode [1] in all cases. The larger stores of the stp of Q-regs are most likely tickling the heuristic more consistently.

> Would you say that the results look reasonable? Do you have any suggestions for what else I could try to improve here, other than using the dc zva instruction (which I decided to do as a separate step)?

No, the only other thing is one you already have on your roadmap: loop alignment. 16 bytes should be enough for most cases.
The ZVA should help as well; as [1] explains, it will enter streaming mode more readily.

[1] https://developer.arm.com/documentation/100798/0300/functional-description/level-1-memory-system/cache-behavior/write-streaming-mode

@TamarChristinaArm (Contributor)

> I figured I don't need to employ all the optimizations as in https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S since the region size is a multiple of 4 (actually, I think it is a multiple of 8, but I need to double-check this), while memset has to handle all possible sizes.

Yeah, if the memsets are of large regions on average, then the small-size optimizations aren't really worth doing for now. For AoR they made sense, as benchmark suites such as SPECCPU mostly use smaller memsets.

@echesakov (Contributor Author) commented Jan 16, 2021

> Correct me if I'm wrong, but the SQ2 is a Cortex-A76-based big.LITTLE core, right? In which case the bimodal behavior is most likely because the core didn't enter write streaming mode [1] in all cases. The larger stores of the stp of Q-regs are most likely tickling the heuristic more consistently.

I don't know how to tell what microarchitecture it's based on. Is there any Windows tool that can do this?

The Wikipedia article says it has 4+4 cores, which suggests it is big.LITTLE.

> The ZVA should help as well; as [1] explains, it will enter streaming mode more readily.

> [1] https://developer.arm.com/documentation/100798/0300/functional-description/level-1-memory-system/cache-behavior/write-streaming-mode

Thanks for the link! I am working on supporting dc zva in the JIT and VM, and will be back soon with new benchmark results.

@echesakov (Contributor Author)

Updated the first post with new results. I found experimentally that it becomes profitable to use dc zva inside a loop for memory regions >= 256 bytes. Note that I haven't used the instruction outside of a loop; I am not sure the benefits would be worth the complexity.
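
For illustration, a minimal sketch of such a dc zva loop (GNU inline asm, not the JIT's emitted sequence), assuming a 64-byte block size reported by DCZID_EL0, dc zva permitted at EL0, a 64-byte-aligned start, and a size that is a multiple of 64:

```cpp
#include <stdint.h>

// Assumes: DCZID_EL0 reports a 64-byte block size, DC ZVA is permitted
// at EL0, p is 64-byte aligned, and size is a multiple of 64.
static void ZeroRegionDcZva(uint8_t* p, uint64_t size)
{
    for (uint8_t* q = p; q < p + size; q += 64)
    {
        // Zero one 64-byte block without reading it into the cache first.
        __asm__ volatile("dc zva, %0" : : "r"(q) : "memory");
    }
}
```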

@TamarChristinaArm I have a question about using this instruction on big.LITTLE systems with different cache line sizes on the big and little cores. Is DCZID_EL0 guaranteed to contain the same value when read on a big core and on a little core? Would it be safe to check the register only once and cache its value as it is done here?

Would you suggest using this instruction more broadly, for example in other places where memory needs to be zeroed (e.g. the GC heap)? If so, I can open an issue to track this work.

@TamarChristinaArm (Contributor)

Nice speedup! Clearly a gain on both platforms, good work :)

> I don't know how to tell what microarchitecture it's based on. Is there any Windows tool that can do this?

Hmm, no, not that I'm aware of. You could get the part number, but unless specific support for it has been added to a public compiler, we wouldn't be able to tell what it means.
Not really "official", but [1] seems to indicate it's indeed a Cortex-A55 and Cortex-A76 big.LITTLE pair.

[1] https://www.cpu-monkey.com/en/cpu-qualcomm_snapdragon_microsoft_sq2-1850

> Updated the first post with new results. I found experimentally that it becomes profitable to use dc zva inside a loop for memory regions >= 256 bytes. Note that I haven't used the instruction outside of a loop; I am not sure the benefits would be worth the complexity.

Hmm, do you actually see a performance degradation below 256 bytes? If not, always using ZVA when it's available and the aligned region is >= 64 bytes would simplify your logic a bit, wouldn't it?

> @TamarChristinaArm I have a question about using this instruction on big.LITTLE systems with different cache line sizes on the big and little cores. Is DCZID_EL0 guaranteed to contain the same value when read on a big core and on a little core? Would it be safe to check the register only once and cache its value as it is done here?

So there are a few parts to this question. It's theoretically not enforced anywhere that the little and big cores have the same cache line size. Practically speaking, however, almost all systems do. Those that don't can never use ZVA, because you don't know when the kernel will migrate you: you could be migrated after you read the register but before you do the clear.

It's a fairly dangerous setup when the cores don't match. To deal with this, the Linux kernel virtualizes DCZID_EL0 and turns it off when the cache line sizes don't match. I don't know what the Windows kernel does here, but I assume it must do something similar for correctness.

> Would it be safe to check the register only once and cache its value as it is done here?

So on Linux/Android the answer is yes [2] (no code in the link, just the cover letter, so it should be safe to read). On Windows I can only assume yes, as I don't know the kernel's behavior there. But if it doesn't do something like the above, that would be a bug.

[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-September/453859.html

> Would you suggest using this instruction more broadly, for example in other places where memory needs to be zeroed (e.g. the GC heap)? If so, I can open an issue to track this work.

Yeah, I think it would be beneficial anywhere you need a memset. You might want to consider linking against AoR on Windows as well for the native code, if that does a lot of memsets. On Linux you would get these for free through new glibc versions.

@echesakov (Contributor Author)

> Hmm, do you actually see a performance degradation below 256 bytes? If not, always using ZVA when it's available and the aligned region is >= 64 bytes would simplify your logic a bit, wouldn't it?

Yes. First, I tried unconditionally allowing DC ZVA for regions >= 192 bytes. But given that the JIT still needs to emit 4 stp q-reg instructions to zero the first and last 64 bytes of such a region, their overhead outweighed the performance benefit of the instruction. Then I did two experiments: the JIT would emit a loop with DC ZVA only when the loop would run for at least 2 or 3 iterations, i.e. DC ZVA was allowed for regions >= 256 or >= 320 bytes, respectively. I compared them and found the former to be slightly more performant.

[benchmark chart: ubuntu-18-04-256-vs-320]

I am not sure I understand how I could simplify the logic. As far as I know, the JIT can't tell whether a memory region being initialized is aligned to a 64-byte boundary. The best we can do is assume that the frame pointer is aligned to a 16-byte boundary and extend this assumption to the region when untrLclLo/untrLclHi are multiples of 16. The JIT would still need to manually initialize the first and last 64 bytes with stp q-reg, right?

As for R2R scenarios: if we don't know whether DC ZVA will be supported and properly configured on the target machine (i.e. its block size is 64 bytes), then the JIT in an AOT scenario emits a loop with a pair of stp q-reg. Hence the JIT still needs to support two different ways of zeroing the frame, with and without DC ZVA.
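
To summarize the above as a sketch (the thresholds come from this thread; the type and function names are illustrative, not actual JIT code):

```cpp
enum class ZeroInitStrategy
{
    StraightLineStores, // unrolled stp q-reg stores, no loop
    SimdStoreLoop,      // loop of stp q-reg pairs (also the R2R fallback)
    DcZvaLoop           // stp q-reg head/tail + DC ZVA loop in the middle
};

static ZeroInitStrategy ChooseStrategy(unsigned bytes, bool dcZvaUsable)
{
    if (bytes < 192)                 // below the loop threshold
        return ZeroInitStrategy::StraightLineStores;
    if (dcZvaUsable && bytes >= 256) // DC ZVA loop runs >= 2 iterations
        return ZeroInitStrategy::DcZvaLoop;
    return ZeroInitStrategy::SimdStoreLoop;
}
```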

@echesakov (Contributor Author)

> So there are a few parts to this question. It's theoretically not enforced anywhere that the little and big cores have the same cache line size. Practically speaking, however, almost all systems do. Those that don't can never use ZVA, because you don't know when the kernel will migrate you: you could be migrated after you read the register but before you do the clear.

> It's a fairly dangerous setup when the cores don't match. To deal with this, the Linux kernel virtualizes DCZID_EL0 and turns it off when the cache line sizes don't match. I don't know what the Windows kernel does here, but I assume it must do something similar for correctness.

Thank you for the reply, Tamar! If that is the case, then I guess DCZID_EL0<4> would signal that DC ZVA is prohibited?

@echesakov echesakov marked this pull request as ready for review January 21, 2021 19:48
@echesakov (Contributor Author)

@dotnet/jit-contrib This is ready for review, PTAL.

@BruceForstall (Member) left a comment


LGTM. I wonder if the loop versions, especially the larger loop versions, should be activated for smaller sizes under a JIT stress mode.

We're not going to see the DC ZVA instruction except for native hosted JIT scenarios (or maybe SPMI playback of arm64 native PMI/test run collections). Should there be a way to generate this for altjit scenarios?

InstructionSet_Rdm_Arm64=17,
InstructionSet_Sha1_Arm64=18,
InstructionSet_Sha256_Arm64=19,
InstructionSet_Dczva=12,
Member:

You need to change the JIT/EE version GUID.

Contributor Author:

Done

@TamarChristinaArm (Contributor)

> Yes. First, I tried unconditionally allowing DC ZVA for regions >= 192 bytes. But given that the JIT still needs to emit 4 stp q-reg instructions to zero the first and last 64 bytes of such a region, their overhead outweighed the performance benefit of the instruction. Then I did two experiments: the JIT would emit a loop with DC ZVA only when the loop would run for at least 2 or 3 iterations, i.e. DC ZVA was allowed for regions >= 256 or >= 320 bytes, respectively. I compared them and found the former to be slightly more performant.

Yeah, you can squeeze it down to 160 by using overlapping stores (which is what AoR gets away with), but I have been told that older cores do not like this very much. So I think your approach is correct (and it seems to match the threshold at which the glibc ifunc with ZVA enabled is used), as I think you are tuning for as broad a range of cores as possible.
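
For reference, a sketch of the overlapping-stores trick using NEON intrinsics (AoR does this in hand-written assembly; here assuming a region of 32 to 64 bytes):

```cpp
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Cover a 32..64-byte buffer with two 32-byte store pairs anchored at the
// start and at the end; the pairs may overlap in the middle, so no loop
// or size-dependent branch is needed.
static void ZeroSmallOverlapping(uint8_t* p, size_t n) // 32 <= n <= 64
{
    const uint8x16_t zero = vdupq_n_u8(0);
    vst1q_u8(p, zero);          // bytes [0, 16)
    vst1q_u8(p + 16, zero);     // bytes [16, 32)
    vst1q_u8(p + n - 32, zero); // bytes [n-32, n-16), may overlap the above
    vst1q_u8(p + n - 16, zero); // bytes [n-16, n)
}
```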

> I am not sure I understand how I could simplify the logic. As far as I know, the JIT can't tell whether a memory region being initialized is aligned to a 64-byte boundary.

No, I think the confusion came from my initial look, where I misread the remainder check. It does look as optimal as possible without inadvertently slowing down older cores.

> Thank you for the reply, Tamar! If that is the case, then I guess DCZID_EL0<4> would signal that DC ZVA is prohibited?

Yeah, that bit would then be set.
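
To make the bit layout concrete, a sketch of reading and caching DCZID_EL0 (illustrative GNU inline asm; bit 4 (DZP) set means DC ZVA is prohibited, and bits [3:0] (BS) encode log2 of the block size in 4-byte words):

```cpp
#include <stddef.h>
#include <stdint.h>

// Returns the DC ZVA block size in bytes, or 0 if DC ZVA is prohibited.
// Safe to read once and cache on kernels that virtualize DCZID_EL0 for
// mismatched big.LITTLE systems, as described above for Linux.
static size_t DcZvaBlockSize()
{
    uint64_t dczid;
    __asm__("mrs %0, dczid_el0" : "=r"(dczid)); // EL0-readable ID register
    if (dczid & (UINT64_C(1) << 4))             // DZP: DC ZVA prohibited
        return 0;
    return size_t(4) << (dczid & 0xF);          // BS: log2(words), 4 B/word
}
```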

@BruceForstall (Member)

@janvorli maybe you could double-check the VM side changes. And, fyi about the zeroing algorithm change.

@echesakov echesakov force-pushed the Arm64-Use-SIMDReg-ZeroInitFrame branch from 98452f6 to 2352f16 on January 26, 2021
@echesakov (Contributor Author)

I had to rebase on top of master in order to update jiteeversionguid.h.

@echesakov (Contributor Author)

cc @jeffschwMSFT @richlander @jkotas who might be interested in this work

@kunalspathak (Member) left a comment

Minor feedback and a question.

// 3) if simdRegZeroed is true then 128-bit wide zeroSimdReg contains zeroes.

const int bytesUseZeroingLoop = 192;

Member:

Could you use #define for the numeric constants like 192, 256, etc.?

Contributor Author:

I could, but why would that be preferable to using const int? I doubt that having these numbers outside of the algorithm's implementation (e.g. in target.h) would be more useful.

// add x9, fp, #(untrLclLo-32)
// mov x10, #(bytesToWrite-64)
//
// loop:
Member:

Did you verify whether aligning the loop gives any benefit, as @TamarChristinaArm pointed out?

Contributor Author:

I did two experiments, aligning the head of the loop on 16-byte and 32-byte boundaries; neither demonstrated a measurable performance difference, so I decided not to implement the alignment.

@echesakov echesakov merged commit c1d8656 into dotnet:master Jan 26, 2021
@echesakov echesakov deleted the Arm64-Use-SIMDReg-ZeroInitFrame branch January 26, 2021 19:51
@AndyAyersMS (Member)

Nice work, @echesakovMSFT. We should be looking at microbenchmark perf early next week. Will make sure to post back here if we see anything there from your change.

A couple of things for follow-up:

  • Is it worth considering this for large internal initobj/initblk too? And the zeroing part of a zeroing stackalloc?
  • @TamarChristinaArm are the writes done by dc zva guaranteed to be atomic at some granularity?

@TamarChristinaArm (Contributor)

> @TamarChristinaArm are the writes done by dc zva guaranteed to be atomic at some granularity?

@AndyAyersMS No. Architecturally, nothing guarantees that DC ZVA writes are atomic, nor anything about the granularity of the final writes. Practically speaking, the behavior of ZVA is quite core-dependent, and how each core handles it can differ quite a bit. The Cortex-A76 having a write streaming mode, for instance, makes it handle ZVA quite differently from older cores.

@echesakov (Contributor Author)

> Is it worth considering this for large internal initobj/initblk too? And the zeroing part of a zeroing stackalloc?

@AndyAyersMS Yes, I was planning to look into other places where this instruction could be used to give us some performance improvement. In addition, we could improve the JIT_MemSet implementation on windows-arm64 to use SIMD registers in the same fashion as AoR does for larger memory regions.

@ghost ghost locked as resolved and limited conversation to collaborators Feb 28, 2021

Successfully merging this pull request may close these issues.

[Arm64] Use stp and str (SIMD) for stack prolog zeroing