
A few optimizations for the gcinfodecoder construction #96150

Merged: 13 commits merged into dotnet:main on Dec 28, 2023

Conversation

Member

@VSadov VSadov commented Dec 18, 2023

Constructing gcinfodecoder performs some initial decoding. While the time spent in initial decoding is typically less than the time spent enumerating live slots, it is not insignificant. There are some opportunities to do initial decoding a bit cheaper.

#ifdef _MSC_VER
DWORD lzcountCeil;
_BitScanReverse(&lzcountCeil, (unsigned long)x);
#else // _MSC_VER
UINT32 lzcountCeil = (UINT32)__builtin_clz((unsigned int)x);
#endif // _MSC_VER
Member

I believe __builtin_clz already encodes the subtract from BITS_PER_SIZE_T within the intrinsic function, unlike _BitScanReverse.

Member Author
@VSadov VSadov Dec 19, 2023

Right, I've just realized that even though lzcnt is encoded similarly to bsr on x64, the results are offsets from opposite ends.
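
For context, a minimal sketch of how the two intrinsics differ (not the PR's actual code; the helper name is hypothetical): _BitScanReverse returns the index of the highest set bit counted from bit 0, while __builtin_clz returns the number of leading zero bits counted from the top, so converting either one to ceil(log2(x)) needs a different adjustment.

#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Hypothetical helper: for x > 1, returns ceil(log2(x)) == 1 + (index of the highest set bit of x - 1).
static uint32_t CeilOfLog2Sketch(uint32_t x)
{
#ifdef _MSC_VER
    unsigned long highBitIndex;
    _BitScanReverse(&highBitIndex, x - 1);                   // bit index, counted from the least significant end
    return (uint32_t)highBitIndex + 1;
#else
    uint32_t leadingZeros = (uint32_t)__builtin_clz(x - 1);  // zero count, from the most significant end
    return 32 - leadingZeros;                                // 31 - leadingZeros is the bit index; +1 gives the ceiling
#endif
}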

Member Author

VSadov commented Dec 19, 2023

To measure the impact I use the following microbenchmark compiled with NativeAOT
(NativeAOT is used to reduce impact/noise from suspension)

The numbers are averaged GC Gen0 pauses in milliseconds. Lower is better.
On x64, Windows 10, AMD 5950X (16 cores, 32 logical processors).

I see ~20% improvement.

==== Before the change:

0.1528027343750001
0.15324121093750026
0.15360449218750016
0.14911621093750022
0.15105566406250034
0.15213964843750036
0.15154882812500012
. . .

==== After the change:

0.12535742187500004
0.12644238281249998
0.12518164062499954
0.12546582031250003
0.12549902343750005
0.12462207031249978
. . .

@@ -275,6 +276,7 @@ class BitStreamReader
m_InitialRelPos = other.m_InitialRelPos;
m_pCurrent = other.m_pCurrent;
m_RelPos = other.m_RelPos;
m_current = other.m_current;
Member Author
@VSadov VSadov Dec 19, 2023

On 64-bit, one native word can act as a "buffer" for quite a few reads when each read takes only a few bits. This change reduces the need for indirect reads from the bit stream and may allow the compiler to enregister the "buffer".

 // NOTE: This routine is perf-critical
 __forceinline size_t ReadOneFast()
 {
     SUPPORTS_DAC;

-    size_t result = (*m_pCurrent) & (((size_t)1) << m_RelPos);
+    size_t result = m_current & 1;
+    m_current >>= 1;
Member Author

The point of this change is to use a fixed-size shift, which is typically faster than a variable-size shift.
The same applies to Read(int numBits) when we read a fixed-size nibble.
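
For illustration, a minimal sketch of the buffered-read pattern being described (member names follow the snippet above; the refill details are an assumption, not necessarily the PR's exact code): the current word is cached in m_current, each read costs a fixed shift by 1, and the bit stream is only dereferenced again once the cached word is exhausted.

// Sketch only.
__forceinline size_t ReadOneFast()
{
    SUPPORTS_DAC;

    size_t result = m_current & 1;    // bottom bit of the cached word
    m_current >>= 1;                  // fixed-size shift; no dependency on a variable shift count

    if (++m_RelPos == BITS_PER_SIZE_T)
    {
        // The cached word is used up; refill from the stream (1 in 64 reads on 64-bit).
        m_pCurrent++;
        m_RelPos = 0;
        m_current = *m_pCurrent;
    }

    return result;
}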

Member

Yes, but you pay for it with extra memory writes. Is it really a win in the end?

For example, here are timings for Intel 11th gen from
https://www.agner.org/optimize/instruction_tables.pdf (page 350):

| SHR SHL SAR r,i | 1 | 1 |
| SHR SHL SAR m,i | 4 | 4 |
| SHR SHL SAR r,cl | 2 | 2 |
| SHR SHL SAR m,cl | 5 | 6 |

Constant shift of memory is twice the cost of variable shift of register.

Member Author
@VSadov VSadov Dec 26, 2023

I will have to recheck the original codegen, but I think what was happening is that we would do an indirect read and then apply a mask that was constructed via a variable shift of 1.
I guess that was because we need the result in a register and do not want to change the bit stream, and the way m_pCurrent and m_RelPos were changing did not allow the compiler to hoist/CSE/enregister either the result of the indirect read or the computed mask.

Member Author
@VSadov VSadov Dec 26, 2023

Interestingly for Zen3 the table gives no difference whatsoever between immediate and CL shift versions.

SHL, SHR, SAR r,i/CL | 1 | 1 | 0.5

yet the profiler finds CL shift operations relatively expensive.
I often see that in other code, so I have always assumed that variable shifts are somewhat costly.

Maybe it is not the instruction itself but the whole sequence of dependent instructions (load 1, load CL, make a mask, do a read, apply the mask), and the sampling profiler just attributes most of that to the shift.

The code does get measurably faster after the change. I had a few other changes that did not improve anything so I did not include them, but this one definitely helped.

C++ inlines and interleaves statement parts quite aggressively, so it is a bit hard to see what belongs where, but I can see that a few "expensive" CL shifts are gone.
Here is an example of what happened to a loop like:

                for(UINT32 i = 0; i < numSlots; i++)
                {
                    if(m_Reader.ReadOneFast())
                        numCouldBeLiveSlots++;
                }

It is not an example of expensive/hot code, just something that is easy to read and makes it easy to see how the codegen changed.

==== original code

00007FF700CE9C4E  test        r13d,r13d  
00007FF700CE9C51  je          GcInfoDecoder::EnumerateLiveSlots+898h (07FF700CE9C98h)  
00007FF700CE9C53  mov         r8d,r13d  
00007FF700CE9C56  nop         word ptr [rax+rax]  

00007FF700CE9C60  mov         ecx,r15d  
00007FF700CE9C63  mov         rax,r11        ;  r11 has `1` in it, assigned outside of the loop
00007FF700CE9C66  shl         rax,cl                   ;  depends on 2 instructions above
00007FF700CE9C69  and         rax,qword ptr [rdx]       ; depends on instruction above + read (L1 is 3-5 cycles)
00007FF700CE9C6C  inc         r15d                           
00007FF700CE9C6F  mov         dword ptr [rdi+18h],r15d      
00007FF700CE9C73  cmp         r15d,40h  
00007FF700CE9C77  jne         GcInfoDecoder::EnumerateLiveSlots+88Bh (07FF700CE9C8Bh)  
00007FF700CE9C79  add         rdx,8                  ; moving to the next word, relatively rare (1 in 64 reads)
00007FF700CE9C7D  mov         dword ptr [rdi+18h],0  
00007FF700CE9C84  mov         qword ptr [rdi+10h],rdx  
00007FF700CE9C88  xor         r15d,r15d  
00007FF700CE9C8B  test        rax,rax  
00007FF700CE9C8E  je          GcInfoDecoder::EnumerateLiveSlots+893h (07FF700CE9C93h)  
00007FF700CE9C90  inc         r12d  
00007FF700CE9C93  sub         r8,r11  

vs.

==== new

00007FF688D6A454  test        r12d,r12d  
00007FF688D6A457  je          GcInfoDecoder::EnumerateLiveSlots+0C04h (07FF688D6A4B4h)  
00007FF688D6A459  mov         rax,r11  
00007FF688D6A45C  mov         edx,r12d  
00007FF688D6A45F  nop  
00007FF688D6A460  mov         rcx,rax  
00007FF688D6A463  inc         r8d               ;  can run together or even before the previous instruction
00007FF688D6A466  shr         rax,1             ;  this and the next can run at the same time
00007FF688D6A469  and         ecx,1             ;  both only depend on "mov         rcx,rax"
00007FF688D6A46C  mov         qword ptr [rbx+20h],rax ; we do writes, but we do not need to wait for them
00007FF688D6A470  mov         dword ptr [rbx+18h],r8d
00007FF688D6A474  mov         qword ptr [rsp+48h],rax ; not sure what is stored here, in other cases we have just 2 writes
00007FF688D6A479  cmp         r8d,40h  
00007FF688D6A47D  jne         GcInfoDecoder::EnumerateLiveSlots+0BEBh (07FF688D6A49Bh)  
00007FF688D6A47F  add         qword ptr [rbx+10h],8     ; moving to the next word, relatively rare (1 in 64 reads)
00007FF688D6A484  mov         rax,qword ptr [rbx+10h]  
00007FF688D6A488  xor         r8d,r8d  
00007FF688D6A48B  mov         rax,qword ptr [rax]  
00007FF688D6A48E  mov         qword ptr [rbx+20h],rax  
00007FF688D6A492  mov         qword ptr [rsp+48h],rax  
00007FF688D6A497  mov         dword ptr [rbx+18h],r8d  
00007FF688D6A49B  test        rcx,rcx  
00007FF688D6A49E  je          GcInfoDecoder::EnumerateLiveSlots+0BF3h (07FF688D6A4A3h)  
00007FF688D6A4A0  inc         r13d  
00007FF688D6A4A3  sub         rdx,1  

bool m_IsInterruptible;
bool m_IsVarArg;
bool m_GenericSecretParamIsMD;
bool m_GenericSecretParamIsMT;
Member Author
@VSadov VSadov Dec 19, 2023

These are derived by masking the header flags. Masking is cheap, and we may not even be asked for these, so we can do the masking in the accessors.
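
A minimal sketch of the accessor pattern being described (the member and flag names here are assumptions, not copied from gcinfodecoder.h): keep the raw header flags and apply the mask on demand instead of caching a bool per flag up front.

// Sketch only.
bool IsInterruptible() const
{
    return (m_headerFlags & GC_INFO_IS_INTERRUPTIBLE) != 0;
}

bool GetIsVarArg() const
{
    return (m_headerFlags & GC_INFO_IS_VARARG) != 0;
}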

// Use flag mask to bail out early if we already decoded all the pieces that caller requested
int remainingFlags = flags == DECODE_EVERYTHING ? ~0 : flags;

if (!slimHeader)
Member Author

Predecoding a fat header could be much more involved compared to slim headers, so let's deal with fat headers in a separate helper.
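
A minimal sketch tying the two points together (the DECODE_* bits, the decoding order, and the PredecodeFatHeader name are assumptions): each decoded piece clears its bit in remainingFlags, decoding stops as soon as nothing the caller asked for remains, and the more involved fat-header predecoding is pushed into a separate helper.

// Sketch only, not the actual decoding order.
int remainingFlags = (flags == DECODE_EVERYTHING) ? ~0 : flags;

if (!slimHeader)
{
    PredecodeFatHeader(remainingFlags);   // hypothetical helper for the fat-header path
    return;
}

// Slim header: decode a piece, clear its bit, and bail out once nothing requested is left.
remainingFlags &= ~DECODE_GS_COOKIE;
if (remainingFlags == 0)
    return;

// ... decode the next piece, clear its bit, test again ...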

Member Author

VSadov commented Dec 19, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@VSadov VSadov marked this pull request as ready for review December 20, 2023 06:37
Member Author

VSadov commented Dec 20, 2023

I think this is ready for review.

@VSadov VSadov requested a review from jkotas December 25, 2023 04:51
@@ -383,54 +415,81 @@ bool GcInfoDecoder::IsSafePoint(UINT32 codeOffset)
if(m_NumSafePoints == 0)
return false;

#if defined(TARGET_AMD64) || defined(TARGET_ARM) || defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
Member

Removal of this ifdef is changing behavior for Linux x86. I guess that it is a bug fix. I do not see a reason why safepoint handling should be different for Linux x86.

cc @gbalykov

Member Author
@VSadov VSadov Dec 28, 2023

I assumed we can't get here with x86 since we use a different gcinfo encoder.
If that is not the case and we can on Linux, I will put this back.

Overall, I think this -1 hack is trying to solve a problem that does not exist, and it would be better if we could stop doing and undoing this artificial adjustment, but that is not relevant to this PR. I just thought that the #if defined condition is always true here.
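
For context, a sketch of the adjustment under discussion, reconstructed from the diff and the comments rather than copied from the source: on the listed targets the safe point offsets are stored with a -1 adjustment, and IsSafePoint compensates before the lookup, so removing the #if only changes behavior for a target (such as x86) that previously skipped the compensation.

// Sketch only; the exact code around the #if is an assumption.
#if defined(TARGET_AMD64) || defined(TARGET_ARM) || defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
    // Safe points are encoded with a -1 adjustment; undo it before searching the table.
    codeOffset--;
#endif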

Member

@jkotas thanks for sharing this. Currently almost all CLR tests fail on Linux x86 on the main branch, so there are other issues with it that need to be investigated first. We'll probably take a look at this change after that.

Member
@jkotas jkotas left a comment

LGTM. Thank you!

Member Author

VSadov commented Dec 28, 2023

It looks like the linux-arm64 Release NativeAOT_Libraries leg is stuck in a loop, repeating the same output:

  Done building Helix work items. Work item count: 9
  Starting Azure Pipelines Test Run net9.0-linux-Release-arm64-NativeAOT_Release-(Ubuntu.2204.Arm64.Open)Ubuntu.2004.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-22.04-helix-arm64v8
  Job e739cf8d-0586-4722-b4a3-61c7ae9aaf97 on (Debian.11.Arm64.Open)Ubuntu.2004.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-arm64v8 is completed with 10 finished work items.
  Using _DebuggerHosts: chrome
  Using Queues: (Ubuntu.2204.Arm64.Open)Ubuntu.2004.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-22.04-helix-arm64v8+(Debian.11.Arm64.Open)Ubuntu.2004.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-arm64v8
  BuildTargetFramework: net9.0
  TestArchiveTestsRoot: /__w/1/s/artifacts/helix/tests/
  TestArchiveRoot: /__w/1/s/artifacts/helix/
  TestArchiveRuntimeRoot: /__w/1/s/artifacts/helix/runtime/
  TestArchiveRuntimeFile: /__w/1/s/artifacts/helix/runtime/test-runtime-net9.0-linux-Release-arm64.zip
  Compressing runtime directory
  Creating directory /__w/1/s/artifacts/helix/runtime/
  Zipping directory "/__w/1/s/artifacts/bin/testhost/net9.0-linux-Release-arm64/" to "/__w/1/s/artifacts/helix/runtime/test-runtime-net9.0-linux-Release-arm64.zip".
  Using Queues: (Ubuntu.2204.Arm64.Open)Ubuntu.2004.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-22.04-helix-arm64v8+(Debian.11.Arm64.Open)Ubuntu.2004.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-arm64v8
  Build TargetFramework: net9.0
  Building Helix work items
  Using TestRunNamePrefix: net9.0-linux-Release-arm64-NativeAOT_Release-
  Using HelixCorrelationPayload: /__w/1/s/artifacts/helix/runtime/test-runtime-net9.0-linux-Release-arm64.zip
  Using HelixCommand: ./RunTests.sh --runtime-path "$HELIX_CORRELATION_PAYLOAD"
  Using HelixType: test/functional/cli/innerloop/
  Using WorkItemArchiveWildCard: /__w/1/s/artifacts/helix/tests/**/*.zip
  Using Timeout: 00:45:00
  Done building Helix work items. Work item count: 9

https://dev.azure.com/dnceng-public/public/_build/results?buildId=509866&view=logs&j=74bca3e6-08a5-5cef-0121-fe27fe79cc7b&t=317b4f3c-303e-557e-6c7e-66c36fe29909

It is not happening in my newer test run (#85694), so perhaps it is something that got fixed recently.

@VSadov VSadov merged commit b4ba5da into dotnet:main Dec 28, 2023
109 of 111 checks passed
@VSadov VSadov deleted the dec branch December 28, 2023 21:44
Member Author

VSadov commented Dec 28, 2023

Thanks!!!

@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2024
Labels: area-VM-coreclr, tenet-performance (Performance related issue)