
A few optimizations for the gcinfodecoder construction #96150

Merged: 13 commits merged into dotnet:main on Dec 28, 2023

Conversation

Member

@VSadov VSadov commented Dec 18, 2023

Constructing gcinfodecoder performs some initial decoding. While the time spent in initial decoding is typically less than the time spent enumerating live slots, it is not insignificant. There are some opportunities to do initial decoding a bit cheaper.

#ifdef _MSC_VER
DWORD lzcountCeil;
_BitScanReverse(&lzcountCeil, (unsigned long)x);
#else // _MSC_VER
UINT32 lzcountCeil = (UINT32)__builtin_clz((unsigned int)x);
#endif // _MSC_VER
Member

I believe __builtin_clz already encodes the subtract from BITS_PER_SIZE_T within the intrinsic function, unlike _BitScanReverse.

Member Author
@VSadov VSadov Dec 19, 2023

Right, I've just realized that even though lzcnt is encoded similarly to bsr on x64, the results are offsets from opposite ends.
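
For context, a minimal sketch of how the two intrinsics differ (not the PR's actual code; the helper name is hypothetical): _BitScanReverse returns the index of the highest set bit counted from bit 0, while __builtin_clz returns the number of leading zero bits counted from the top, so converting either one to ceil(log2(x)) needs a different adjustment.

#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Hypothetical helper: for x > 1, returns ceil(log2(x)) == 1 + (index of the highest set bit of x - 1).
static uint32_t CeilOfLog2Sketch(uint32_t x)
{
#ifdef _MSC_VER
    unsigned long highBitIndex;
    _BitScanReverse(&highBitIndex, x - 1);                   // bit index, counted from the least significant end
    return (uint32_t)highBitIndex + 1;
#else
    uint32_t leadingZeros = (uint32_t)__builtin_clz(x - 1);  // zero count, from the most significant end
    return 32 - leadingZeros;                                // 31 - leadingZeros is the bit index; +1 gives the ceiling
#endif
}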

Member Author

VSadov commented Dec 19, 2023

To measure the impact I use the following microbenchmark compiled with NativeAOT
(NativeAOT is used to reduce impact/noise from suspension)

The numbers are averaged GC Gen0 pauses in milliseconds. Lower is better.
On x64, Windows 10, AMD 5950X (16 cores, 32 logical processors).

I see ~20% improvement.

==== Before the change:

0.1528027343750001
0.15324121093750026
0.15360449218750016
0.14911621093750022
0.15105566406250034
0.15213964843750036
0.15154882812500012
. . .

==== After the change:

0.12535742187500004
0.12644238281249998
0.12518164062499954
0.12546582031250003
0.12549902343750005
0.12462207031249978
. . .

@@ -275,6 +276,7 @@ class BitStreamReader
m_InitialRelPos = other.m_InitialRelPos;
m_pCurrent = other.m_pCurrent;
m_RelPos = other.m_RelPos;
m_current = other.m_current;
Member Author
@VSadov VSadov Dec 19, 2023

On 64-bit, one native word can act as a "buffer" for quite a few reads when each read takes only a few bits. This change reduces the need for indirect reads from the bit stream and may allow the compiler to enregister the "buffer".

 // NOTE: This routine is perf-critical
 __forceinline size_t ReadOneFast()
 {
     SUPPORTS_DAC;

-    size_t result = (*m_pCurrent) & (((size_t)1) << m_RelPos);
+    size_t result = m_current & 1;
+    m_current >>= 1;
Member Author

The point of this change is to use a fixed-size shift, which is typically faster than a variable-size shift.
The same applies to Read(int numBits) when we read a fixed-size nibble.
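
For illustration, a minimal sketch of the buffered-read pattern being described (member names follow the snippet above; the refill details are an assumption, not necessarily the PR's exact code): the current word is cached in m_current, each read costs a fixed shift by 1, and the bit stream is only dereferenced again once the cached word is exhausted.

// Sketch only.
__forceinline size_t ReadOneFast()
{
    SUPPORTS_DAC;

    size_t result = m_current & 1;    // bottom bit of the cached word
    m_current >>= 1;                  // fixed-size shift; no dependency on a variable shift count

    if (++m_RelPos == BITS_PER_SIZE_T)
    {
        // The cached word is used up; refill from the stream (1 in 64 reads on 64-bit).
        m_pCurrent++;
        m_RelPos = 0;
        m_current = *m_pCurrent;
    }

    return result;
}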

Member

Yes, but you pay for it with extra memory writes. Is it really a win in the end?

For example, here are timings for Intel 11th gen from
https://www.agner.org/optimize/instruction_tables.pdf (page 350):

| SHR SHL SAR r,i | 1 | 1 |
| SHR SHL SAR m,i | 4 | 4 |
| SHR SHL SAR r,cl | 2 | 2 |
| SHR SHL SAR m,cl | 5 | 6 |

Constant shift of memory is twice the cost of variable shift of register.

Member Author
@VSadov VSadov Dec 26, 2023

I will have to recheck the original codegen, but I think what was happening is that we would do an indirect read and then apply a mask that was constructed via a variable shift of 1.
I guess that was because we need the result in a register and do not want to change the bit stream, and the way m_pCurrent and m_RelPos were changing did not allow the compiler to hoist/CSE/enregister either the result of the indirect read or the computed mask.

Member Author
@VSadov VSadov Dec 26, 2023

Interestingly for Zen3 the table gives no difference whatsoever between immediate and CL shift versions.

SHL, SHR, SAR r,i/CL | 1 | 1 | 0.5

yet the profiler finds CL shift operations relatively expensive.
I often see that in other code, so I have always assumed that variable shifts are somewhat costly.

Maybe it is not the instruction itself but the whole sequence of dependent instructions (load 1, load CL, make a mask, do a read, apply the mask), and the sampling profiler just attributes most of that to the shift.

The code does get measurably faster after the change. I had a few other changes that did not improve anything so I did not include them, but this one definitely helped.

C++ inlines and interleaves statement parts quite aggressively, so it is a bit hard to see what belongs where, but I can see that a few "expensive" CL shifts are gone.
Here is an example of what happened to a loop like:

                for(UINT32 i = 0; i < numSlots; i++)
                {
                    if(m_Reader.ReadOneFast())
                        numCouldBeLiveSlots++;
                }

It is not an example of expensive/hot code, just something that is easy to read and makes it easy to see how the codegen changed.

==== original code

00007FF700CE9C4E  test        r13d,r13d  
00007FF700CE9C51  je          GcInfoDecoder::EnumerateLiveSlots+898h (07FF700CE9C98h)  
00007FF700CE9C53  mov         r8d,r13d  
00007FF700CE9C56  nop         word ptr [rax+rax]  

00007FF700CE9C60  mov         ecx,r15d  
00007FF700CE9C63  mov         rax,r11        ;  r11 has `1` in it, assigned outside of the loop
00007FF700CE9C66  shl         rax,cl                   ;  depends on 2 instructions above
00007FF700CE9C69  and         rax,qword ptr [rdx]       ; depends on instruction above + read (L1 is 3-5 cycles)
00007FF700CE9C6C  inc         r15d                           
00007FF700CE9C6F  mov         dword ptr [rdi+18h],r15d      
00007FF700CE9C73  cmp         r15d,40h  
00007FF700CE9C77  jne         GcInfoDecoder::EnumerateLiveSlots+88Bh (07FF700CE9C8Bh)  
00007FF700CE9C79  add         rdx,8                  ; moving to the next word, relatively rare (1 in 64 reads)
00007FF700CE9C7D  mov         dword ptr [rdi+18h],0  
00007FF700CE9C84  mov         qword ptr [rdi+10h],rdx  
00007FF700CE9C88  xor         r15d,r15d  
00007FF700CE9C8B  test        rax,rax  
00007FF700CE9C8E  je          GcInfoDecoder::EnumerateLiveSlots+893h (07FF700CE9C93h)  
00007FF700CE9C90  inc         r12d  
00007FF700CE9C93  sub         r8,r11  

vs.

==== new

00007FF688D6A454  test        r12d,r12d  
00007FF688D6A457  je          GcInfoDecoder::EnumerateLiveSlots+0C04h (07FF688D6A4B4h)  
00007FF688D6A459  mov         rax,r11  
00007FF688D6A45C  mov         edx,r12d  
00007FF688D6A45F  nop  
00007FF688D6A460  mov         rcx,rax  
00007FF688D6A463  inc         r8d               ;  can run together or even before the previous instruction
00007FF688D6A466  shr         rax,1             ;  this and the next can run at the same time
00007FF688D6A469  and         ecx,1             ;  both only depend on "mov         rcx,rax"
00007FF688D6A46C  mov         qword ptr [rbx+20h],rax ; we do writes, but we do not need to wait for them
00007FF688D6A470  mov         dword ptr [rbx+18h],r8d
00007FF688D6A474  mov         qword ptr [rsp+48h],rax ; not sure what is stored here, in other cases we have just 2 writes
00007FF688D6A479  cmp         r8d,40h  
00007FF688D6A47D  jne         GcInfoDecoder::EnumerateLiveSlots+0BEBh (07FF688D6A49Bh)  
00007FF688D6A47F  add         qword ptr [rbx+10h],8     ; moving to the next word, relatively rare (1 in 64 reads)
00007FF688D6A484  mov         rax,qword ptr [rbx+10h]  
00007FF688D6A488  xor         r8d,r8d  
00007FF688D6A48B  mov         rax,qword ptr [rax]  
00007FF688D6A48E  mov         qword ptr [rbx+20h],rax  
00007FF688D6A492  mov         qword ptr [rsp+48h],rax  
00007FF688D6A497  mov         dword ptr [rbx+18h],r8d  
00007FF688D6A49B  test        rcx,rcx  
00007FF688D6A49E  je          GcInfoDecoder::EnumerateLiveSlots+0BF3h (07FF688D6A4A3h)  
00007FF688D6A4A0  inc         r13d  
00007FF688D6A4A3  sub         rdx,1  

bool m_IsInterruptible;
bool m_IsVarArg;
bool m_GenericSecretParamIsMD;
bool m_GenericSecretParamIsMT;
Member Author
@VSadov VSadov Dec 19, 2023

These are derived by masking the header flags. Masking is cheap, and we may not even be asked for these, so we can do the masking in the accessors.
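
A minimal sketch of the accessor pattern being described (the member and flag names here are assumptions, not copied from gcinfodecoder.h): keep the raw header flags and apply the mask on demand instead of caching a bool per flag up front.

// Sketch only.
bool IsInterruptible() const
{
    return (m_headerFlags & GC_INFO_IS_INTERRUPTIBLE) != 0;
}

bool GetIsVarArg() const
{
    return (m_headerFlags & GC_INFO_IS_VARARG) != 0;
}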

// Use flag mask to bail out early if we already decoded all the pieces that caller requested
int remainingFlags = flags == DECODE_EVERYTHING ? ~0 : flags;

if (!slimHeader)
Member Author

Predecoding a fat header could be much more involved compared to slim headers, so let's deal with fat headers in a separate helper.
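
A minimal sketch tying the two points together (the DECODE_* bits, the decoding order, and the PredecodeFatHeader name are assumptions): each decoded piece clears its bit in remainingFlags, decoding stops as soon as nothing the caller asked for remains, and the more involved fat-header predecoding is pushed into a separate helper.

// Sketch only, not the actual decoding order.
int remainingFlags = (flags == DECODE_EVERYTHING) ? ~0 : flags;

if (!slimHeader)
{
    PredecodeFatHeader(remainingFlags);   // hypothetical helper for the fat-header path
    return;
}

// Slim header: decode a piece, clear its bit, and bail out once nothing requested is left.
remainingFlags &= ~DECODE_GS_COOKIE;
if (remainingFlags == 0)
    return;

// ... decode the next piece, clear its bit, test again ...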

Member Author

VSadov commented Dec 19, 2023

/azp run runtime-nativeaot-outerloop


Azure Pipelines successfully started running 1 pipeline(s).

@VSadov VSadov marked this pull request as ready for review December 20, 2023 06:37
Member Author

VSadov commented Dec 20, 2023

I think this is ready for review.

@VSadov VSadov requested a review from jkotas December 25, 2023 04:51
@@ -383,54 +415,81 @@ bool GcInfoDecoder::IsSafePoint(UINT32 codeOffset)
if(m_NumSafePoints == 0)
return false;

#if defined(TARGET_AMD64) || defined(TARGET_ARM) || defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
Member

Removal of this ifdef is changing behavior for Linux x86. I guess that it is a bug fix. I do not see a reason why safepoint handling should be different for Linux x86.

cc @gbalykov

Member Author
@VSadov VSadov Dec 28, 2023

I assumed we can't get here with x86 since we use a different gcinfo encoder.
If that is not the case and we can on Linux, I will put this back.

Overall, I think this -1 hack is trying to solve a problem that does not exist, and it would be better if we could stop doing and undoing this artificial adjustment, but that is not relevant to this PR. I just thought that the #if defined condition is always true here.
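
For context, a sketch of the adjustment under discussion, reconstructed from the diff and the comments rather than copied from the source: on the listed targets the safe point offsets are stored with a -1 adjustment, and IsSafePoint compensates before the lookup, so removing the #if only changes behavior for a target (such as x86) that previously skipped the compensation.

// Sketch only; the exact code around the #if is an assumption.
#if defined(TARGET_AMD64) || defined(TARGET_ARM) || defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64) || defined(TARGET_RISCV64)
    // Safe points are encoded with a -1 adjustment; undo it before searching the table.
    codeOffset--;
#endif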

Member

@jkotas thanks for sharing this. Currently almost all CLR tests fail on Linux x86 on the main branch, so there are other issues with it that need to be investigated first. We'll probably take a look at this change after that.

Member
@jkotas jkotas left a comment

LGTM. Thank you!

Member Author

VSadov commented Dec 28, 2023

It looks like the linux-arm64 Release NativeAOT_Libraries leg is stuck in a loop, repeating the same output:

  Done building Helix work items. Work item count: 9
  Starting Azure Pipelines Test Run net9.0-linux-Release-arm64-NativeAOT_Release-(Ubuntu.2204.Arm64.Open)Ubuntu.2004.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-22.04-helix-arm64v8
  Job e739cf8d-0586-4722-b4a3-61c7ae9aaf97 on (Debian.11.Arm64.Open)Ubuntu.2004.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-arm64v8 is completed with 10 finished work items.
  Using _DebuggerHosts: chrome
  Using Queues: (Ubuntu.2204.Arm64.Open)Ubuntu.2004.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-22.04-helix-arm64v8+(Debian.11.Arm64.Open)Ubuntu.2004.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-arm64v8
  BuildTargetFramework: net9.0
  TestArchiveTestsRoot: /__w/1/s/artifacts/helix/tests/
  TestArchiveRoot: /__w/1/s/artifacts/helix/
  TestArchiveRuntimeRoot: /__w/1/s/artifacts/helix/runtime/
  TestArchiveRuntimeFile: /__w/1/s/artifacts/helix/runtime/test-runtime-net9.0-linux-Release-arm64.zip
  Compressing runtime directory
  Creating directory /__w/1/s/artifacts/helix/runtime/
  Zipping directory "/__w/1/s/artifacts/bin/testhost/net9.0-linux-Release-arm64/" to "/__w/1/s/artifacts/helix/runtime/test-runtime-net9.0-linux-Release-arm64.zip".
  Using Queues: (Ubuntu.2204.Arm64.Open)Ubuntu.2004.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-22.04-helix-arm64v8+(Debian.11.Arm64.Open)Ubuntu.2004.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:debian-11-helix-arm64v8
  Build TargetFramework: net9.0
  Building Helix work items
  Using TestRunNamePrefix: net9.0-linux-Release-arm64-NativeAOT_Release-
  Using HelixCorrelationPayload: /__w/1/s/artifacts/helix/runtime/test-runtime-net9.0-linux-Release-arm64.zip
  Using HelixCommand: ./RunTests.sh --runtime-path "$HELIX_CORRELATION_PAYLOAD"
  Using HelixType: test/functional/cli/innerloop/
  Using WorkItemArchiveWildCard: /__w/1/s/artifacts/helix/tests/**/*.zip
  Using Timeout: 00:45:00
  Done building Helix work items. Work item count: 9

https://dev.azure.com/dnceng-public/public/_build/results?buildId=509866&view=logs&j=74bca3e6-08a5-5cef-0121-fe27fe79cc7b&t=317b4f3c-303e-557e-6c7e-66c36fe29909

It is not happening in my newer test run (#85694), so perhaps it is something that got fixed recently.

@VSadov VSadov merged commit b4ba5da into dotnet:main Dec 28, 2023
109 of 111 checks passed
@VSadov VSadov deleted the dec branch December 28, 2023 21:44
Member Author

VSadov commented Dec 28, 2023

Thanks!!!

@github-actions github-actions bot locked and limited conversation to collaborators Feb 4, 2024
Labels: area-VM-coreclr, tenet-performance (Performance related issue)