Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avx512 extract most significant bits #82731

Merged

Conversation

anthonycanino
Copy link
Contributor

PR addresses #80820.

This adds a new type, TYP_MASK which represents Intel opmask k registers on xarch platform. Enough infrastructure was added so that Vector512.ExtractMostSignificantBits could be hardware accelerated on AVX512 capable platforms, which requires the use of the k-mask registers.

A future PR will implement mask registers in full, which depends upon VectorMask API proposal. At the moment, only a single mask register will ever be used, and is used as an internal register for the ExtractMostSignificantBits method. Follow up PRs will extend this to the Vector512 comparison operators, which require a similar approach.

As an example, the following code...

ulong x = v1.ExtractMostSignificantBits(); // v1 is `Vector512<long>`

roughly lowers to

vmovups zmm0, zmmword ptr[rcx]
vpmovq2m k1, zmm0
kmovb rax, k1

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 27, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 27, 2023
@ghost
Copy link

ghost commented Feb 27, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

PR addresses #80820.

This adds a new type, TYP_MASK which represents Intel opmask k registers on xarch platform. Enough infrastructure was added so that Vector512.ExtractMostSignificantBits could be hardware accelerated on AVX512 capable platforms, which requires the use of the k-mask registers.

A future PR will implement mask registers in full, which depends upon VectorMask API proposal. At the moment, only a single mask register will ever be used, and is used as an internal register for the ExtractMostSignificantBits method. Follow up PRs will extend this to the Vector512 comparison operators, which require a similar approach.

As an example, the following code...

ulong x = v1.ExtractMostSignificantBits(); // v1 is `Vector512<long>`

roughly lowers to

vmovups zmm0, zmmword ptr[rcx]
vpmovq2m k1, zmm0
kmovb rax, k1
Author: anthonycanino
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the 1st pass. I have reviewed just the portion that is not part of #80960

src/coreclr/jit/vartype.h Show resolved Hide resolved
src/coreclr/jit/lsra.cpp Show resolved Hide resolved
src/coreclr/jit/lsrabuild.cpp Show resolved Hide resolved
src/coreclr/jit/lsrabuild.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsiclistxarch.h Show resolved Hide resolved
src/coreclr/jit/targetx86.h Show resolved Hide resolved
@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Feb 28, 2023
@anthonycanino
Copy link
Contributor Author

@dotnet/avx512-contrib can we get a rerun? Looks like some stuff was timed out?

@anthonycanino
Copy link
Contributor Author

Looks like everything is passing, I have made the requested changes if you would like to give it another pass.

@kunalspathak
Copy link
Member

Changes LGTM. Once #80960 is merged, let us do a final superpmi measurement before merging.

@BruceForstall
Copy link
Member

@anthonycanino Please resolve the conflicts.

@anthonycanino anthonycanino force-pushed the avx512-extract-most-significant-bits branch from 6cf682d to 40a46eb Compare March 6, 2023 17:21
@anthonycanino
Copy link
Contributor Author

I just pulled the changes onto a fresh branch based on main after its dependent PR was merged.

@kunalspathak
Copy link
Member

Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check?

image

Also, I am surprised to see diffs in arm64.

There are few things that should be under TARGET_XARCH like MaskRegisterType TYP_MASK, buildInternalMaskRegisterDefForNode and few more in lsra. Likewise caller of IsKInstruction should wrap it inside #ifdef TARGET_XARCH.

@anthonycanino
Copy link
Contributor Author

Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check?

image

Also, I am surprised to see diffs in arm64.

There are few things that should be under TARGET_XARCH like MaskRegisterType TYP_MASK, buildInternalMaskRegisterDefForNode and few more in lsra. Likewise caller of IsKInstruction should wrap it inside #ifdef TARGET_XARCH.

Yes will do shortly.

src/coreclr/vm/threadsuspend.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/typelist.h Outdated Show resolved Hide resolved
@@ -64,6 +64,7 @@ DEF_TP(SIMD16 ,"simd16" , TYP_SIMD16, TI_STRUCT,16,16, 16, 4,16, VTF_S|VTF_
DEF_TP(SIMD32 ,"simd32" , TYP_SIMD32, TI_STRUCT,32,32, 32, 8,16, VTF_S|VTF_VEC)
DEF_TP(SIMD64 ,"simd64" , TYP_SIMD64, TI_STRUCT,64,64, 64, 16,16, VTF_S|VTF_VEC)
#endif // TARGET_XARCH
DEF_TP(MASK ,"mask" , TYP_MASK, TI_STRUCT,8, 8, 8, 2,8, VTF_S)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the type be VTF_S|VTF_VEC? That would let varTypeIsSIMD() be true; do we want that? E.g., it would mean varTypeUsesFloatReg returns true, which we probably don't want since TYP_MASK uses a new register type. @tannergooding

Also, I wonder if it should be VTF_S. It's not really a struct, is it, like the other SIMD types?

Maybe it should be VTF_ANY so it doesn't have any default handling. E.g., we probably don't want varTypeIsStruct() to return true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a struct and will eventually map to the VectorMask<T> type(s).

Under the hood we have just k0-k7 which are 64-bits and while the actual hardware manuals don't differentiate further than k#, the .NET and C++ intrinsics do. -- .NET has VectorMask#<T> while C++ has __mmask8/16/32/64. So VectorMask256<byte> is equivalent to __mmask16 since it has 16 entries.

Given that, is just MASK sufficient or do we need MASK8/16/32/64? If we have more than one, presumably we'd have this be VTF_S|VTF_MSK. -- I could see us going either way with there being different tradeoffs for each approach.

src/coreclr/jit/register.h Outdated Show resolved Hide resolved
src/coreclr/jit/vartype.h Show resolved Hide resolved
src/coreclr/jit/lsra.h Show resolved Hide resolved
src/coreclr/jit/emitxarch.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/hwintrinsiccodegenxarch.cpp Show resolved Hide resolved
@anthonycanino
Copy link
Contributor Author

Seems like a TP hit, possibly because of added checks for K-instruction in emitter? Can you double check?

image

Also, I am surprised to see diffs in arm64.

There are few things that should be under TARGET_XARCH like MaskRegisterType TYP_MASK, buildInternalMaskRegisterDefForNode and few more in lsra. Likewise caller of IsKInstruction should wrap it inside #ifdef TARGET_XARCH.

FYI, tpdff with no changes...

Overall (+0.32% to +0.52%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.47%
aspnet_block.run.windows.x64.checked.mch +0.52%
benchmarks.run.windows.x64.checked.mch +0.34%
coreclr_tests.run.windows.x64.checked.mch +0.40%
libraries.crossgen2.windows.x64.checked.mch +0.43%
libraries.pmi.windows.x64.checked.mch +0.40%
libraries_tests.pmi.windows.x64.checked.mch +0.32%
MinOpts (+0.53% to +1.18%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.98%
aspnet_block.run.windows.x64.checked.mch +0.90%
benchmarks.run.windows.x64.checked.mch +0.74%
coreclr_tests.run.windows.x64.checked.mch +0.60%
libraries.crossgen2.windows.x64.checked.mch +1.18%
libraries.pmi.windows.x64.checked.mch +0.82%
libraries_tests.pmi.windows.x64.checked.mch +0.53%
FullOpts (+0.23% to +0.43%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.37%
aspnet_block.run.windows.x64.checked.mch +0.40%
benchmarks.run.windows.x64.checked.mch +0.34%
coreclr_tests.run.windows.x64.checked.mch +0.23%
libraries.crossgen2.windows.x64.checked.mch +0.43%
libraries.pmi.windows.x64.checked.mch +0.40%
libraries_tests.pmi.windows.x64.checked.mch +0.31%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 137,122,434,940 137,770,274,733 +0.47%
aspnet_block.run.windows.x64.checked.mch 28,196,864,957 28,342,998,178 +0.52%
benchmarks.run.windows.x64.checked.mch 55,715,871,918 55,907,406,382 +0.34%
coreclr_tests.run.windows.x64.checked.mch 820,707,835,085 823,964,475,471 +0.40%
libraries.crossgen2.windows.x64.checked.mch 123,754,250,291 124,288,544,069 +0.43%
libraries.pmi.windows.x64.checked.mch 234,167,937,468 235,100,838,193 +0.40%
libraries_tests.pmi.windows.x64.checked.mch 506,101,763,851 507,705,699,987 +0.32%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 22,723,762,098 22,946,475,537 +0.98%
aspnet_block.run.windows.x64.checked.mch 6,648,732,841 6,708,641,851 +0.90%
benchmarks.run.windows.x64.checked.mch 427,800,839 430,954,243 +0.74%
coreclr_tests.run.windows.x64.checked.mch 367,123,809,127 369,344,596,817 +0.60%
libraries.crossgen2.windows.x64.checked.mch 1,712,929 1,733,155 +1.18%
libraries.pmi.windows.x64.checked.mch 1,337,999,031 1,348,970,414 +0.82%
libraries_tests.pmi.windows.x64.checked.mch 5,373,609,215 5,402,007,210 +0.53%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 114,398,672,842 114,823,799,196 +0.37%
aspnet_block.run.windows.x64.checked.mch 21,548,132,116 21,634,356,327 +0.40%
benchmarks.run.windows.x64.checked.mch 55,288,071,079 55,476,452,139 +0.34%
coreclr_tests.run.windows.x64.checked.mch 453,584,025,958 454,619,878,654 +0.23%
libraries.crossgen2.windows.x64.checked.mch 123,752,537,362 124,286,810,914 +0.43%
libraries.pmi.windows.x64.checked.mch 232,829,938,437 233,751,867,779 +0.40%
libraries_tests.pmi.windows.x64.checked.mch 500,728,154,636 502,303,692,777 +0.31%

tpdiff with most of IsKInstruction check commented out...

Overall (+0.17% to +0.35%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.32%
aspnet_block.run.windows.x64.checked.mch +0.33%
benchmarks.run.windows.x64.checked.mch +0.26%
coreclr_tests.run.windows.x64.checked.mch +0.17%
libraries.crossgen2.windows.x64.checked.mch +0.35%
libraries.pmi.windows.x64.checked.mch +0.30%
libraries_tests.pmi.windows.x64.checked.mch +0.22%
MinOpts (+0.20% to +0.86%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.49%
aspnet_block.run.windows.x64.checked.mch +0.42%
benchmarks.run.windows.x64.checked.mch +0.32%
coreclr_tests.run.windows.x64.checked.mch +0.22%
libraries.crossgen2.windows.x64.checked.mch +0.86%
libraries.pmi.windows.x64.checked.mch +0.39%
libraries_tests.pmi.windows.x64.checked.mch +0.20%
FullOpts (+0.14% to +0.35%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.28%
aspnet_block.run.windows.x64.checked.mch +0.30%
benchmarks.run.windows.x64.checked.mch +0.26%
coreclr_tests.run.windows.x64.checked.mch +0.14%
libraries.crossgen2.windows.x64.checked.mch +0.35%
libraries.pmi.windows.x64.checked.mch +0.30%
libraries_tests.pmi.windows.x64.checked.mch +0.22%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 137,122,327,070 137,555,517,439 +0.32%
aspnet_block.run.windows.x64.checked.mch 28,196,936,039 28,290,027,003 +0.33%
benchmarks.run.windows.x64.checked.mch 55,716,216,346 55,859,859,330 +0.26%
coreclr_tests.run.windows.x64.checked.mch 820,707,397,894 822,143,099,586 +0.17%
libraries.crossgen2.windows.x64.checked.mch 123,754,441,178 124,186,483,562 +0.35%
libraries.pmi.windows.x64.checked.mch 234,169,440,084 234,865,933,264 +0.30%
libraries_tests.pmi.windows.x64.checked.mch 506,106,782,715 507,212,774,247 +0.22%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 22,723,762,125 22,835,457,571 +0.49%
aspnet_block.run.windows.x64.checked.mch 6,648,733,217 6,676,854,359 +0.42%
benchmarks.run.windows.x64.checked.mch 427,800,890 429,168,439 +0.32%
coreclr_tests.run.windows.x64.checked.mch 367,123,738,248 367,915,727,518 +0.22%
libraries.crossgen2.windows.x64.checked.mch 1,712,930 1,727,636 +0.86%
libraries.pmi.windows.x64.checked.mch 1,337,998,813 1,343,217,279 +0.39%
libraries_tests.pmi.windows.x64.checked.mch 5,373,613,051 5,384,456,794 +0.20%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 114,398,564,945 114,720,059,868 +0.28%
aspnet_block.run.windows.x64.checked.mch 21,548,202,822 21,613,172,644 +0.30%
benchmarks.run.windows.x64.checked.mch 55,288,415,456 55,430,690,891 +0.26%
coreclr_tests.run.windows.x64.checked.mch 453,583,659,646 454,227,372,068 +0.14%
libraries.crossgen2.windows.x64.checked.mch 123,752,728,248 124,184,755,926 +0.35%
libraries.pmi.windows.x64.checked.mch 232,831,441,271 233,522,715,985 +0.30%
libraries_tests.pmi.windows.x64.checked.mch 500,733,169,664 501,828,317,453 +0.22%

Seems like it was having an impact.

Let's see what the tpdiff looks like now --- most everything was behind proper ifdefs before, but availableMaskRegs instance variable was not, as well as TYP_MASK which would affect arm slightly. I've since moved.

@anthonycanino
Copy link
Contributor Author

image

@kunalspathak Looks better.

@BruceForstall I mostly addressed your comments, with one caveat --- right now, TYP_MASK is defined if FEATURE_SIMD is present and for XARCH only, consistent with SIMD32 and SIMD64. This creates a bit of ifdef dependency.

@BruceForstall
Copy link
Member

@anthonycanino Where is the TP regression coming from?

@anthonycanino
Copy link
Contributor Author

@anthonycanino Where is the TP regression coming from?

Investigating with pin now.

@BruceForstall
Copy link
Member

Diffs

Throughput impact on Windows x64

The following shows the impact on throughput in terms of number of instructions executed inside the JIT. Negative percentages/lower numbers are better.

linux arm64

Overall (-0.02% to -0.01%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch -0.01%
coreclr_tests.run.linux.arm64.checked.mch -0.01%
libraries.crossgen2.linux.arm64.checked.mch -0.02%
libraries.pmi.linux.arm64.checked.mch -0.01%
libraries_tests.pmi.linux.arm64.checked.mch -0.01%
MinOpts (-0.08% to -0.01%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch -0.06%
coreclr_tests.run.linux.arm64.checked.mch -0.01%
libraries.crossgen2.linux.arm64.checked.mch -0.08%
libraries.pmi.linux.arm64.checked.mch -0.03%
libraries_tests.pmi.linux.arm64.checked.mch -0.01%
FullOpts (-0.02% to -0.00%)
Collection PDIFF
benchmarks.run.linux.arm64.checked.mch -0.01%
libraries.crossgen2.linux.arm64.checked.mch -0.02%
libraries.pmi.linux.arm64.checked.mch -0.01%
libraries_tests.pmi.linux.arm64.checked.mch -0.01%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.arm64.checked.mch 71,829,708,758 71,825,058,259 -0.01%
coreclr_tests.run.linux.arm64.checked.mch 1,120,118,011,248 1,120,045,044,886 -0.01%
libraries.crossgen2.linux.arm64.checked.mch 126,407,564,245 126,385,333,940 -0.02%
libraries.pmi.linux.arm64.checked.mch 239,547,008,704 239,517,121,722 -0.01%
libraries_tests.pmi.linux.arm64.checked.mch 573,413,222,714 573,370,564,634 -0.01%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.arm64.checked.mch 1,185,157,334 1,184,421,151 -0.06%
coreclr_tests.run.linux.arm64.checked.mch 464,245,669,694 464,200,751,165 -0.01%
libraries.crossgen2.linux.arm64.checked.mch 1,959,736 1,958,112 -0.08%
libraries.pmi.linux.arm64.checked.mch 1,644,582,602 1,644,028,948 -0.03%
libraries_tests.pmi.linux.arm64.checked.mch 7,106,434,127 7,105,485,508 -0.01%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.arm64.checked.mch 70,644,551,424 70,640,637,108 -0.01%
coreclr_tests.run.linux.arm64.checked.mch 655,872,341,554 655,844,293,721 -0.00%
libraries.crossgen2.linux.arm64.checked.mch 126,405,604,509 126,383,375,828 -0.02%
libraries.pmi.linux.arm64.checked.mch 237,902,426,102 237,873,092,774 -0.01%
libraries_tests.pmi.linux.arm64.checked.mch 566,306,788,587 566,265,079,126 -0.01%

osx arm64

Overall (-0.02% to -0.01%)
Collection PDIFF
benchmarks.run.osx.arm64.checked.mch -0.01%
coreclr_tests.run.osx.arm64.checked.mch -0.01%
libraries.crossgen2.osx.arm64.checked.mch -0.02%
libraries.pmi.osx.arm64.checked.mch -0.01%
libraries_tests.pmi.osx.arm64.checked.mch -0.01%
MinOpts (-0.08% to -0.01%)
Collection PDIFF
benchmarks.run.osx.arm64.checked.mch -0.03%
coreclr_tests.run.osx.arm64.checked.mch -0.01%
libraries.crossgen2.osx.arm64.checked.mch -0.08%
libraries.pmi.osx.arm64.checked.mch -0.03%
libraries_tests.pmi.osx.arm64.checked.mch -0.01%
FullOpts (-0.02% to -0.00%)
Collection PDIFF
benchmarks.run.osx.arm64.checked.mch -0.01%
libraries.crossgen2.osx.arm64.checked.mch -0.02%
libraries.pmi.osx.arm64.checked.mch -0.01%
libraries_tests.pmi.osx.arm64.checked.mch -0.01%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.osx.arm64.checked.mch 38,164,974,081 38,162,491,846 -0.01%
coreclr_tests.run.osx.arm64.checked.mch 859,418,196,995 859,356,913,472 -0.01%
libraries.crossgen2.osx.arm64.checked.mch 133,045,192,209 133,021,837,042 -0.02%
libraries.pmi.osx.arm64.checked.mch 248,363,814,282 248,331,667,102 -0.01%
libraries_tests.pmi.osx.arm64.checked.mch 546,442,041,072 546,399,704,570 -0.01%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.osx.arm64.checked.mch 574,094,530 573,941,898 -0.03%
coreclr_tests.run.osx.arm64.checked.mch 461,565,891,532 461,521,207,939 -0.01%
libraries.crossgen2.osx.arm64.checked.mch 2,111,547 2,109,807 -0.08%
libraries.pmi.osx.arm64.checked.mch 1,699,673,406 1,699,089,772 -0.03%
libraries_tests.pmi.osx.arm64.checked.mch 7,623,536,291 7,622,404,286 -0.01%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.osx.arm64.checked.mch 37,590,879,551 37,588,549,948 -0.01%
coreclr_tests.run.osx.arm64.checked.mch 397,852,305,463 397,835,705,533 -0.00%
libraries.crossgen2.osx.arm64.checked.mch 133,043,080,662 133,019,727,235 -0.02%
libraries.pmi.osx.arm64.checked.mch 246,664,140,876 246,632,577,330 -0.01%
libraries_tests.pmi.osx.arm64.checked.mch 538,818,504,781 538,777,300,284 -0.01%

linux x64

Overall (+0.13% to +0.26%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.13%
coreclr_tests.run.linux.x64.checked.mch +0.26%
libraries.crossgen2.linux.x64.checked.mch +0.16%
libraries.pmi.linux.x64.checked.mch +0.16%
libraries_tests.pmi.linux.x64.checked.mch +0.14%
MinOpts (+0.42% to +0.67%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.67%
coreclr_tests.run.linux.x64.checked.mch +0.45%
libraries.crossgen2.linux.x64.checked.mch +0.64%
libraries.pmi.linux.x64.checked.mch +0.58%
libraries_tests.pmi.linux.x64.checked.mch +0.42%
FullOpts (+0.12% to +0.16%)
Collection PDIFF
benchmarks.run.linux.x64.checked.mch +0.13%
coreclr_tests.run.linux.x64.checked.mch +0.12%
libraries.crossgen2.linux.x64.checked.mch +0.16%
libraries.pmi.linux.x64.checked.mch +0.16%
libraries_tests.pmi.linux.x64.checked.mch +0.14%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.x64.checked.mch 69,973,416,200 70,067,485,678 +0.13%
coreclr_tests.run.linux.x64.checked.mch 870,208,033,751 872,470,252,016 +0.26%
libraries.crossgen2.linux.x64.checked.mch 92,833,460,634 92,985,284,214 +0.16%
libraries.pmi.linux.x64.checked.mch 223,281,109,753 223,646,234,128 +0.16%
libraries_tests.pmi.linux.x64.checked.mch 512,114,946,866 512,853,456,023 +0.14%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.x64.checked.mch 931,862,432 938,085,561 +0.67%
coreclr_tests.run.linux.x64.checked.mch 372,511,404,092 374,175,448,655 +0.45%
libraries.crossgen2.linux.x64.checked.mch 1,455,829 1,465,121 +0.64%
libraries.pmi.linux.x64.checked.mch 1,323,192,173 1,330,840,704 +0.58%
libraries_tests.pmi.linux.x64.checked.mch 6,268,867,376 6,294,950,342 +0.42%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.x64.checked.mch 69,041,553,768 69,129,400,117 +0.13%
coreclr_tests.run.linux.x64.checked.mch 497,696,629,659 498,294,803,361 +0.12%
libraries.crossgen2.linux.x64.checked.mch 92,832,004,805 92,983,819,093 +0.16%
libraries.pmi.linux.x64.checked.mch 221,957,917,580 222,315,393,424 +0.16%
libraries_tests.pmi.linux.x64.checked.mch 505,846,079,490 506,558,505,681 +0.14%

windows arm64

Overall (-0.02% to -0.01%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch -0.01%
coreclr_tests.run.windows.arm64.checked.mch -0.01%
libraries.crossgen2.windows.arm64.checked.mch -0.02%
libraries.pmi.windows.arm64.checked.mch -0.01%
libraries_tests.pmi.windows.arm64.checked.mch -0.01%
MinOpts (-0.08% to -0.01%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch -0.03%
coreclr_tests.run.windows.arm64.checked.mch -0.01%
libraries.crossgen2.windows.arm64.checked.mch -0.08%
libraries.pmi.windows.arm64.checked.mch -0.03%
libraries_tests.pmi.windows.arm64.checked.mch -0.02%
FullOpts (-0.02% to -0.00%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch -0.01%
libraries.crossgen2.windows.arm64.checked.mch -0.02%
libraries.pmi.windows.arm64.checked.mch -0.01%
libraries_tests.pmi.windows.arm64.checked.mch -0.01%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.windows.arm64.checked.mch 54,201,662,021 54,198,199,687 -0.01%
coreclr_tests.run.windows.arm64.checked.mch 1,051,165,432,682 1,051,097,135,354 -0.01%
libraries.crossgen2.windows.arm64.checked.mch 143,600,959,958 143,576,062,944 -0.02%
libraries.pmi.windows.arm64.checked.mch 251,844,611,656 251,813,458,793 -0.01%
libraries_tests.pmi.windows.arm64.checked.mch 551,090,992,404 551,050,520,240 -0.01%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.windows.arm64.checked.mch 545,947,019 545,795,218 -0.03%
coreclr_tests.run.windows.arm64.checked.mch 465,525,194,605 465,479,805,506 -0.01%
libraries.crossgen2.windows.arm64.checked.mch 2,118,887 2,117,147 -0.08%
libraries.pmi.windows.arm64.checked.mch 1,733,086,668 1,732,517,304 -0.03%
libraries_tests.pmi.windows.arm64.checked.mch 5,936,698,580 5,935,782,255 -0.02%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.windows.arm64.checked.mch 53,655,715,002 53,652,404,469 -0.01%
coreclr_tests.run.windows.arm64.checked.mch 585,640,238,077 585,617,329,848 -0.00%
libraries.crossgen2.windows.arm64.checked.mch 143,598,841,071 143,573,945,797 -0.02%
libraries.pmi.windows.arm64.checked.mch 250,111,524,988 250,080,941,489 -0.01%
libraries_tests.pmi.windows.arm64.checked.mch 545,154,293,824 545,114,737,985 -0.01%

windows x64

Overall (+0.11% to +0.27%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.22%
aspnet_block.run.windows.x64.checked.mch +0.25%
benchmarks.run.windows.x64.checked.mch +0.11%
coreclr_tests.run.windows.x64.checked.mch +0.27%
libraries.crossgen2.windows.x64.checked.mch +0.16%
libraries.pmi.windows.x64.checked.mch +0.15%
libraries_tests.pmi.windows.x64.checked.mch +0.14%
MinOpts (+0.43% to +0.66%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.66%
aspnet_block.run.windows.x64.checked.mch +0.64%
benchmarks.run.windows.x64.checked.mch +0.58%
coreclr_tests.run.windows.x64.checked.mch +0.47%
libraries.crossgen2.windows.x64.checked.mch +0.66%
libraries.pmi.windows.x64.checked.mch +0.62%
libraries_tests.pmi.windows.x64.checked.mch +0.43%
FullOpts (+0.11% to +0.16%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +0.12%
aspnet_block.run.windows.x64.checked.mch +0.13%
benchmarks.run.windows.x64.checked.mch +0.11%
coreclr_tests.run.windows.x64.checked.mch +0.11%
libraries.crossgen2.windows.x64.checked.mch +0.16%
libraries.pmi.windows.x64.checked.mch +0.15%
libraries_tests.pmi.windows.x64.checked.mch +0.13%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 140,813,652,137 141,129,728,577 +0.22%
aspnet_block.run.windows.x64.checked.mch 28,217,820,407 28,288,450,930 +0.25%
benchmarks.run.windows.x64.checked.mch 55,793,824,417 55,856,927,078 +0.11%
coreclr_tests.run.windows.x64.checked.mch 821,229,287,058 823,456,845,113 +0.27%
libraries.crossgen2.windows.x64.checked.mch 123,855,428,806 124,049,862,818 +0.16%
libraries.pmi.windows.x64.checked.mch 234,485,615,873 234,837,654,741 +0.15%
libraries_tests.pmi.windows.x64.checked.mch 507,142,444,192 507,833,290,175 +0.14%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 27,529,857,925 27,712,507,266 +0.66%
aspnet_block.run.windows.x64.checked.mch 6,648,727,281 6,691,462,037 +0.64%
benchmarks.run.windows.x64.checked.mch 427,797,145 430,288,887 +0.58%
coreclr_tests.run.windows.x64.checked.mch 367,134,396,157 368,860,772,214 +0.47%
libraries.crossgen2.windows.x64.checked.mch 1,712,897 1,724,206 +0.66%
libraries.pmi.windows.x64.checked.mch 1,337,982,864 1,346,240,014 +0.62%
libraries_tests.pmi.windows.x64.checked.mch 5,373,442,496 5,396,620,061 +0.43%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 113,283,794,212 113,417,221,311 +0.12%
aspnet_block.run.windows.x64.checked.mch 21,569,093,126 21,596,988,893 +0.13%
benchmarks.run.windows.x64.checked.mch 55,366,027,272 55,426,638,191 +0.11%
coreclr_tests.run.windows.x64.checked.mch 454,094,890,901 454,596,072,899 +0.11%
libraries.crossgen2.windows.x64.checked.mch 123,853,715,909 124,048,138,612 +0.16%
libraries.pmi.windows.x64.checked.mch 233,147,633,009 233,491,414,727 +0.15%
libraries_tests.pmi.windows.x64.checked.mch 501,769,001,696 502,436,670,114 +0.13%

Throughput impact on Windows x86

The following shows the impact on throughput in terms of number of instructions executed inside the JIT. Negative percentages/lower numbers are better.

linux arm

Overall (-0.01% to -0.00%)
Collection PDIFF
libraries.crossgen2.linux.arm.checked.mch -0.01%
MinOpts (-0.03% to -0.00%)
Collection PDIFF
benchmarks.run.linux.arm.checked.mch -0.02%
libraries.crossgen2.linux.arm.checked.mch -0.03%
libraries.pmi.linux.arm.checked.mch -0.01%
FullOpts (-0.01% to -0.00%)
Collection PDIFF
libraries.crossgen2.linux.arm.checked.mch -0.01%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.arm.checked.mch 88,925,509,972 88,923,283,376 -0.00%
coreclr_tests.run.linux.arm.checked.mch 1,077,827,852,065 1,077,805,328,402 -0.00%
libraries.crossgen2.linux.arm.checked.mch 88,933,986,476 88,928,563,121 -0.01%
libraries.pmi.linux.arm.checked.mch 243,437,368,111 243,426,162,137 -0.00%
libraries_tests.pmi.linux.arm.checked.mch 508,087,880,991 508,072,032,434 -0.00%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.arm.checked.mch 1,661,375,568 1,660,964,746 -0.02%
coreclr_tests.run.linux.arm.checked.mch 443,043,133,437 443,030,285,490 -0.00%
libraries.crossgen2.linux.arm.checked.mch 2,144,707 2,144,083 -0.03%
libraries.pmi.linux.arm.checked.mch 1,656,826,419 1,656,621,067 -0.01%
libraries_tests.pmi.linux.arm.checked.mch 7,321,475,376 7,321,117,500 -0.00%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.linux.arm.checked.mch 87,264,134,404 87,262,318,630 -0.00%
coreclr_tests.run.linux.arm.checked.mch 634,784,718,628 634,775,042,912 -0.00%
libraries.crossgen2.linux.arm.checked.mch 88,931,841,769 88,926,419,038 -0.01%
libraries.pmi.linux.arm.checked.mch 241,780,541,692 241,769,541,070 -0.00%
libraries_tests.pmi.linux.arm.checked.mch 500,766,405,615 500,750,914,934 -0.00%

windows x86

Overall (+0.06% to +0.10%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.07%
coreclr_tests.run.windows.x86.checked.mch +0.10%
libraries.crossgen2.windows.x86.checked.mch +0.08%
libraries.pmi.windows.x86.checked.mch +0.08%
libraries_tests.pmi.windows.x86.checked.mch +0.06%
MinOpts (+0.15% to +0.28%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.23%
coreclr_tests.run.windows.x86.checked.mch +0.20%
libraries.crossgen2.windows.x86.checked.mch +0.28%
libraries.pmi.windows.x86.checked.mch +0.24%
libraries_tests.pmi.windows.x86.checked.mch +0.15%
FullOpts (+0.05% to +0.08%)
Collection PDIFF
benchmarks.run.windows.x86.checked.mch +0.06%
coreclr_tests.run.windows.x86.checked.mch +0.05%
libraries.crossgen2.windows.x86.checked.mch +0.08%
libraries.pmi.windows.x86.checked.mch +0.08%
libraries_tests.pmi.windows.x86.checked.mch +0.06%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.windows.x86.checked.mch 63,484,806,822 63,526,339,468 +0.07%
coreclr_tests.run.windows.x86.checked.mch 869,224,259,905 870,134,426,998 +0.10%
libraries.crossgen2.windows.x86.checked.mch 129,233,207,301 129,339,015,895 +0.08%
libraries.pmi.windows.x86.checked.mch 263,408,433,055 263,621,679,397 +0.08%
libraries_tests.pmi.windows.x86.checked.mch 547,530,257,762 547,879,527,062 +0.06%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.windows.x86.checked.mch 413,217,394 414,153,173 +0.23%
coreclr_tests.run.windows.x86.checked.mch 309,589,082,449 310,208,578,985 +0.20%
libraries.crossgen2.windows.x86.checked.mch 2,103,307 2,109,257 +0.28%
libraries.pmi.windows.x86.checked.mch 1,313,693,745 1,316,901,905 +0.24%
libraries_tests.pmi.windows.x86.checked.mch 5,375,510,184 5,383,557,156 +0.15%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
benchmarks.run.windows.x86.checked.mch 63,071,589,428 63,112,186,295 +0.06%
coreclr_tests.run.windows.x86.checked.mch 559,635,177,456 559,925,848,013 +0.05%
libraries.crossgen2.windows.x86.checked.mch 129,231,103,994 129,336,906,638 +0.08%
libraries.pmi.windows.x86.checked.mch 262,094,739,310 262,304,777,492 +0.08%
libraries_tests.pmi.windows.x86.checked.mch 542,154,747,578 542,495,969,906 +0.06%

Comment on lines +759 to +762
else if (thisType == TYP_MASK)
{
availableRegs[i] = &availableMaskRegs;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a general case like the other where integers are the most common, but they're at the bottom of the checked cases for the loop.

Presumably changing this to either a switch or to prioritize setting availableIntRegs as the first path would be beneficial so we don't have "one extra check" before hitting the path.


Actually, looking at this, this entire thing is just very unnecessarily expensive. We could track the classification in the typelist.h so we could do:

#define DEF_TP(tn, nm, jitType, verType, sz, sze, asze, st, al, tf, ar) availableRegs[static_cast<int>(TYP_##tn)] = &ar;
#include "typelist.h"
#undef DEF_TP
    TYP_COUNT

This would involve no loops, no complex checks as to type kind/etc.

@kunalspathak, thoughts? Probably not that impactful overall, since its just part of the LSRA constructor; but still seems "meaningful" given we have around 25 TYP_* defines and this is doing 4-5 checks to initialize 17 of them and 3-4 checks to initialize 6 others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- Not saying this needs to be handled in this PR, its something we can do as a follow up if its something we want to do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated #83109 to take care of it.

Comment on lines +1609 to +1610
PhasedVar<regMaskTP> availableFloatRegs;
PhasedVar<regMaskTP> availableDoubleRegs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kunalspathak, it's not clear why we have availableFloatRegs and availableDoubleRegs. Could you elaborate a bit?

Given that availableDoubleRegs is used to account for both double and TYP_SIMD and that float/double/simd are all the "same registers", it's not immediately clear why we have the split.

Even on Arm32 where you have 1x double register taking up 2x float registers, it seems like it'd be simpler/easier to track this as a single "register file" and for double to simply take up 2 bits.

Copy link
Member

@tannergooding tannergooding Mar 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-- Not something to handle in this PR, more making observations and asking questions as this code is being touched in general and looking for potential opportunities to reduce the impact of changes like this, particularly given how impactful LSRA changes are to overall TP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't know historical reason behind having 2 separate variables for floats and double although we have #define RBM_ALLDOUBLE RBM_ALLFLOAT in all targets except ARM. I will have to spend some time in understanding the implication if we unify them.

@anthonycanino
Copy link
Contributor Author

Does this look good now? The throughput numbers seem pretty good.

@BruceForstall
Copy link
Member

Diffs -- looks good.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Would like threadsuspend.cpp removed from the change.

Would like @tannergooding to also review.

src/coreclr/vm/threadsuspend.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/instrsxarch.h Outdated Show resolved Hide resolved
@anthonycanino
Copy link
Contributor Author

I think I've addressed all issues?

@tannergooding
Copy link
Member

Will get this reviewed a bit later tonight and should be able to get it merged at that time.

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Mar 10, 2023
Comment on lines +44 to +45
// Technically, K instructions would be considered under the VEX encoding umbrella, but due to
// the instruction table encoding had to be pulled out with the rest of the `INST5` definitions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs to be updated slightly since KInstructions are part of the INST3 group now.

@@ -46,7 +53,7 @@ bool emitter::IsSSEOrAVXInstruction(instruction ins)
bool emitter::IsAvx512OrPriorInstruction(instruction ins)
{
// TODO-XArch-AVX512: Fix check once AVX512 instructions are added.
return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION);
return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
return ((ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION));
return (ins >= INS_FIRST_SSE_INSTRUCTION) && (ins <= INS_LAST_AVX512_INSTRUCTION);

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants