JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385

jakobbotsch · 2023-07-04T15:25:04Z

In #88376 (comment) I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex benchmark that looked like this:

 G_M62987_IG13:
        and      edi, esi
        mov      ebx, dword ptr [rbp+88H]
-       mov      ecx, 1
        test     ebx, ebx
-       cmovle   ebx, ecx
+       jg       SHORT G_M62987_IG15
+						;; size=12 bbWeight=1.00 PerfScore 2.50
+G_M62987_IG14:
+       mov      ebx, 1
+						;; size=5 bbWeight=0.98 PerfScore 0.24
+G_M62987_IG15:

Investigating, it turns out to be caused by a check in if-conversion for detecting loops, that is using a very sharp boundary. It turns out that the integral block weights are exactly the same (in both the base and diff), but the floating point calculation ended up with exactly 1 and very close to 1 in the base/diff cases respectively.

So two changes to address it:

Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing
Switch if-conversion to have a 5% tolerance so that losing just a single count in the wrong place will not cause a decision change. Note that the check here is used as a cheap "is inside loop" check.

…nversion In dotnet#88376 (comment) I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex benchmark that looked like this: ```diff G_M62987_IG13: and edi, esi mov ebx, dword ptr [rbp+88H] - mov ecx, 1 test ebx, ebx - cmovle ebx, ecx + jg SHORT G_M62987_IG15 + ;; size=12 bbWeight=1.00 PerfScore 2.50 +G_M62987_IG14: + mov ebx, 1 + ;; size=5 bbWeight=0.98 PerfScore 0.24 +G_M62987_IG15: ``` Investigating, it turns out to be caused by a check in if-conversion for detecting loops, that is using a very sharp boundary. It turns out that the integral block weights are exactly the same (in both the base and diff), but the floating point calculation ended up with exactly 1 and very close to 1 in the base/diff cases respectively. So two changes to address it: * Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing * Switch if-conversion to have a 5% tolerance so that losing just a single count in the wrong place will not cause a decision change. Note that the check here is used as a cheap "is inside loop" check.

ghost · 2023-07-04T15:25:19Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

In #88376 (comment) I noticed a spurious diff in the System.Tests.Perf_Int32.ToStringHex benchmark that looked like this:

 G_M62987_IG13:
        and      edi, esi
        mov      ebx, dword ptr [rbp+88H]
-       mov      ecx, 1
        test     ebx, ebx
-       cmovle   ebx, ecx
+       jg       SHORT G_M62987_IG15
+						;; size=12 bbWeight=1.00 PerfScore 2.50
+G_M62987_IG14:
+       mov      ebx, 1
+						;; size=5 bbWeight=0.98 PerfScore 0.24
+G_M62987_IG15:

Investigating, it turns out to be caused by a check in if-conversion for detecting loops, that is using a very sharp boundary. It turns out that the integral block weights are exactly the same (in both the base and diff), but the floating point calculation ended up with exactly 1 and very close to 1 in the base/diff cases respectively.

So two changes to address it:

Switch getBBWeight to multiply by BB_UNITY_WEIGHT after dividing
Switch if-conversion to have a 5% tolerance so that losing just a single count in the wrong place will not cause a decision change. Note that the check here is used as a cheap "is inside loop" check.

Author:	jakobbotsch
Assignees:	jakobbotsch
Labels:	`area-CodeGen-coreclr`
Milestone:	-

src/coreclr/jit/block.cpp

AndyAyersMS · 2023-07-05T15:37:32Z

src/coreclr/jit/ifconversion.cpp

        // Detect via the block weight as that will be high when inside a loop.
-        if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT)
+
+        if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)


Would be nice to stylize these comparisons, so maybe add a new fgProfileWeight... helper here?

In particular these sorts of checks are often parts of profitability heuristics which we might want to vary under stress and/or expose to some machine learning mechanism.

I don't think playing around with this threshold makes much sense given its use as an "is inside loop" check. The precise check happens below when this check is false.

This is not really checking if the block is likely to be inside a loop, it is checking if the block is likely to be inside a frequently executed loop.

If we assume profile data is representative, then perhaps this is fine; if we think it might not be representative, then perhaps we should be checking bbNatLoopNum and if set, either bypass the conversion, or go on to compare the weight of this block to the weight of the loop entry (so either "in a loop" or "frequently executed wrt the loop"). I wonder if this would subsume the need for an optReachable check.

Do we know if the cases in #82106 did not get handled by our loop recognition? I don't recall why we didn't check this initially -- do we no longer trust the loop table? Seems like it might still be good enough.

The regression in #82106 was because we if-converted inside an unnatural loop that was not recognized by our loop recognition. The loop is this code:

runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs

Lines 87 to 137 in 0202b24

do

{

// Make sure we don't go out of bounds

Debug.Assert(offset + ch1ch2Distance + Vector256<ushort>.Count <= searchSpaceLength);

Vector256<ushort> cmpCh2 = Vector256.Equals(ch2, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)(offset + ch1ch2Distance)));

Vector256<ushort> cmpCh1 = Vector256.Equals(ch1, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)offset));

Vector256<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

// Early out: cmpAnd is all zeros

if (cmpAnd != Vector256<byte>.Zero)

{

goto CANDIDATE_FOUND;

}

LOOP_FOOTER:

offset += Vector256<ushort>.Count;

if (offset == searchSpaceMinusValueTailLength)

return -1;

// Overlap with the current chunk for trailing elements

if (offset > searchSpaceMinusValueTailLengthAndVector)

offset = searchSpaceMinusValueTailLengthAndVector;

continue;

CANDIDATE_FOUND:

uint mask = cmpAnd.ExtractMostSignificantBits();

do

{

int bitPos = BitOperations.TrailingZeroCount(mask);

// div by 2 (shr) because we work with 2-byte chars

nint charPos = (nint)((uint)bitPos / 2);

if (valueLength == 2 || // we already matched two chars

SequenceEqual(

ref Unsafe.As<char, byte>(ref Unsafe.Add(ref searchSpace, offset + charPos)),

ref Unsafe.As<char, byte>(ref value), (nuint)(uint)valueLength * 2))

{

return (int)(offset + charPos);

}

// Clear two the lowest set bits

if (Bmi1.IsSupported)

mask = Bmi1.ResetLowestSetBit(Bmi1.ResetLowestSetBit(mask));

else

mask &= ~(uint)(0b11 << bitPos);

} while (mask != 0);

goto LOOP_FOOTER;

} while (true);

That's why we ended up with optReachable here.

There is an alternative to simply get rid of this block weight check.

Ok... let's leave this as is for now.

@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

Co-authored-by: Andy Ayers <andya@microsoft.com>

AndyAyersMS · 2023-07-05T17:25:02Z

src/coreclr/jit/ifconversion.cpp

        // Detect via the block weight as that will be high when inside a loop.
-        if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT)
+
+        if (m_startBlock->getBBWeight(m_comp) > BB_UNITY_WEIGHT * 1.05)


Ok... let's leave this as is for now.

@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

BruceForstall · 2023-07-05T18:36:07Z

@BruceForstall do we have a work item for loops we ought to recognize but don't? I see #43713 so perhaps that's good enough for now.

That's the issue I'd reference

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 4, 2023

ghost assigned jakobbotsch Jul 4, 2023

Run jit-format

7fa5c32

EgorBo mentioned this pull request Jul 5, 2023

Intrinsify Array GetArrayDataReference for SZ arrays #87374

Merged

AndyAyersMS reviewed Jul 5, 2023

View reviewed changes

Update src/coreclr/jit/block.cpp

ead4313

Co-authored-by: Andy Ayers <andya@microsoft.com>

AndyAyersMS approved these changes Jul 5, 2023

View reviewed changes

jakobbotsch merged commit 3f19af8 into dotnet:main Jul 5, 2023
125 checks passed

jakobbotsch deleted the if-conversion-weight-tolerance branch July 5, 2023 18:37

EgorBo mentioned this pull request Jul 13, 2023

[Perf] Linux/arm64: 2 Regressions on 7/6/2023 2:30:01 AM dotnet/perf-autofiling-issues#19786

Closed

jakobbotsch mentioned this pull request Jul 19, 2023

[Perf] Windows/x64: 13 Regressions on 7/1/2023 10:14:52 PM #88376

Closed

ghost locked as resolved and limited conversation to collaborators Aug 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385

JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385

jakobbotsch commented Jul 4, 2023

ghost commented Jul 4, 2023

AndyAyersMS Jul 5, 2023

jakobbotsch Jul 5, 2023

AndyAyersMS Jul 5, 2023

jakobbotsch Jul 5, 2023

AndyAyersMS Jul 5, 2023

AndyAyersMS Jul 5, 2023

BruceForstall commented Jul 5, 2023

	do
	{
	// Make sure we don't go out of bounds
	Debug.Assert(offset + ch1ch2Distance + Vector256<ushort>.Count <= searchSpaceLength);

	Vector256<ushort> cmpCh2 = Vector256.Equals(ch2, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)(offset + ch1ch2Distance)));
	Vector256<ushort> cmpCh1 = Vector256.Equals(ch1, Vector256.LoadUnsafe(ref ushortSearchSpace, (nuint)offset));
	Vector256<byte> cmpAnd = (cmpCh1 & cmpCh2).AsByte();

	// Early out: cmpAnd is all zeros
	if (cmpAnd != Vector256<byte>.Zero)
	{
	goto CANDIDATE_FOUND;
	}

	LOOP_FOOTER:
	offset += Vector256<ushort>.Count;

	if (offset == searchSpaceMinusValueTailLength)
	return -1;

	// Overlap with the current chunk for trailing elements
	if (offset > searchSpaceMinusValueTailLengthAndVector)
	offset = searchSpaceMinusValueTailLengthAndVector;

	continue;

	CANDIDATE_FOUND:
	uint mask = cmpAnd.ExtractMostSignificantBits();
	do
	{
	int bitPos = BitOperations.TrailingZeroCount(mask);
	// div by 2 (shr) because we work with 2-byte chars
	nint charPos = (nint)((uint)bitPos / 2);
	if (valueLength == 2 \|\| // we already matched two chars
	SequenceEqual(
	ref Unsafe.As<char, byte>(ref Unsafe.Add(ref searchSpace, offset + charPos)),
	ref Unsafe.As<char, byte>(ref value), (nuint)(uint)valueLength * 2))
	{
	return (int)(offset + charPos);
	}

	// Clear two the lowest set bits
	if (Bmi1.IsSupported)
	mask = Bmi1.ResetLowestSetBit(Bmi1.ResetLowestSetBit(mask));
	else
	mask &= ~(uint)(0b11 << bitPos);
	} while (mask != 0);
	goto LOOP_FOOTER;

	} while (true);

JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385

JIT: Compute BB weights with higher precision, use tolerance in if-conversion #88385

Conversation

jakobbotsch commented Jul 4, 2023

ghost commented Jul 4, 2023

AndyAyersMS Jul 5, 2023

Choose a reason for hiding this comment

jakobbotsch Jul 5, 2023

Choose a reason for hiding this comment

AndyAyersMS Jul 5, 2023

Choose a reason for hiding this comment

jakobbotsch Jul 5, 2023

Choose a reason for hiding this comment

AndyAyersMS Jul 5, 2023

Choose a reason for hiding this comment

AndyAyersMS Jul 5, 2023

Choose a reason for hiding this comment

BruceForstall commented Jul 5, 2023