
Optimize MemoryMarshal.Cast for same type #100750

Closed
xtqqczze wants to merge 1 commit

Conversation

xtqqczze (Contributor) commented Apr 7, 2024

Follow-up to #99835.

Diffs

The diffs look mostly positive, although there are some large code-size regressions caused by additional inlining being enabled.

cc: @stephentoub


Tagging subscribers to this area: @dotnet/interop-contrib
See info in area-owners.md if you want to be subscribed.

EgorBo (Member) commented Apr 8, 2024

Didn't we discuss in #99835 (comment) that this fast path makes no sense?

jakobbotsch (Member) commented Apr 8, 2024

Just echoing what @EgorBo said: due to the stall mentioned there, it's not clearly a win even if the codegen "looks" better. For example:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

BenchmarkSwitcher.FromAssembly(typeof(Benchmark).Assembly).Run(args);

public class Benchmark
{
    private int[] _values = Enumerable.Range(0, 10000).Select(_ => Random.Shared.Next(2)).ToArray();

    [Benchmark]
    public void BenchBase()
    {
        Span<int> s;
        for (int i = 0; i < 10000; i++)
        {
            TestBase(_values.AsSpan(0, i), out s);
        }
    }

    [Benchmark]
    public void BenchDiff()
    {
        Span<int> s;
        for (int i = 0; i < 10000; i++)
        {
            TestDiff(_values.AsSpan(0, i), out s);
        }
    }

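    // NoInlining keeps each cast behind a real call, so the loop measures the
    // per-call cost of the cast (including any spill/reload) rather than work
    // the JIT could hoist out of the loop after inlining.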
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void TestBase<T>(Span<T> s, out Span<int> s2) where T : struct
    {
        s2 = CastBase<T, int>(s);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void TestDiff<T>(Span<T> s, out Span<int> s2) where T : struct
    {
        s2 = CastDiff<T, int>(s);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe Span<TTo> CastDiff<TFrom, TTo>(Span<TFrom> span)
        where TFrom : struct
        where TTo : struct
    {
        if (typeof(TFrom) == typeof(TTo))
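            // Taking the address of `span` forces the JIT to spill it to the
            // stack and reload it, which is the suspected source of the stall
            // discussed above.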
            return *(Span<TTo>*)&span;

        // Use unsigned integers - unsigned division by constant (especially by power of 2)
        // and checked casts are faster and smaller.
        uint fromSize = (uint)Unsafe.SizeOf<TFrom>();
        uint toSize = (uint)Unsafe.SizeOf<TTo>();
        uint fromLength = (uint)span.Length;
        int toLength;
        if (fromSize == toSize)
        {
            // Special case for same size types - `(ulong)fromLength * (ulong)fromSize / (ulong)toSize`
            // should be optimized to just `length` but the JIT doesn't do that today.
            toLength = (int)fromLength;
        }
        else if (fromSize == 1)
        {
            // Special case for byte sized TFrom - `(ulong)fromLength * (ulong)fromSize / (ulong)toSize`
            // becomes `(ulong)fromLength / (ulong)toSize` but the JIT can't narrow it down to `int`
            // and can't eliminate the checked cast. This also avoids a 32 bit specific issue,
            // the JIT can't eliminate long multiply by 1.
            toLength = (int)(fromLength / toSize);
        }
        else
        {
            // Ensure that casts are done in such a way that the JIT is able to "see"
            // the uint->ulong casts and the multiply together so that on 32 bit targets
            // 32x32to64 multiplication is used.
            ulong toLengthUInt64 = (ulong)fromLength * (ulong)fromSize / (ulong)toSize;
            toLength = checked((int)toLengthUInt64);
        }

        return MemoryMarshal.CreateSpan(ref Unsafe.As<TFrom, TTo>(ref MemoryMarshal.GetReference(span)), toLength);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static unsafe Span<TTo> CastBase<TFrom, TTo>(Span<TFrom> span)
        where TFrom : struct
        where TTo : struct
    {
        // Use unsigned integers - unsigned division by constant (especially by power of 2)
        // and checked casts are faster and smaller.
        uint fromSize = (uint)Unsafe.SizeOf<TFrom>();
        uint toSize = (uint)Unsafe.SizeOf<TTo>();
        uint fromLength = (uint)span.Length;
        int toLength;
        if (fromSize == toSize)
        {
            // Special case for same size types - `(ulong)fromLength * (ulong)fromSize / (ulong)toSize`
            // should be optimized to just `length` but the JIT doesn't do that today.
            toLength = (int)fromLength;
        }
        else if (fromSize == 1)
        {
            // Special case for byte sized TFrom - `(ulong)fromLength * (ulong)fromSize / (ulong)toSize`
            // becomes `(ulong)fromLength / (ulong)toSize` but the JIT can't narrow it down to `int`
            // and can't eliminate the checked cast. This also avoids a 32 bit specific issue,
            // the JIT can't eliminate long multiply by 1.
            toLength = (int)(fromLength / toSize);
        }
        else
        {
            // Ensure that casts are done in such a way that the JIT is able to "see"
            // the uint->ulong casts and the multiply together so that on 32 bit targets
            // 32x32to64 multiplication is used.
            ulong toLengthUInt64 = (ulong)fromLength * (ulong)fromSize / (ulong)toSize;
            toLength = checked((int)toLengthUInt64);
        }

        return MemoryMarshal.CreateSpan(ref Unsafe.As<TFrom, TTo>(ref MemoryMarshal.GetReference(span)), toLength);
    }
}
BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19045.4170/22H2/2022Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.203
  [Host]     : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.3 (8.0.324.11423), X64 RyuJIT AVX2
| Method    | Mean     | Error    | StdDev   |
|---------- |---------:|---------:|---------:|
| BenchBase | 12.96 us | 0.109 us | 0.102 us |
| BenchDiff | 45.36 us | 0.146 us | 0.122 us |
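For illustration, one way to write a same-type fast path without taking the span's address (and thus without the stack spill) is to rebuild the span from its parts. CastNoSpill below is a hypothetical variant sketched here for comparison, not code from the PR:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Span<TTo> CastNoSpill<TFrom, TTo>(Span<TFrom> span)
    where TFrom : struct
    where TTo : struct
{
    if (typeof(TFrom) == typeof(TTo))
    {
        // Reconstruct the span from its reference and length instead of
        // reinterpreting the local through a pointer; no local has its
        // address taken, so nothing needs to be spilled to the stack.
        return MemoryMarshal.CreateSpan(
            ref Unsafe.As<TFrom, TTo>(ref MemoryMarshal.GetReference(span)),
            span.Length);
    }

    return MemoryMarshal.Cast<TFrom, TTo>(span);
}

Note this is essentially the same code the fromSize == toSize path already produces, so whether a dedicated same-type branch buys anything would need measurement.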

xtqqczze (Contributor, Author) commented Apr 8, 2024

@jakobbotsch Thanks for the BenchmarkDotNet example. Is this issue specific to XArch?

BenchmarkDotNet v0.13.12, macOS Sonoma 14.4 (23E214) [Darwin 23.4.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.203
  [Host]     : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD
  Job-UYGVGG : .NET 8.0.3 (8.0.324.11423), Arm64 RyuJIT AdvSIMD

Runtime=.NET 8.0  Toolchain=net80  
| Method    | Mean     | Error     | StdDev    |
|---------- |---------:|----------:|----------:|
| BenchBase | 9.612 us | 0.0354 us | 0.0314 us |
| BenchDiff | 9.627 us | 0.0495 us | 0.0463 us |

@@ -122,6 +122,9 @@ public static unsafe ReadOnlySpan<byte> AsBytes<T>(ReadOnlySpan<T> span)
 if (RuntimeHelpers.IsReferenceOrContainsReferences<TTo>())
     ThrowHelper.ThrowInvalidTypeWithPointersNotSupported(typeof(TTo));

+if (typeof(TFrom) == typeof(TTo))
A Member commented on this diff:

This change would be a regression for shared generic code. TFrom and TTo may require generic dictionary lookups.

The identical-types case should already be covered by the if (fromSize == toSize) path below. Is there anything preventing the JIT from generating optimal code for that path?
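To illustrate the shared-generics concern, here is a minimal sketch using a hypothetical reference-type wrapper (reference-type instantiations all share one compiled body, unlike struct instantiations):

// Hypothetical example: all reference-type instantiations of Describer<T>
// share a single compiled body, so typeof(T) is not a JIT-time constant
// there; it must be fetched through a generic dictionary lookup at runtime.
public class Describer<T> where T : class
{
    public bool IsString() => typeof(T) == typeof(string);
}

// Describer<string>.IsString() and Describer<object>.IsString() execute the
// same machine code, so the comparison cannot fold away as it does in the
// specialized bodies generated per struct instantiation.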

github-actions bot locked and limited conversation to collaborators May 10, 2024
Labels: area-System.Runtime.InteropServices, community-contribution (indicates that the PR has been added by a community member)
5 participants