Revise how constant SIMD vectors are defined in BCL #44115

Closed
EgorBo opened this issue Nov 1, 2020 · 13 comments · Fixed by #54827
Labels: area-System.Runtime.Intrinsics, good first issue, help wanted, tenet-performance
Milestone: Future

@EgorBo (Member) commented Nov 1, 2020

From #44111 (comment)

There are 3 patterns we currently use across the BCL for const vectors:

    //
    // Case 1: Plain Vector.Create
    public Vector128<byte> Case1(Vector128<byte> vec)
    {
        Vector128<byte> mask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

        return Ssse3.Shuffle(vec, mask);
    }


    //
    // Case 1.1: Plain Vector.Create as argument of some SIMD instruction directly
    // Should be the same codegen as for Case1 ^ (spoiler: it's not. Forward Substitution? see #4655)
    public Vector128<byte> Case1_1(Vector128<byte> vec)
    {
        return Ssse3.Shuffle(vec,
                    Vector128.Create( // used without "mask" local as in Case1 ^
                        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
                        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));
    }


    //
    // Case 2: static readonly Vector
    private static readonly Vector128<byte> s_mask = Vector128.Create(
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
    public Vector128<byte> Case2(Vector128<byte> vec)
    {
        // we also often save it to a local first (e.g. before loops)
        return Ssse3.Shuffle(vec, s_mask);
    }


    //
    // Case 3: Roslyn's hack
    private static ReadOnlySpan<byte> Mask => new byte[] {
        0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
        0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF };
    public Vector128<byte> Case3(Vector128<byte> vec)
    {
        return Ssse3.Shuffle(vec, Unsafe.ReadUnaligned<Vector128<byte>>(
            ref MemoryMarshal.GetReference(Mask)));
    }

Here is the current codegen for these cases:

; Method Case1
G_M46269_IG01:
       vzeroupper 
G_M46269_IG02:
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]        ; loaded from the data section, OK
       vmovupd  xmm1, xmmword ptr [r8]
       vpshufb  xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M46269_IG03:
       ret      
RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here
; Total bytes of code: 29



; Method Case1_1
G_M7091_IG01:
       vzeroupper 
G_M7091_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vpshufb  xmm0, xmm0, xmmword ptr [reloc @RWD00]  ; loaded as part of vpshufb without
                                                        ; additional registers from the data section - PERFECT!!
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M7091_IG03:
       ret      
RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here
; Total bytes of code: 25



; Method Case2
G_M31870_IG01:
       push     rsi
       sub      rsp, 48
       vzeroupper 
       mov      rsi, rdx
G_M31870_IG02:
       vmovupd  xmm0, xmmword ptr [r8]
       vmovupd  xmmword ptr [rsp+20H], xmm0
       mov      rcx, 0xD1FFAB1E
       mov      edx, 2
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE  ; static initialization (in some cases can be eliminated by jit but still...)
       mov      rax, 0xD1FFAB1E                          ; <- additional mov
       mov      rax, gword ptr [rax]
       vmovupd  xmm0, xmmword ptr [rsp+20H]
       vpshufb  xmm0, xmm0, xmmword ptr [rax+8]
       vmovupd  xmmword ptr [rsi], xmm0
       mov      rax, rsi
G_M31870_IG03:
       add      rsp, 48
       pop      rsi
       ret      
; Total bytes of code: 80



; Method Case3
G_M23615_IG01:
       vzeroupper 
G_M23615_IG02:
       mov      rax, 0xD1FFAB1E                          ; <- additional mov
                                                         ; kind of makes sense if the same mask is used from different methods
                                                         ; but the C# code looks a bit ugly
       vmovupd  xmm0, xmmword ptr [rax]
       vmovupd  xmm1, xmmword ptr [r8]
       vpshufb  xmm0, xmm1, xmm0
       vmovupd  xmmword ptr [rdx], xmm0
       mov      rax, rdx
G_M23615_IG03:
       ret      
; Total bytes of code: 35

The first case used to be avoided due to some codegen issues, but it looks like those have been resolved (e.g. the JIT now saves such vectors into the data section, does value numbering for SIMDs including constant vectors, performs CSE, etc.; see #31834?), so we now have a lot of static readonly fields that we can revise and convert into Case 1 (1.1) style where possible, maybe even if we need to duplicate them in different methods.
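For example, here is a hedged sketch (hypothetical method names, not taken from any of the files listed below; assumes Ssse3.IsSupported has been checked) of such a conversion where two methods previously shared one static readonly mask:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    internal static class ConversionSketch
    {
        // Before (Case 2): one shared field, each use paid the static-initialization path shown above.
        //   private static readonly Vector128<byte> s_mask = Vector128.Create(...);

        // After (Case 1.1): each method creates the constant inline; the JIT emits it into the
        // data section, so only the constant data is duplicated, not any initialization work.
        public static Vector128<byte> EncodeStep(Vector128<byte> vec) =>
            Ssse3.Shuffle(vec, Vector128.Create(
                0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
                0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));

        public static Vector128<byte> DecodeStep(Vector128<byte> vec) =>
            Ssse3.Shuffle(vec, Vector128.Create(
                0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
                0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF));
    }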

Places to revise:

  1. Base64Encoder.cs
  2. Base64Decoder.cs
  3. Sse2Helper.cs
  4. Ssse3Helper.cs
  5. BitArray.cs
  6. Maybe there are more
  7. Scan other repos: aspnetcore, ML.NET, etc

Known limitations for Case1:

  • See Case1.1 comment
  • The JIT doesn't hoist Vector.Create out of loop bodies yet (would be nice to have); see the sketch below
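A minimal sketch of working around that limitation (hypothetical class and method names, not BCL code; assumes Ssse3.IsSupported has been checked by the caller): the constant is created once before the loop instead of inside the loop body.

    using System;
    using System.Runtime.CompilerServices;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    internal static class LoopSketch
    {
        public static void ShuffleBlocks(Span<byte> data)
        {
            // Created once up front; per the limitation above, a Vector128.Create
            // written inside the loop body is not hoisted out by the JIT yet.
            Vector128<byte> mask = Vector128.Create(
                0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
                0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);

            for (int i = 0; i + 16 <= data.Length; i += 16)
            {
                Vector128<byte> v = Unsafe.ReadUnaligned<Vector128<byte>>(ref data[i]);
                Unsafe.WriteUnaligned(ref data[i], Ssse3.Shuffle(v, mask));
            }
        }
    }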

/cc @stephentoub @GrabYourPitchforks @benaadams @tannergooding

@EgorBo added the tenet-performance label Nov 1, 2020
@Dotnet-GitSync-Bot (Collaborator) commented:

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot added the untriaged label Nov 1, 2020
@benaadams (Member) commented:

RWD00  	dq	FF01FFFFFF00FFFFh, FF03FFFFFF02FFFFh    ; <-- it's here

Is this aligned?

@tannergooding (Member) commented:

Both the 16- and 32-byte SIMD constants will be properly aligned if we aren't emitting "small" code:
https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lowerxarch.cpp#L716-L717

@tannergooding (Member) commented Nov 1, 2020

I don't believe we currently have the logic to allow those to be folded into the corresponding operand (which is always safe for AVX+ and safe for non-small code for SSE+).

Nevermind, the example above is already folding 👍

@EgorBo (Member, Author) commented Nov 1, 2020

@tannergooding there is a minor issue with alignment:

    public static void Test(ref Vector128<double> a, ref Vector256<double> b)
    {
        a = Vector128.Create(1.0, 2.0);
        b = Vector256.Create(11.0, 12.0, 13.0, 14.0);
    }
; Method Egor:Test(byref,byref)
G_M24708_IG01:
       vzeroupper 
G_M24708_IG02:
       vmovupd  xmm0, xmmword ptr [reloc @RWD00]
       vmovupd  xmmword ptr [rcx], xmm0
       vmovupd  ymm0, ymmword ptr[reloc @RWD32]
       vmovupd  ymmword ptr[rdx], ymm0
G_M24708_IG03:
       vzeroupper 
       ret      
RWD00  	dq	3FF0000000000000h, 4000000000000000h
RWD16  	dd	00000000h, 00000000h, 00000000h, 00000000h ;; <--- padding
RWD32  	dq	4026000000000000h, 4028000000000000h, 402A000000000000h, 402C000000000000h
; Total bytes of code: 31

It could be:

RWD00  	dq	4026000000000000h, 4028000000000000h, 402A000000000000h, 402C000000000000h
RWD32  	dq	3FF0000000000000h, 4000000000000000h

which would be slightly more compact.

@tannergooding (Member) commented:

Right, we don't currently do any sorting of values, and I don't know how amenable the existing data structures are to having that happen.

@ghost commented Nov 2, 2020

Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@gfoidl (Member) commented Nov 2, 2020

We need to take into account that on older runtimes there may be a regression from using Vector128.Create (the JIT was only improved for this recently). E.g. System.Memory (for Base64) could regress when the package gets updated for a .NET Core 3.1 target.
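A possible way to handle that, sketched here under the assumption that the shared source multi-targets netcoreapp3.1 and a newer TFM (hypothetical class and member names, not actual System.Memory code): keep the cached field on the old target and use Vector128.Create elsewhere.

    using System.Runtime.Intrinsics;

    internal static class MaskSketch
    {
    #if NETCOREAPP3_1
        // The older JIT handles the cached field better, so keep it there to avoid a regression.
        internal static readonly Vector128<byte> ShuffleMask = Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
    #else
        // Newer JITs turn this into a data-section constant at the call site (Case 1/1.1 above).
        internal static Vector128<byte> ShuffleMask => Vector128.Create(
            0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 1, 0xFF,
            0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF, 3, 0xFF);
    #endif
    }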

@tannergooding added the good first issue and help wanted labels and removed the untriaged label Jun 17, 2021
@tannergooding added this to the Future milestone Jun 17, 2021
@tannergooding (Member) commented:

Marked as easy and up-for-grabs.

@gfoidl (Member) commented Jun 18, 2021

Do we need to care about the older runtimes or just use the current best pattern?

@tannergooding (Member) commented:

This mostly applies to hardware intrinsics, which are .NET Core 3.1+ only, so I don't believe we have much to be concerned about here (I don't believe we are compiling for netcoreapp3.1 and net6.0 simultaneously).

If we do target downlevel TFMs, then we should consider the perf implications, as .NET Core 3.1 will have continued support through Dec 2022: https://dotnet.microsoft.com/platform/support/policy/dotnet-core

@gfoidl (Member) commented Jun 18, 2021

I can create a PR (next week or so), and then we can discuss more over there?!

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 28, 2021
@gfoidl (Member) commented Jun 28, 2021

Scan other repos: ... ML.NET, etc

ML.NET's highest .NET version (for the relevant project) is .NET Core 3.1, so the current code over there seems to be the best option. Using Vector{128|256}.Create would be a de-optimization.

Places:
https://github.com/dotnet/machinelearning/blob/1b3cb77b9752fe4279376039ee20fc42822e4845/src/Microsoft.ML.CpuMath/AvxIntrinsics.cs#L48
https://github.com/dotnet/machinelearning/blob/1b3cb77b9752fe4279376039ee20fc42822e4845/src/Microsoft.ML.CpuMath/FactorizationMachine/AvxIntrinsics.cs#L15

ASP.NET Core already uses Vector{128|256}.Create, so there is no need to change anything over there.

@ghost closed this as completed in #54827 on Jul 12, 2021
@ghost removed the in-pr label Jul 12, 2021
@ghost locked as resolved and limited the conversation to collaborators Aug 11, 2021
This issue was closed.