Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64: Implement VectorTableLookup/VectorTableLookupExtension intrinsinsic + Consecutive registers support #80297

Merged
merged 136 commits into from
Apr 4, 2023

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Jan 6, 2023

This adds support for remaining variant of VectorTableLookup() intrinsics that takes tuples for table parameter. The tuple can have 2, 3 or 4 Vector128<byte> values. These APIs has a requirement to allocate consecutive registers for the 1st operand value. This PR adds support for that as well.

Details

During importer, depending on the form of 1st argument to the VectorTableLookup(), create a FIELD_LIST node with as many fields as number of entries in the ValueTuple of that argument. This node is further decomposed into local var.

In register allocation, during buildInterval(), we see this intrinsic and build as many refPositions as number of fields in the ValueTuple. We mark each of them with a flag needsConsecutive and for the first refPosition, we also set number of consecutive registers the entire series need (again, this is same as number of fields of ValueTuple in 1st argument). We also save the series in a newly added nextConsecutiveRefPositionMap map such that we could go to the next refposition in the series easily. (Note that RefPosition doesn't have a next pointer and adding that would have memory cost, hence I chose to add entries in a map).

During register assignment, once the "first" refPosition gets a register, it sets the registerAssignment (Prior to the allocation pass, registerAssignment captures the valid registers for a RefPosition) of each subsequent RefPositions of the series to the corresponding consecutive register(s) that comes after. Register assignment honors that decision (like it does today) and assigns those registers to the subsequent refpositions of that series.

Example:

public static Vector128<byte> Test(float f)
    {
        var a = Produce1();
        var b = Produce2();
        var c = a + b;      
        var d = c + a;      
        var e = d + b;     
        d = AdvSimd.Arm64.VectorTableLookup((d, e, e, b), c); 
G_M53676_IG02:
            movz    x0, #0xD1FFAB1E      // code for helloworld:Produce1():System.Runtime.Intrinsics.Vector128`1[ubyte]
            movk    x0, #0xD1FFAB1E LSL #16
            movk    x0, #0xD1FFAB1E LSL #32
            ldr     x0, [x0]
            blr     x0
            str     q0, [fp, #0x20]	// [V01 loc0]
            movz    x0, #0xD1FFAB1E      // code for helloworld:Produce2():System.Runtime.Intrinsics.Vector128`1[ubyte]
            movk    x0, #0xD1FFAB1E LSL #16
            movk    x0, #0xD1FFAB1E LSL #32
            ldr     x0, [x0]
            blr     x0
            ldr     q16, [fp, #0x20]	// [V01 loc0]
            add     v17.16b, v16.16b, v0.16b
            str     q17, [fp, #0x10]	// [V03 loc2]
            add     v16.16b, v17.16b, v16.16b
            add     v18.16b, v16.16b, v0.16b
            mov     v17.16b, v18.16b
            mov     v19.16b, v0.16b
            ldr     q20, [fp, #0x10]	// [V03 loc2]
            tbl     v16.16b, {v16.16b, v17.16b, v18.16b, v19.16b}, v20.16b
            add     v0.16b, v0.16b, v16.16b

TODO:

  • Explain the design in PR description
  • Fix the summary docs for APIs
  • Get the ValueTuple<> APIs approved Do not expose the API through ref.
  • Add test coverage
  • Analyze and reduce throughput impact
  • Handle the round about case where Rn was assigned and the next refposition should get R0.

Contributes to #1277, #81599

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Jan 6, 2023
@ghost ghost assigned kunalspathak Jan 6, 2023
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented Jan 6, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

This adds support for remaining variant of VectorTableLookup() intrinsics that takes tuples for table parameter. The tuple can have 2, 3 or 4 Vector128<byte> values. These APIs has a requirement to allocate consecutive registers for the 1st operand value. This PR adds support for that as well.

TODO:

  • Fix the summary docs for APIs
  • Get the ValueTuple<> APIs approved
  • Add test coverage

Contributes to #1277

Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr, new-api-needs-documentation

Milestone: -

@kunalspathak
Copy link
Member Author

Here are few examples of C# and corresponding generated code: https://gist.github.com/kunalspathak/73922d385ea2642192c6167fa4a82778

@kunalspathak kunalspathak mentioned this pull request Mar 30, 2023
@kunalspathak
Copy link
Member Author

I realized that in my recent refactoring, I removed the heuristics that picks the series based on least registers to spill. I will add it back.

Done.

@kunalspathak
Copy link
Member Author

/azp runtime-coreclr jitstressregs

@azure-pipelines
Copy link

Command 'runtime-coreclr' is not supported by Azure Pipelines.

Supported commands
  • help:
    • Get descriptions, examples and documentation about supported commands
    • Example: help "command_name"
  • list:
    • List all pipelines for this repository using a comment.
    • Example: "list"
  • run:
    • Run all pipelines or specific pipelines for this repository using a comment. Use this command by itself to trigger all related pipelines, or specify specific pipelines to run.
    • Example: "run" or "run pipeline_name, pipeline_name, pipeline_name"
  • where:
    • Report back the Azure DevOps orgs that are related to this repository and org
    • Example: "where"

See additional documentation.

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstressregs

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vargaz
Copy link
Contributor

vargaz commented Mar 31, 2023

This should fix/improve mono support:
vargaz@f247b3c

@kunalspathak
Copy link
Member Author

This should fix/improve mono support: vargaz@f247b3c

Thanks a lot @vargaz for quick fix. Appreciate it.

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstressregs

Looks like all the VectorTableLookupExtension cases are failing because they are operating on default values. I will investigate and submit a fix.

@kunalspathak
Copy link
Member Author

/azp run runtime-coreclr jitstressregs

Looks like all the VectorTableLookupExtension cases are failing because they are operating on default values. I will investigate and submit a fix.

It seems there is a extra mov added from defaultValues to row1 which is why we are seeing the mismatch:

 defaultValues: (113, 93, 212, 27, 58, 37, 6, 118, 139, 118, 173, 30, 18, 53, 165, 236)
 firstOp: (100, 158, 153, 151, 135, 230, 102, 106, 199, 113, 65, 180, 152, 242, 233, 25)
 secondOp: (191, 98, 103, 254, 0, 35, 2, 121, 98, 115, 15, 220, 239, 184, 9, 46)
 indices: (33, 8, 7, 9, 0, 8, 34, 36, 2, 25, 10, 28, 19, 37, 4, 36)
 result: (113, 139, 118, 118, 113, 139, 6, 118, 212, 115, 173, 239, 254, 53, 58, 236)

image

Will investigate more.

@kunalspathak
Copy link
Member Author

kunalspathak commented Apr 1, 2023

It seems there is a extra mov added from defaultValues to row1 which is why we are seeing the mismatch:

This was because we were overwriting the targetReg with the operand of the operation. I had delayRegFree logic for VectorTableLookupExtension but thought was not necessary at one point. Added it back.

@kunalspathak
Copy link
Member Author

@BruceForstall - this should be ready to review again.

@BruceForstall
Copy link
Member

Diffs

Interesting range of TP diffs, including improvements for non-arm64 code.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kunalspathak
Copy link
Member Author

failures are known issues. Thanks @BruceForstall for the review.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants