Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jit: Sub-optimal code when function is inlined #32415

Closed
Thealexbarney opened this issue Feb 17, 2020 · 6 comments · Fixed by #88090
Closed

Jit: Sub-optimal code when function is inlined #32415

Thealexbarney opened this issue Feb 17, 2020 · 6 comments · Fixed by #88090
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release
Milestone

Comments

@Thealexbarney
Copy link

Thealexbarney commented Feb 17, 2020

As I created this example code I realized that it happens in similar circumstances as #32414, but I don't know if they have the same cause or not.

Given the following code:
SharpLab link

public readonly ref struct Wrapper
{
    private readonly ReadOnlySpan<byte> _buffer;

    public int Length => _buffer.Length;
    public byte this[int i] => _buffer[i];
    public byte GetUnsafe(int i) => Unsafe.Add(ref MemoryMarshal.GetReference(_buffer), i);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int SumFirst4(Wrapper p)
{
    return p.GetUnsafe(0) + p.GetUnsafe(1) + p.GetUnsafe(2) + p.GetUnsafe(3);
}

public static int SumFirst4Caller(Wrapper p) => SumFirst4(p);

SumFirst4 produces this. There's some weird stuff going on with register allocation, but nothing too bad.

G_M29037_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M29037_IG02:
       mov      rax, bword ptr [rcx]
       mov      rdx, rax
       movzx    rdx, byte  ptr [rdx]
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+1]
       add      edx, ecx
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+2]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+3]
       add      eax, edx
						;; bbWeight=1    PerfScore 11.50
G_M29037_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 34

SumFirst4Caller produces this, copying the span to the stack and re-dereferencing the pointer for every element accessed.

G_M45217_IG01:
       sub      rsp, 24
       vzeroupper 
       xor      rax, rax
       mov      qword ptr [rsp+08H], rax
						;; bbWeight=1    PerfScore 2.50
G_M45217_IG02:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+08H], xmm0
						;; bbWeight=1    PerfScore 3.00
G_M45217_IG03:
       lea      rax, bword ptr [rsp+08H]
       mov      rax, bword ptr [rax]
       movzx    rax, byte  ptr [rax]
       lea      rdx, bword ptr [rsp+08H]
       mov      rdx, bword ptr [rdx]
       movzx    rdx, byte  ptr [rdx+1]
       add      eax, edx
       lea      rdx, bword ptr [rsp+08H]
       mov      rdx, bword ptr [rdx]
       movzx    rdx, byte  ptr [rdx+2]
       add      eax, edx
       lea      rdx, bword ptr [rsp+08H]
       mov      rdx, bword ptr [rdx]
       movzx    rdx, byte  ptr [rdx+3]
       add      eax, edx
						;; bbWeight=1    PerfScore 18.75
G_M45217_IG04:
       add      rsp, 24
       ret      
						;; bbWeight=1    PerfScore 1.25

; Total bytes of code 82

For reference, adding this extension and swapping Wrapper for ReadOnlySpan<byte> gives the following code for both SumFirst4 and SumFirst4Caller.

public static byte GetUnsafe(this ReadOnlySpan<byte> span, int i)
{
    return Unsafe.Add(ref MemoryMarshal.GetReference(span), i);
}

G_M57413_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M57413_IG02:
       mov      rax, bword ptr [rcx]
       movzx    rdx, byte  ptr [rax]
       movzx    rcx, byte  ptr [rax+1]
       add      edx, ecx
       movzx    rcx, byte  ptr [rax+2]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+3]
       add      eax, edx
						;; bbWeight=1    PerfScore 10.75
G_M57413_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 25

category:cq
theme:structs
skill-level:expert
cost:large
impact:large

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 17, 2020
@AndyAyersMS
Copy link
Member

Likely a similar issue as seen in #32414, but will take a deeper look when I have a chance.

@AndyAyersMS
Copy link
Member

Need to investigate further so will mark as 5.0 for now.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 6, 2020
@AndyAyersMS AndyAyersMS added this to the 5.0 milestone May 6, 2020
@AndyAyersMS
Copy link
Member

Issue here is that we don't promote structs of structs, so the Wrapper struct argument doesn't get promoted, and this leads directly to the various codegen issues.

We won't be able to address this in 5.0.

I don't see a pre-existing issue for general struct promotion. @erozenfeld do you know of one?

@AndyAyersMS AndyAyersMS modified the milestones: 5.0, Future Jun 4, 2020
@erozenfeld
Copy link
Member

We have several issues tracking specific improvements for promoting struct-typed fields: #7441, #7576 ,#6707. #6494 was an issue for general struct promotion but it's been closed.

@jakobbotsch
Copy link
Member

Codegen on main:

; Method Program:SumFirst4Caller(Program+Wrapper):int
G_M57344_IG01:
       sub      rsp, 24
       vzeroupper 
						;; size=7 bbWeight=1    PerfScore 1.25

G_M57344_IG02:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+08H], xmm0
						;; size=10 bbWeight=1    PerfScore 5.00

G_M57344_IG03:
       mov      rax, bword ptr [rsp+08H]
       mov      rdx, rax
       movzx    rdx, byte  ptr [rdx]
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+01H]
       add      edx, ecx
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+02H]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+03H]
       add      eax, edx
						;; size=35 bbWeight=1    PerfScore 10.50

G_M57344_IG04:
       add      rsp, 24
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25
; Total bytes of code: 57

Looks much better but still have the unnecessary copy. Generalized promotion would probably be able to get rid of this.

@jakobbotsch
Copy link
Member

jakobbotsch commented Jun 16, 2023

With physical promotion we can potentially get the following codegen for both methods:

; Assembly listing for method Program:SumFirst4(Program+Wrapper):int
; Emitting BLENDED_CODE for X64 with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 8 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx         ld-addr-op single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V02 tmp1         [V02,T02] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;  V03 tmp2         [V03,T03] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;  V04 tmp3         [V04,T04] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V06 tmp5         [V06    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V07 tmp6         [V07    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V08 tmp7         [V08    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V09 tmp8         [V09    ] (  0,  0   )   byref  ->  zero-ref    single-def V05._reference(offs=0x00) P-INDEP "field V05._reference (fldOffset=0x0)"
;* V10 tmp9         [V10    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V05._length (fldOffset=0x8)"
;* V11 tmp10        [V11    ] (  0,  0   )   byref  ->  zero-ref    single-def V06._reference(offs=0x00) P-INDEP "field V06._reference (fldOffset=0x0)"
;* V12 tmp11        [V12    ] (  0,  0   )     int  ->  zero-ref    V06._length(offs=0x08) P-INDEP "field V06._length (fldOffset=0x8)"
;* V13 tmp12        [V13    ] (  0,  0   )   byref  ->  zero-ref    single-def V07._reference(offs=0x00) P-INDEP "field V07._reference (fldOffset=0x0)"
;* V14 tmp13        [V14    ] (  0,  0   )     int  ->  zero-ref    V07._length(offs=0x08) P-INDEP "field V07._length (fldOffset=0x8)"
;* V15 tmp14        [V15    ] (  0,  0   )   byref  ->  zero-ref    single-def V08._reference(offs=0x00) P-INDEP "field V08._reference (fldOffset=0x0)"
;* V16 tmp15        [V16    ] (  0,  0   )     int  ->  zero-ref    V08._length(offs=0x08) P-INDEP "field V08._length (fldOffset=0x8)"
;  V17 tmp16        [V17,T01] (  5,  5   )   byref  ->  rax         single-def "V00.[000..008)"
;* V18 tmp17        [V18,T05] (  0,  0   )     int  ->  zero-ref    "V00.[008..012)"
;
; Lcl frame size = 0

G_M29397_IG01:  ;; offset=0000H
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M29397_IG02:  ;; offset=0000H
       mov      rax, bword ptr [rcx]
       movzx    rdx, byte  ptr [rax]
       movzx    rcx, byte  ptr [rax+01H]
       add      edx, ecx
       movzx    rcx, byte  ptr [rax+02H]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+03H]
       add      eax, edx
                                                ;; size=24 bbWeight=1 PerfScore 10.75
G_M29397_IG03:  ;; offset=0018H
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 25, prolog size 0, PerfScore 14.25, instruction count 9, allocated bytes for code 25 (MethodHash=92b18d2a) for method Program:SumFirst4(Program+Wrapper):int

It requires two changes:

  1. JIT: Add support for induced field accesses in physical promotion #87410 needs to be expanded a bit (in a separate PR) to realize that it should run the induce step even when there are no physical promotions if there are assignments between regularly promoted structs and candidates. Otherwise we just exit immediately after the first pass of promotions because we have no LCL_FLD uses.
  2. Even with that, we need to resolve the following TODO to get the good codegen:
    if (size == 0)
    {
    // TODO-CQ: Recursively handle padding in sub structures
    // here. Might be better to introduce a single JIT-EE call
    // to query the significant segments -- that would also be
    // usable by R2R even outside the version bubble in many
    // cases.
    size = compHnd->getClassSize(fieldClassHandle);
    assert(size != 0);
    }

I don't think these tasks should be difficult so I will grab this one @AndyAyersMS.

@jakobbotsch jakobbotsch modified the milestones: Future, 8.0.0 Jun 16, 2023
@jakobbotsch jakobbotsch added the Priority:2 Work that is important, but not critical for the release label Jun 20, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 27, 2023
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants