Jit: Sub-optimal code when function is inlined #32415

Thealexbarney · 2020-02-17T00:06:22Z

As I created this example code I realized that it happens in similar circumstances as #32414, but I don't know if they have the same cause or not.

Given the following code:
SharpLab link

public readonly ref struct Wrapper
{
    private readonly ReadOnlySpan<byte> _buffer;

    public int Length => _buffer.Length;
    public byte this[int i] => _buffer[i];
    public byte GetUnsafe(int i) => Unsafe.Add(ref MemoryMarshal.GetReference(_buffer), i);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int SumFirst4(Wrapper p)
{
    return p.GetUnsafe(0) + p.GetUnsafe(1) + p.GetUnsafe(2) + p.GetUnsafe(3);
}

public static int SumFirst4Caller(Wrapper p) => SumFirst4(p);

SumFirst4 produces this. There's some weird stuff going on with register allocation, but nothing too bad.

G_M29037_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M29037_IG02:
       mov      rax, bword ptr [rcx]
       mov      rdx, rax
       movzx    rdx, byte  ptr [rdx]
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+1]
       add      edx, ecx
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+2]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+3]
       add      eax, edx
						;; bbWeight=1    PerfScore 11.50
G_M29037_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 34

SumFirst4Caller produces this, copying the span to the stack and re-dereferencing the pointer for every element accessed.

G_M45217_IG01:
       sub      rsp, 24
       vzeroupper 
       xor      rax, rax
       mov      qword ptr [rsp+08H], rax
						;; bbWeight=1    PerfScore 2.50
G_M45217_IG02:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+08H], xmm0
						;; bbWeight=1    PerfScore 3.00
G_M45217_IG03:
       lea      rax, bword ptr [rsp+08H]
       mov      rax, bword ptr [rax]
       movzx    rax, byte  ptr [rax]
       lea      rdx, bword ptr [rsp+08H]
       mov      rdx, bword ptr [rdx]
       movzx    rdx, byte  ptr [rdx+1]
       add      eax, edx
       lea      rdx, bword ptr [rsp+08H]
       mov      rdx, bword ptr [rdx]
       movzx    rdx, byte  ptr [rdx+2]
       add      eax, edx
       lea      rdx, bword ptr [rsp+08H]
       mov      rdx, bword ptr [rdx]
       movzx    rdx, byte  ptr [rdx+3]
       add      eax, edx
						;; bbWeight=1    PerfScore 18.75
G_M45217_IG04:
       add      rsp, 24
       ret      
						;; bbWeight=1    PerfScore 1.25

; Total bytes of code 82

For reference, adding this extension and swapping Wrapper for ReadOnlySpan<byte> gives the following code for both SumFirst4 and SumFirst4Caller.

public static byte GetUnsafe(this ReadOnlySpan<byte> span, int i)
{
    return Unsafe.Add(ref MemoryMarshal.GetReference(span), i);
}

G_M57413_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M57413_IG02:
       mov      rax, bword ptr [rcx]
       movzx    rdx, byte  ptr [rax]
       movzx    rcx, byte  ptr [rax+1]
       add      edx, ecx
       movzx    rcx, byte  ptr [rax+2]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+3]
       add      eax, edx
						;; bbWeight=1    PerfScore 10.75
G_M57413_IG03:
       ret      
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 25

category:cq
theme:structs
skill-level:expert
cost:large
impact:large

The text was updated successfully, but these errors were encountered:

AndyAyersMS · 2020-02-19T01:30:05Z

Likely a similar issue as seen in #32414, but will take a deeper look when I have a chance.

AndyAyersMS · 2020-05-06T18:34:54Z

Need to investigate further so will mark as 5.0 for now.

AndyAyersMS · 2020-06-04T18:28:09Z

Issue here is that we don't promote structs of structs, so the Wrapper struct argument doesn't get promoted, and this leads directly to the various codegen issues.

We won't be able to address this in 5.0.

I don't see a pre-existing issue for general struct promotion. @erozenfeld do you know of one?

erozenfeld · 2020-06-05T21:33:09Z

We have several issues tracking specific improvements for promoting struct-typed fields: #7441, #7576 ,#6707. #6494 was an issue for general struct promotion but it's been closed.

jakobbotsch · 2022-12-12T12:46:39Z

Codegen on main:

; Method Program:SumFirst4Caller(Program+Wrapper):int
G_M57344_IG01:
       sub      rsp, 24
       vzeroupper 
						;; size=7 bbWeight=1    PerfScore 1.25

G_M57344_IG02:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+08H], xmm0
						;; size=10 bbWeight=1    PerfScore 5.00

G_M57344_IG03:
       mov      rax, bword ptr [rsp+08H]
       mov      rdx, rax
       movzx    rdx, byte  ptr [rdx]
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+01H]
       add      edx, ecx
       mov      rcx, rax
       movzx    rcx, byte  ptr [rcx+02H]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+03H]
       add      eax, edx
						;; size=35 bbWeight=1    PerfScore 10.50

G_M57344_IG04:
       add      rsp, 24
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25
; Total bytes of code: 57

Looks much better but still have the unnecessary copy. Generalized promotion would probably be able to get rid of this.

jakobbotsch · 2023-06-16T20:21:30Z

With physical promotion we can potentially get the following codegen for both methods:

; Assembly listing for method Program:SumFirst4(Program+Wrapper):int
; Emitting BLENDED_CODE for X64 with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 8 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  6   )   byref  ->  rcx         ld-addr-op single-def
;# V01 OutArgs      [V01    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V02 tmp1         [V02,T02] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;  V03 tmp2         [V03,T03] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;  V04 tmp3         [V04,T04] (  2,  4   )     int  ->  rdx         "impAppendStmt"
;* V05 tmp4         [V05    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V06 tmp5         [V06    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V07 tmp6         [V07    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V08 tmp7         [V08    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;* V09 tmp8         [V09    ] (  0,  0   )   byref  ->  zero-ref    single-def V05._reference(offs=0x00) P-INDEP "field V05._reference (fldOffset=0x0)"
;* V10 tmp9         [V10    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V05._length (fldOffset=0x8)"
;* V11 tmp10        [V11    ] (  0,  0   )   byref  ->  zero-ref    single-def V06._reference(offs=0x00) P-INDEP "field V06._reference (fldOffset=0x0)"
;* V12 tmp11        [V12    ] (  0,  0   )     int  ->  zero-ref    V06._length(offs=0x08) P-INDEP "field V06._length (fldOffset=0x8)"
;* V13 tmp12        [V13    ] (  0,  0   )   byref  ->  zero-ref    single-def V07._reference(offs=0x00) P-INDEP "field V07._reference (fldOffset=0x0)"
;* V14 tmp13        [V14    ] (  0,  0   )     int  ->  zero-ref    V07._length(offs=0x08) P-INDEP "field V07._length (fldOffset=0x8)"
;* V15 tmp14        [V15    ] (  0,  0   )   byref  ->  zero-ref    single-def V08._reference(offs=0x00) P-INDEP "field V08._reference (fldOffset=0x0)"
;* V16 tmp15        [V16    ] (  0,  0   )     int  ->  zero-ref    V08._length(offs=0x08) P-INDEP "field V08._length (fldOffset=0x8)"
;  V17 tmp16        [V17,T01] (  5,  5   )   byref  ->  rax         single-def "V00.[000..008)"
;* V18 tmp17        [V18,T05] (  0,  0   )     int  ->  zero-ref    "V00.[008..012)"
;
; Lcl frame size = 0

G_M29397_IG01:  ;; offset=0000H
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M29397_IG02:  ;; offset=0000H
       mov      rax, bword ptr [rcx]
       movzx    rdx, byte  ptr [rax]
       movzx    rcx, byte  ptr [rax+01H]
       add      edx, ecx
       movzx    rcx, byte  ptr [rax+02H]
       add      edx, ecx
       movzx    rax, byte  ptr [rax+03H]
       add      eax, edx
                                                ;; size=24 bbWeight=1 PerfScore 10.75
G_M29397_IG03:  ;; offset=0018H
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 25, prolog size 0, PerfScore 14.25, instruction count 9, allocated bytes for code 25 (MethodHash=92b18d2a) for method Program:SumFirst4(Program+Wrapper):int

It requires two changes:

JIT: Add support for induced field accesses in physical promotion #87410 needs to be expanded a bit (in a separate PR) to realize that it should run the induce step even when there are no physical promotions if there are assignments between regularly promoted structs and candidates. Otherwise we just exit immediately after the first pass of promotions because we have no LCL_FLD uses.

Even with that, we need to resolve the following TODO to get the good codegen:

runtime/src/coreclr/jit/promotion.cpp

Lines 1116 to 1125 in d334f0c

    
           if (size == 0) 
        
           { 
        
               // TODO-CQ: Recursively handle padding in sub structures 
        
               // here. Might be better to introduce a single JIT-EE call 
        
               // to query the significant segments -- that would also be 
        
               // usable by R2R even outside the version bubble in many 
        
               // cases. 
        
               size = compHnd->getClassSize(fieldClassHandle); 
        
               assert(size != 0); 
        
           }

I don't think these tasks should be difficult so I will grab this one @AndyAyersMS.

Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Feb 17, 2020

AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label May 6, 2020

AndyAyersMS added this to the 5.0 milestone May 6, 2020

BruceForstall assigned AndyAyersMS Jun 2, 2020

AndyAyersMS modified the milestones: 5.0, Future Jun 4, 2020

jakobbotsch mentioned this issue Dec 12, 2022

JIT: Generalized struct promotion #76928

Closed

40 tasks

jakobbotsch assigned jakobbotsch and unassigned AndyAyersMS Jun 16, 2023

jakobbotsch modified the milestones: Future, 8.0.0 Jun 16, 2023

jakobbotsch mentioned this issue Jun 16, 2023

Span<T> wrapped in a struct isn't performing as fast as it could be #68797

Closed

jakobbotsch added the Priority:2 Work that is important, but not critical for the release label Jun 20, 2023

jakobbotsch mentioned this issue Jun 27, 2023

JIT: Enable physical promotion by default #88090

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Jun 27, 2023

jakobbotsch closed this as completed in #88090 Jun 29, 2023

jakobbotsch closed this as completed in 9dcc7b1 Jun 29, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 29, 2023

ghost locked as resolved and limited conversation to collaborators Jul 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jit: Sub-optimal code when function is inlined #32415

Jit: Sub-optimal code when function is inlined #32415

Thealexbarney commented Feb 17, 2020 •

edited

Loading

AndyAyersMS commented Feb 19, 2020

AndyAyersMS commented May 6, 2020

AndyAyersMS commented Jun 4, 2020

erozenfeld commented Jun 5, 2020

jakobbotsch commented Dec 12, 2022

jakobbotsch commented Jun 16, 2023 •

edited

Loading

Jit: Sub-optimal code when function is inlined #32415

Jit: Sub-optimal code when function is inlined #32415

Comments

Thealexbarney commented Feb 17, 2020 • edited Loading

AndyAyersMS commented Feb 19, 2020

AndyAyersMS commented May 6, 2020

AndyAyersMS commented Jun 4, 2020

erozenfeld commented Jun 5, 2020

jakobbotsch commented Dec 12, 2022

jakobbotsch commented Jun 16, 2023 • edited Loading

Thealexbarney commented Feb 17, 2020 •

edited

Loading

jakobbotsch commented Jun 16, 2023 •

edited

Loading