This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

WIP: Rematerialize constants in LSRA. #12250

Closed

wants to merge 1 commit

Conversation


@pgavlin pgavlin commented Jun 13, 2017

Rather than reloading spilled constants in LSRA, rematerialize them at the
point of use. This avoids the need to insert spill and reload code.

This change also tweaks the allocator's heuristics slightly:

  • constant intervals are only allocated a register at the point of def
    if there is an appropriate register available; otherwise, they are
    spilled and rematerialized at the point of use
  • ref positions that are freely rematerializable are assigned the lowest
    possible weight.
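The two heuristic tweaks above can be sketched roughly as follows. This is a hypothetical model, not the JIT's actual LSRA code: the `interval` dictionary, `allocate_constant_def`, and `code_at_use` names are invented for illustration.

```python
# Sketch: a constant interval only gets a register at its def if one is free;
# otherwise it is marked for rematerialization and the immediate is simply
# re-emitted at each use, so no spill/reload code is ever inserted.

def allocate_constant_def(interval, free_registers):
    """Return the register assigned at the def, or None to rematerialize."""
    if free_registers:
        reg = free_registers.pop()        # an appropriate register is available
        interval["reg"] = reg
        return reg
    interval["rematerialize"] = True      # cheapest possible "spill"
    return None

def code_at_use(interval):
    """Emit either a register read or a fresh constant load at the use."""
    if interval.get("rematerialize"):
        return f"mov r, {interval['value']}"   # reload the immediate directly
    return f"use {interval['reg']}"

interval = {"value": 0}
allocate_constant_def(interval, set())    # no registers free -> rematerialize
print(code_at_use(interval))              # -> "mov r, 0"
```

Because a rematerializable constant costs nothing to "spill", such ref positions can safely be given the lowest possible weight when competing for registers.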

@pgavlin
Author

pgavlin commented Jun 13, 2017

@CarolEidt PTAL.

I do not expect this to be the final form that this change will or should take: for example, the decision point for rematerialization needs to use better abstractions (currently it literally checks for spilled constant intervals). I think that this is a reasonable starting point, however.

From jit-diff x86/Release/Windows:

Summary:
(Note: Lower is better)

Total bytes of diff: -1657 (-0.02 % of base)
    diff is an improvement.

Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes

Top file regressions by size (bytes):
           4 : System.Runtime.Extensions.dasm (0.00 % of base)

Top file improvements by size (bytes):
        -524 : System.Private.CoreLib.dasm (-0.02 % of base)
        -241 : System.Runtime.Numerics.dasm (-0.42 % of base)
        -219 : System.Reflection.Metadata.dasm (-0.27 % of base)
        -187 : Microsoft.CodeAnalysis.CSharp.dasm (-0.01 % of base)
        -119 : Microsoft.CodeAnalysis.dasm (-0.02 % of base)

22 total files with size differences (21 improved, 1 regressed).

Top method regressions by size (bytes):
          19 : System.Reflection.Metadata.dasm - AssemblyTableReader:GetVersion():ref:this
          12 : System.Reflection.Metadata.dasm - AssemblyRefTableReader:GetVersion(int):ref:this
          12 : System.Runtime.Extensions.dasm - WebUtility:UrlEncodeToBytes(ref,int,int):ref
           5 : System.Reflection.Metadata.dasm - MemoryBlock:BinarySearchForSlot(int,int,int,int,bool):int:this
           4 : Microsoft.CodeAnalysis.VisualBasic.dasm - Scanner:RescanTrailingColonAsToken(byref,byref):this

Top method improvements by size (bytes):
        -147 : System.Runtime.Numerics.dasm - BigIntegerCalculator:ExtractDigits(byref,byref,byref,byref)
         -84 : Microsoft.CodeAnalysis.dasm - Compilation:ConstructModuleSerializationProperties(ref,ref,struct):ref:this
         -72 : System.Reflection.Metadata.dasm - MetadataReader:InitializeTableReaders(struct,ubyte,ref,ref):this
         -72 : System.Runtime.Numerics.dasm - BigIntegerCalculator:LehmerCore(byref,byref,long,long,long,long)
         -71 : System.Reflection.Metadata.dasm - EventDefinition:GetAccessors():struct:this

95 total methods with size differences (85 improved, 10 regressed).

Representative diffs:

 G_M34994_IG05:
        mov      dword ptr [ebp-78H], eax
        push     eax
        push     ebx
-       xor      edx, edx
-       mov      dword ptr [ebp-FCH], edx
        mov      ecx, dword ptr [ebp-28H]
        mov      dword ptr [ebp-94H], ecx
-       mov      edx, dword ptr [ebp-FCH]
+       xor      edx, edx
        mov      dword ptr [ebp-98H], edx
        push     edx
        push     dword ptr [ebp-94H]
        call     [CORINFO_HELP_LMUL]
        jae      G_M13864_IG10
        mov      edi, dword ptr [eax+4*ecx+4]
-       mov      dword ptr [ebp-A8H], edi
-       xor      edi, edi
-       mov      dword ptr [ebp-ACH], edi
-       mov      edi, dword ptr [ebp-A8H]
        mov      dword ptr [ebp-50H], edi
-       mov      edi, dword ptr [ebp-ACH]
+       xor      edi, edi
        mov      dword ptr [ebp-54H], edi
        lea      edi, [ecx-2]
        cmp      edi, esi
        jae      G_M13864_IG10

I also ran internal diffs on x64/Windows; these showed about a 30k improvement with 2 minor regressions.

@mikedn

mikedn commented Jun 13, 2017

Is it me or can this move zeroes forward? That xor edi, edi in your example changes condition flags and the IR produced by decomposition and lowering may not expect that to happen at that point.

@pgavlin
Author

pgavlin commented Jun 13, 2017

Is it me or can this move zeroes forward? That xor edi, edi in your example changes condition flags and the IR produced by decomposition and lowering may not expect that to happen at that point.

It can pull constants forward, yes. These are all constant nodes, though (e.g. GT_CNS_INT), which should be side-effect-free and therefore always safe to move. It would be news to me if the backend expects these nodes to have side effects besides writing the destination register.

@mikedn

mikedn commented Jun 13, 2017

which should be side-effect-free and therefore always safe to move

Well, we do not model condition flags. If you have GT_SUB_LO followed by GT_SUB_HI and a GT_CNS_INT(0) lands between them, we have a problem.
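The hazard described here can be made concrete with a small simulation. This is an illustrative model only (a single carry flag, 32-bit halves; the `Machine` class and function names are invented): on x86, GT_SUB_LO lowers to a SUB that sets the carry flag, GT_SUB_HI lowers to an SBB that consumes it, and `xor reg, reg` clears carry.

```python
# Model of a decomposed 64-bit subtraction, showing that a rematerialized
# "xor reg, reg" landing between the lo and hi halves clobbers the borrow.

MASK = 0xFFFFFFFF

class Machine:
    def __init__(self):
        self.carry = 0

    def sub_lo(self, a, b):            # like SUB: sets carry on borrow
        self.carry = 1 if (a & MASK) < (b & MASK) else 0
        return (a - b) & MASK

    def sub_hi(self, a, b):            # like SBB: consumes the carry
        return (a - b - self.carry) & MASK

    def xor_zero(self):                # like XOR reg, reg: clears carry!
        self.carry = 0
        return 0

def sub64(a, b, remat_between=False):
    m = Machine()
    lo = m.sub_lo(a & MASK, b & MASK)
    if remat_between:
        m.xor_zero()                   # constant zero landing mid-sequence
    hi = m.sub_hi(a >> 32, b >> 32)
    return (hi << 32) | lo

a, b = 1 << 32, 1                      # borrow propagates into the high half
print(hex(sub64(a, b)))                        # -> 0xffffffff (correct)
print(hex(sub64(a, b, remat_between=True)))    # -> 0x1ffffffff (wrong)
```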

@pgavlin
Author

pgavlin commented Jun 13, 2017

Well, we do not model condition flags. If you have GT_SUB_LO followed by GT_SUB_HI and a GT_CNS_INT(0) lands between them, we have a problem.

Hmm, that is an interesting point--thanks for clarifying. I'll have to think about that a bit.

@pgavlin
Author

pgavlin commented Jun 13, 2017

Yeah, it looks like this might be a non-starter without modeling the flags in LSRA: a relatively simple approach (i.e. using mov reg, 0 instead of xor reg, reg for rematerialized constant zeros) turns this into a regression. I'm trying something a bit more sophisticated now, but I'm not terribly hopeful.

@CarolEidt

Yeah, it looks like this might be a non-starter without modeling the flags in LSRA

I don't think it would be all that difficult to model the flags in LSRA in a very limited way for this scenario, since it is in fact linear. One possible approach would be that in Lowering we could tag all the nodes that produce flag values that are subsequently consumed. Then, during RefPosition building, any such node that is encountered would invalidate the rematerializability of any constant whose consuming node hasn't yet been encountered.
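The linear invalidation scheme proposed here can be sketched as follows. All names (`mark_rematerializable`, the node dictionaries) are hypothetical, not the JIT's actual RefPosition-building code: nodes tagged in Lowering as producing consumed flags invalidate the rematerializability of any constant whose consumer has not yet been reached.

```python
# Sketch: walk the block in linear order; a flags-producing node makes every
# still-pending constant (def seen, use not yet seen) non-rematerializable,
# since rematerializing it at the use could land between the flags def/use.

def mark_rematerializable(nodes):
    """nodes: list of dicts with 'kind' in ('const', 'flags_def', 'use');
    uses carry 'of', naming the constant they consume."""
    pending = {}                        # const name -> node dict
    for node in nodes:
        if node["kind"] == "const":
            node["remat"] = True
            pending[node["name"]] = node
        elif node["kind"] == "flags_def":
            # flags would be live across any pending constant's remat point
            for const in pending.values():
                const["remat"] = False
        elif node["kind"] == "use":
            pending.pop(node["of"], None)
    return nodes

block = [
    {"kind": "const", "name": "c0"},
    {"kind": "use", "of": "c0"},        # consumed before any flags def: ok
    {"kind": "const", "name": "c1"},
    {"kind": "flags_def"},              # e.g. a GT_SUB_LO producing a carry
    {"kind": "use", "of": "c1"},        # remat here would clobber the flags
]
mark_rematerializable(block)
print(block[0]["remat"], block[2]["remat"])   # -> True False
```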

@pgavlin
Author

pgavlin commented Jun 14, 2017

One possible approach would be that in Lowering we could tag all the nodes that produce flag values that are subsequently consumed. Then, during RefPosition building, any such node that is encountered would invalidate the rematerializability of any constant whose consuming node hasn't yet been encountered.

This is roughly the approach I was considering. We would also probably need to tag the nodes that consume previously-produced flag values.

@mikedn

mikedn commented Jun 14, 2017

One possible approach would be that in Lowering we could tag all the nodes that produce flag values that are subsequently consumed. Then, during RefPosition building, any such node that is encountered would invalidate the rematerializability of any constant whose consuming node hasn't yet been encountered.

What happens if the flags producer and consumer(s) are in different blocks?

@CarolEidt

What happens if the flags producer and consumer(s) are in different blocks?

Good point. We would presumably have to also tag blocks where the flags are live-in or live-out.

@pgavlin
Author

pgavlin commented Jun 14, 2017

Good point. We would presumably have to also tag blocks where the flags are live-in or live-out.

I happened to discuss flags modeling in the IR with @russellhadley a bit recently. We were kicking around the idea that we might treat the flags register as a special tracked lclVar (maybe not literally, if only to avoid kicking out other tracked locals). That would give us most of what we need here, though it may be a bit heavyweight for now. Tagging blocks and nodes as flags live in/out and def'd/used/consumed (respectively) seems like it should be good enough.

@mikedn

mikedn commented Jun 14, 2017

We were kicking around the idea that we might treat the flags register as a special tracked lclVar (maybe not literally, if only to avoid kicking out other tracked locals).

I thought about that but it seems problematic if you want to reuse the flags set by a non-compare instruction (ADD, OR etc.). Even if you don't need the actual value of the node (if you need it then you're dealing with a node that produces multiple values), you still need to distinguish between the actual value and the flags value.

I'm quite interested in this flags issue due to the various experiments I have in #10861

@pgavlin
Author

pgavlin commented Jun 14, 2017

I thought about that but it seems problematic if you want to reuse the flags set by a non-compare instruction (ADD, OR etc.). Even if you don't need the actual value of the node (if you need it then you're dealing with a node that produces multiple values), you still need to distinguish between the actual value and the flags value.

Can you clarify why this seems like it makes a lclVar-style solution problematic? We would simply end up with nodes that are modeled as producing an SDSU temp as well as writing to the flags "lclVar".

@mikedn

mikedn commented Jun 14, 2017

as well as writing to the flags "lclVar"

Is writing to that special lclvar implicit rather than explicit?

@pgavlin
Author

pgavlin commented Jun 14, 2017

Is writing to that special lclvar implicit rather than explicit?

If by implicit you mean "not represented by a STORE_LCL_VAR node", then yes, it would have to be.

@pgavlin
Author

pgavlin commented Jun 14, 2017

(without some probably-invasive IR changes, that is)

@mikedn

mikedn commented Jun 14, 2017

If by implicit you mean "not represented by a STORE_LCL_VAR node"

Yes, thanks. In that case it's not clear what a lclvar would buy us.

@CarolEidt

We were kicking around the idea that we might treat the flags register as a special tracked lclVar

It's unclear to me what a lclVar buys you. And it adds complexity to handle a lclVar that has neither a stack location nor a register (unless we go down the even more complex path of saving and restoring flag values).

@pgavlin
Author

pgavlin commented Jun 14, 2017

Yes, thanks. In that case it's not clear what a lclvar would buy us.

It's unclear to me what a lclVar buys you. And it adds complexity to handle a lclVar that has neither a stack location nor a register (unless we go down the even more complex path of saving and restoring flag values).

The flags register has lclVar-like semantics in that it is a multi-def, multi-use location that is read/written by certain operators. Representing it as an actual lclVar might obviate the need to add special-case code in the various places in which we will otherwise need to (especially liveness and side-effect analysis). This might just move these special cases to lclVars in general, but that could be beneficial if we were to decide e.g. to represent physical registers as lclVars as well.

@AndyAyersMS
Member

Isn't there already some mechanism to track physical register liveness? E.g. instructions that write specific or multiple registers, ABI kill sets, etc.?

Seems like this could possibly be extended to cover flag liveness instead of some virtualized resource.

Also note there are some cases where flags are partially written so it is perhaps more than read/write that needs modelling.

@pgavlin
Author

pgavlin commented Jun 14, 2017

Also note there are some cases where flags are partially written so it is perhaps more than read/write that needs modelling.

Yes, we could get more granular w.r.t. which flags are written. As a first pass I hope that we don't need to, but it's certainly worth considering.

@russellhadley

I've been following along and wanted to lay out what I think are the requirements/consequences of using execution order in LIR - which these first simple recalculation opts are hitting.

  • We will need a general mechanism for tracking flags/physical resource liveness so that reordering of operations can be safely performed. This could be as simple as an on-demand conservative liveness calculation utility (anticipated def/upward-exposed use), or a first-order construct that participates in liveness.
  • While it would be great to avoid it, some of these lifetimes will cross block boundaries - flags in particular for decomposition and branch optimizations. We will need some way to capture this in interblock liveness.

We're going to keep hitting this, and likely more often, now that we're going to start pushing for more perf in 2.1. New issue opened to track this: #12275
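The interblock piece of the first requirement above (conservative flags liveness with anticipated-def/upward-exposed-use facts) can be sketched as a standard backward dataflow over blocks. This is a hypothetical sketch, not CoreCLR code; all names are invented.

```python
# Sketch: tag blocks where flags are live-in/live-out, given per-block facts:
# 'use_before_def' (an upward-exposed flags use) and 'defs' (block kills flags).

def flags_liveness(blocks, succs):
    """blocks: name -> {'use_before_def': bool, 'defs': bool};
    succs: name -> list of successor names. Returns (live_in, live_out)."""
    live_in = {b: False for b in blocks}
    live_out = {b: False for b in blocks}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for b in blocks:
            out = any(live_in[s] for s in succs.get(b, []))
            inn = blocks[b]["use_before_def"] or (out and not blocks[b]["defs"])
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# A compare in B1 whose flags are consumed by a branch pattern in B2:
blocks = {
    "B1": {"use_before_def": False, "defs": True},
    "B2": {"use_before_def": True, "defs": False},
}
live_in, live_out = flags_liveness(blocks, {"B1": ["B2"]})
print(live_out["B1"], live_in["B2"])    # -> True True
```

Any block tagged live-out (like B1 here) would have to suppress constant rematerialization, or at least remat forms like xor that write the flags.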

@karelz
Member

karelz commented Aug 25, 2017

What is the status of this PR? 2+ months without any update ...

@pgavlin
Author

pgavlin commented Aug 25, 2017

In progress. Some prerequisites landed recently.

@karelz
Member

karelz commented Aug 25, 2017

Would it make sense to close it until it is ready for further work?

@karelz
Member

karelz commented Aug 30, 2017

Closing for now, feel free to reopen when you're ready to work on it again ...
