Cheaper frame introspection #1011

undingen · 2015-11-17T16:42:25Z

This splits up the handling of deopts and normal frame introspection (e.g. for a traceback).
We have to add to nearly all call sites frame introspection which makes it very important that it does not introduce much overhead over a normal call instruction.
By always storing the user visible variables into a vregs array (layout the same as in the interpreter/bjit) we can make introspection cheaper.
Frame introspection only needs to access user facing variables therefore we don't have to generate extra bytes for spilling variables which get clobbered in the callee because all values we need to access are inside the vregs array.
This let's use remove the 95byte overhead and reduces the stackmap size.
It adds a slight cost of maintaining the vregs array but we were already doing some of this work before with our manual spilling with the additional benefit of faster frame introspection.

The deopts case stays pretty much the same with the exception that we don't add the user visible vars to the stackmap because they are already in the vreg.
We could reduce the overhead by implementing a special "deopt()" function in asm which stores and restores all variables thereby we would not have to manually spill the registers when filling the deopt IC.
Alternatively we could handle it inside llvm by either switching to a stackmap intrinsic which already supports this case or adding it if does not exist...
But I think it's not worth it because deopts should be uncommon...

Before implementing this I thought I would take the "let llvm handle the spilling" approach but I thought that it would introduce so many spills on every call sites that it won't be much faster than explicitly setting the array and it would have had the disadvantage that we don't have a fixed layout which makes future inlining/fast locals() retrieval/etc harder to implement.

I think in the future this approach will also allow inlining of C call sites. The missing piece is just a way to assign a particular IP range to a AST_stmt*. (similar to the EH table mechanism) + changing the source to create a stackmap on the entry of the function which let's us know where the frame_info lives... It maybe possible to just generate a stackmap with the current statement as arg in front and after the call instruction (maybe also add the frame_info or vregs array as arg so that llvm knows it escapes and can't move instructions around).

…ICSetupInfo inside the PatchpointInfo

undingen · 2015-11-18T19:49:16Z

@kmod will this has currently still at least one problem: it attaches the PASSED_CLOSURE_NAME, PASSED_GLOBALS_NAME, ... as stackmap args on call sites and expect them to not be in registers which get clobbered by a call - looks like this does currently never happen that why it works but it's better to make this more robust by explicitly storing this values either inside the frame_info or vregs array).
could you please read my previous comment and maybe even take a look at the draft change to make sure that you agree with this approach before I fix the mentioned issue and get this PR ready for review/merging.

kmod · 2015-11-18T21:50:42Z

couple random thoughts

What if we stack-allocated space for some vregs (just user-visible? just non user-visible? all?) at a fixed offset from the base pointer? My theory is that llvm is already spilling all the variables to the stack, so maybe this won't have too much of a cost. Maybe it would help if we also loaded them from the stack, in addition to just storing them there (so that LLVM doesn't have to keep a second copy around).
It might be overkill, but it might be nice to do more investigation into why we're seeing the slowdown. I guess my current theory is that we end up double-spilling variables (once to the frame vregs, and once to the stack spill slot), but maybe it's GC overhead, or extra instructions, or cache misses...
I'm still surprised that storing the statement is such a cost. I wonder if there are some small optimizations we could do -- maybe we could just store it to the stack frame_Info object rather than in a GC allocation? Also maybe we could create an array of all the AST_stmt* pointers, and instead of storing the statement pointer we store the index into the array.

Anyway, just some small thoughts, I think the overall approach sounds good :)

This splits up the handling of deopts and normal frame introspection (e.g. for a traceback). We have to add to nearly all call sites frame introspection which makes it very important that it does not introduce much overhead over a normal call instruction. By always storing the user visible variables into a vregs array (layout the same as in the interpreter/bjit) we can make introspection cheaper. Frame introspection only needs to access user facing variables therefore we don't have to generate extra bytes for spilling variables which get clobbered in the callee because all values we need to access are inside the vregs array. This let's use remove the 95byte overhead and reduces the stackmap size. It adds a slight cost of maintaining the vregs array but we were already doing some of this work before with our manual spilling with the additional benefit of faster frame introspection. The deopts case stays pretty much the same with the exception that we don't add the user visible vars to the stackmap because they are already in the vreg. We could reduce the overhead by implementing a special "deopt()" function in asm which stores and restores all variables thereby we would not have to manualy spill the registers when filling the deopt IC. Alternatively we could handle it inside llvm by either switching to a stackmap intrinsic which already supports this case or adding it it does not exist... But I think it's not worth it because deopts should be uncommen...

undingen · 2015-11-20T17:37:40Z

This PR is now ready for review.
I'm still experimenting with the regalloc and explicitly storing the current statement but I think this stuff is independent of this PR and should be in a separate PR.

Overall the perf change is not very large but I'm still happy with it and confident because this change should mainly help in speeding up not yet merged patches for signal handling, #1007 and allow future optimization.

Perf comparison 1. run with cleared cache:

              pyston (calibration)                      :   1.02s old: 1.01 (  0.9%)
              pyston django_template3.py                :   2.67s old: 2.83 ( -5.7%)
              pyston pyxl_bench.py                      :   2.11s old: 2.27 ( -6.8%)
              pyston pyxl_bench2.py                     :   1.07s old: 1.15 ( -7.4%)
              pyston sqlalchemy_imperative2.py          :   2.67s old: 2.92 ( -8.4%)
              pyston django_template3_10x.py            :  12.70s old: 13.41 ( -5.3%)
              pyston pyxl_bench_10x.py                  :  12.89s old: 12.97 ( -0.6%)
              pyston pyxl_bench2_10x.py                 :   9.00s old: 8.96 (  0.4%)
              pyston sqlalchemy_imperative2_10x.py      :  16.20s old: 16.57 ( -2.3%)
              pyston (geomean-3133)                     :   2.00s old: 2.15 ( -7.1%)

Perf best of 3 runs:

                                   upstream/master:  origin/new_frame_i:
           django_template3.py             2.1s (4)             2.1s (4)  -0.8%
                 pyxl_bench.py             1.8s (4)             1.7s (4)  -3.9%
     sqlalchemy_imperative2.py             2.3s (4)             2.2s (4)  -2.0%
                pyxl_bench2.py             1.1s (4)             1.1s (4)  -0.1%
       django_template3_10x.py            13.8s (4)            13.8s (4)  -0.2%
             pyxl_bench_10x.py            13.6s (4)            13.5s (4)  -1.0%
 sqlalchemy_imperative2_10x.py            17.4s (4)            16.9s (4)  -2.8%
            pyxl_bench2_10x.py             9.3s (4)             9.2s (4)  -1.7%
                       geomean                 4.8s                 4.7s  -1.6%

kmod · 2015-12-03T23:40:30Z

src/codegen/codegen.cpp

+
+    CFG* cfg = source_info->cfg;
+
+    // Note: due to some (avoidable) restrictions, this check is pretty constrained in where


Minor, but it seems like this comment should stay where it was?

Oh nevermind, should we just kill the old calculateNumVRegs?

kmod · 2015-12-04T00:47:16Z

lgtm :) I think it'd still be good to test loading from the vregs array instead of storing the llvm value -- ie in evalName, have something like loadVregIfUserVisible() instead of returning the llvm::Value that we stored in the symbol_table.

undingen · 2015-12-04T16:39:02Z

I added a commit which makes more clear where we call computeCFG.

Concerning loading the values from the vregs. I tried that but it looked like it does not influence the performance. Maybe it will improve the performance when we teach LLVMs AA that the vregs don't get modified when doing a call/patchpoint to an external function and it's able to propagate store/loads.

Cheaper frame introspection

undingen force-pushed the new_frame_intro2 branch 4 times, most recently from bba7171 to 029b567 Compare November 18, 2015 16:15

undingen added 3 commits November 18, 2015 16:15

move ICSetupInfo before PatchpointInfo in anticipation for accessing …

8e0e07c

…ICSetupInfo inside the PatchpointInfo

move calculateNumVRegs

b145b00

make isCompilerCreatedName() a method of InternedString

1a07ff8

undingen force-pushed the new_frame_intro2 branch from 029b567 to c96a924 Compare November 18, 2015 16:15

undingen force-pushed the new_frame_intro2 branch 5 times, most recently from d542dba to e218661 Compare November 20, 2015 16:33

undingen force-pushed the new_frame_intro2 branch from 3480dd3 to 7b24661 Compare November 20, 2015 17:37

undingen changed the title ~~[WIP] Cheaper frame introspection~~ Cheaper frame introspection Nov 20, 2015

undingen mentioned this pull request Dec 2, 2015

Don't generate patchpoints for calls to fixed functions which don't need patching #1021

Merged

kmod reviewed Dec 3, 2015
View reviewed changes

kmod added the accepted label Dec 4, 2015

make more obvious where we call computeCFG

199448f

undingen force-pushed the new_frame_intro2 branch from 9faa7ac to 199448f Compare December 4, 2015 14:56

undingen added a commit that referenced this pull request Dec 4, 2015

Merge pull request #1011 from undingen/new_frame_intro2

ff657c1

Cheaper frame introspection

undingen merged commit ff657c1 into pyston:master Dec 4, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cheaper frame introspection #1011

Cheaper frame introspection #1011

undingen commented Nov 17, 2015

undingen commented Nov 18, 2015

kmod commented Nov 18, 2015

undingen commented Nov 20, 2015

kmod Dec 3, 2015

kmod Dec 3, 2015

kmod commented Dec 4, 2015

undingen commented Dec 4, 2015


		CFG* cfg = source_info->cfg;

		// Note: due to some (avoidable) restrictions, this check is pretty constrained in where

Cheaper frame introspection #1011

Cheaper frame introspection #1011

Conversation

undingen commented Nov 17, 2015

undingen commented Nov 18, 2015

kmod commented Nov 18, 2015

undingen commented Nov 20, 2015

kmod Dec 3, 2015

Choose a reason for hiding this comment

kmod Dec 3, 2015

Choose a reason for hiding this comment

kmod commented Dec 4, 2015

undingen commented Dec 4, 2015