Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cheaper frame introspection #1011

Merged
merged 5 commits into from
Dec 4, 2015
Merged

Conversation

undingen
Copy link
Contributor

This splits up the handling of deopts and normal frame introspection (e.g. for a traceback).
We have to add to nearly all call sites frame introspection which makes it very important that it does not introduce much overhead over a normal call instruction.
By always storing the user visible variables into a vregs array (layout the same as in the interpreter/bjit) we can make introspection cheaper.
Frame introspection only needs to access user facing variables therefore we don't have to generate extra bytes for spilling variables which get clobbered in the callee because all values we need to access are inside the vregs array.
This let's use remove the 95byte overhead and reduces the stackmap size.
It adds a slight cost of maintaining the vregs array but we were already doing some of this work before with our manual spilling with the additional benefit of faster frame introspection.

The deopts case stays pretty much the same with the exception that we don't add the user visible vars to the stackmap because they are already in the vreg.
We could reduce the overhead by implementing a special "deopt()" function in asm which stores and restores all variables thereby we would not have to manually spill the registers when filling the deopt IC.
Alternatively we could handle it inside llvm by either switching to a stackmap intrinsic which already supports this case or adding it if does not exist...
But I think it's not worth it because deopts should be uncommon...

Before implementing this I thought I would take the "let llvm handle the spilling" approach but I thought that it would introduce so many spills on every call sites that it won't be much faster than explicitly setting the array and it would have had the disadvantage that we don't have a fixed layout which makes future inlining/fast locals() retrieval/etc harder to implement.

I think in the future this approach will also allow inlining of C call sites. The missing piece is just a way to assign a particular IP range to a AST_stmt*. (similar to the EH table mechanism) + changing the source to create a stackmap on the entry of the function which let's us know where the frame_info lives... It maybe possible to just generate a stackmap with the current statement as arg in front and after the call instruction (maybe also add the frame_info or vregs array as arg so that llvm knows it escapes and can't move instructions around).

@undingen undingen force-pushed the new_frame_intro2 branch 4 times, most recently from bba7171 to 029b567 Compare November 18, 2015 16:15
@undingen
Copy link
Contributor Author

@kmod will this has currently still at least one problem: it attaches the PASSED_CLOSURE_NAME, PASSED_GLOBALS_NAME, ... as stackmap args on call sites and expect them to not be in registers which get clobbered by a call - looks like this does currently never happen that why it works but it's better to make this more robust by explicitly storing this values either inside the frame_info or vregs array).
could you please read my previous comment and maybe even take a look at the draft change to make sure that you agree with this approach before I fix the mentioned issue and get this PR ready for review/merging.

@kmod
Copy link
Collaborator

kmod commented Nov 18, 2015

couple random thoughts

  • What if we stack-allocated space for some vregs (just user-visible? just non user-visible? all?) at a fixed offset from the base pointer? My theory is that llvm is already spilling all the variables to the stack, so maybe this won't have too much of a cost. Maybe it would help if we also loaded them from the stack, in addition to just storing them there (so that LLVM doesn't have to keep a second copy around).
  • It might be overkill, but it might be nice to do more investigation into why we're seeing the slowdown. I guess my current theory is that we end up double-spilling variables (once to the frame vregs, and once to the stack spill slot), but maybe it's GC overhead, or extra instructions, or cache misses...
  • I'm still surprised that storing the statement is such a cost. I wonder if there are some small optimizations we could do -- maybe we could just store it to the stack frame_Info object rather than in a GC allocation? Also maybe we could create an array of all the AST_stmt* pointers, and instead of storing the statement pointer we store the index into the array.

Anyway, just some small thoughts, I think the overall approach sounds good :)

@undingen undingen force-pushed the new_frame_intro2 branch 5 times, most recently from d542dba to e218661 Compare November 20, 2015 16:33
This splits up the handling of deopts and normal frame introspection (e.g. for a traceback).
We have to add to nearly all call sites frame introspection which makes it very important that it does not introduce much overhead over a normal call instruction.
By always storing the user visible variables into a vregs array (layout the same as in the interpreter/bjit) we can make introspection cheaper.
Frame introspection only needs to access user facing variables therefore we don't have to generate extra bytes for spilling variables which get clobbered in the callee because all values we need to access are inside the vregs array.
This let's use remove the 95byte overhead and reduces the stackmap size.
It adds a slight cost of maintaining the vregs array but we were already doing some of this work before with our manual spilling with the additional benefit of faster frame introspection.

The deopts case stays pretty much the same with the exception that we don't add the user visible vars to the stackmap because they are already in the vreg.
We could reduce the overhead by implementing a special "deopt()" function in asm which stores and restores all variables thereby we would not have to manualy spill the registers when filling the deopt IC.
Alternatively we could handle it inside llvm by either switching to a stackmap intrinsic which already supports this case or adding it it does not exist...
But I think it's not worth it because deopts should be uncommen...
@undingen
Copy link
Contributor Author

This PR is now ready for review.
I'm still experimenting with the regalloc and explicitly storing the current statement but I think this stuff is independent of this PR and should be in a separate PR.

Overall the perf change is not very large but I'm still happy with it and confident because this change should mainly help in speeding up not yet merged patches for signal handling, #1007 and allow future optimization.

Perf comparison 1. run with cleared cache:

              pyston (calibration)                      :   1.02s old: 1.01 (  0.9%)
              pyston django_template3.py                :   2.67s old: 2.83 ( -5.7%)
              pyston pyxl_bench.py                      :   2.11s old: 2.27 ( -6.8%)
              pyston pyxl_bench2.py                     :   1.07s old: 1.15 ( -7.4%)
              pyston sqlalchemy_imperative2.py          :   2.67s old: 2.92 ( -8.4%)
              pyston django_template3_10x.py            :  12.70s old: 13.41 ( -5.3%)
              pyston pyxl_bench_10x.py                  :  12.89s old: 12.97 ( -0.6%)
              pyston pyxl_bench2_10x.py                 :   9.00s old: 8.96 (  0.4%)
              pyston sqlalchemy_imperative2_10x.py      :  16.20s old: 16.57 ( -2.3%)
              pyston (geomean-3133)                     :   2.00s old: 2.15 ( -7.1%)

Perf best of 3 runs:

                                   upstream/master:  origin/new_frame_i:
           django_template3.py             2.1s (4)             2.1s (4)  -0.8%
                 pyxl_bench.py             1.8s (4)             1.7s (4)  -3.9%
     sqlalchemy_imperative2.py             2.3s (4)             2.2s (4)  -2.0%
                pyxl_bench2.py             1.1s (4)             1.1s (4)  -0.1%
       django_template3_10x.py            13.8s (4)            13.8s (4)  -0.2%
             pyxl_bench_10x.py            13.6s (4)            13.5s (4)  -1.0%
 sqlalchemy_imperative2_10x.py            17.4s (4)            16.9s (4)  -2.8%
            pyxl_bench2_10x.py             9.3s (4)             9.2s (4)  -1.7%
                       geomean                 4.8s                 4.7s  -1.6%

@undingen undingen changed the title [WIP] Cheaper frame introspection Cheaper frame introspection Nov 20, 2015

CFG* cfg = source_info->cfg;

// Note: due to some (avoidable) restrictions, this check is pretty constrained in where
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but it seems like this comment should stay where it was?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nevermind, should we just kill the old calculateNumVRegs?

@kmod
Copy link
Collaborator

kmod commented Dec 4, 2015

lgtm :) I think it'd still be good to test loading from the vregs array instead of storing the llvm value -- ie in evalName, have something like loadVregIfUserVisible() instead of returning the llvm::Value that we stored in the symbol_table.

@kmod kmod added the accepted label Dec 4, 2015
@undingen
Copy link
Contributor Author

undingen commented Dec 4, 2015

I added a commit which makes more clear where we call computeCFG.

Concerning loading the values from the vregs. I tried that but it looked like it does not influence the performance. Maybe it will improve the performance when we teach LLVMs AA that the vregs don't get modified when doing a call/patchpoint to an external function and it's able to propagate store/loads.

undingen added a commit that referenced this pull request Dec 4, 2015
@undingen undingen merged commit ff657c1 into pyston:master Dec 4, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants