-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cheaper frame introspection #1011
Conversation
bba7171
to
029b567
Compare
…ICSetupInfo inside the PatchpointInfo
029b567
to
c96a924
Compare
@kmod will this has currently still at least one problem: it attaches the PASSED_CLOSURE_NAME, PASSED_GLOBALS_NAME, ... as stackmap args on call sites and expect them to not be in registers which get clobbered by a call - looks like this does currently never happen that why it works but it's better to make this more robust by explicitly storing this values either inside the frame_info or vregs array). |
couple random thoughts
Anyway, just some small thoughts, I think the overall approach sounds good :) |
d542dba
to
e218661
Compare
This splits up the handling of deopts and normal frame introspection (e.g. for a traceback). We have to add to nearly all call sites frame introspection which makes it very important that it does not introduce much overhead over a normal call instruction. By always storing the user visible variables into a vregs array (layout the same as in the interpreter/bjit) we can make introspection cheaper. Frame introspection only needs to access user facing variables therefore we don't have to generate extra bytes for spilling variables which get clobbered in the callee because all values we need to access are inside the vregs array. This let's use remove the 95byte overhead and reduces the stackmap size. It adds a slight cost of maintaining the vregs array but we were already doing some of this work before with our manual spilling with the additional benefit of faster frame introspection. The deopts case stays pretty much the same with the exception that we don't add the user visible vars to the stackmap because they are already in the vreg. We could reduce the overhead by implementing a special "deopt()" function in asm which stores and restores all variables thereby we would not have to manualy spill the registers when filling the deopt IC. Alternatively we could handle it inside llvm by either switching to a stackmap intrinsic which already supports this case or adding it it does not exist... But I think it's not worth it because deopts should be uncommen...
3480dd3
to
7b24661
Compare
This PR is now ready for review. Overall the perf change is not very large but I'm still happy with it and confident because this change should mainly help in speeding up not yet merged patches for signal handling, #1007 and allow future optimization. Perf comparison 1. run with cleared cache:
Perf best of 3 runs:
|
|
||
CFG* cfg = source_info->cfg; | ||
|
||
// Note: due to some (avoidable) restrictions, this check is pretty constrained in where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, but it seems like this comment should stay where it was?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nevermind, should we just kill the old calculateNumVRegs?
lgtm :) I think it'd still be good to test loading from the vregs array instead of storing the llvm value -- ie in evalName, have something like |
9faa7ac
to
199448f
Compare
I added a commit which makes more clear where we call Concerning loading the values from the vregs. I tried that but it looked like it does not influence the performance. Maybe it will improve the performance when we teach LLVMs AA that the vregs don't get modified when doing a call/patchpoint to an external function and it's able to propagate store/loads. |
This splits up the handling of deopts and normal frame introspection (e.g. for a traceback).
We have to add to nearly all call sites frame introspection which makes it very important that it does not introduce much overhead over a normal call instruction.
By always storing the user visible variables into a vregs array (layout the same as in the interpreter/bjit) we can make introspection cheaper.
Frame introspection only needs to access user facing variables therefore we don't have to generate extra bytes for spilling variables which get clobbered in the callee because all values we need to access are inside the vregs array.
This let's use remove the 95byte overhead and reduces the stackmap size.
It adds a slight cost of maintaining the vregs array but we were already doing some of this work before with our manual spilling with the additional benefit of faster frame introspection.
The deopts case stays pretty much the same with the exception that we don't add the user visible vars to the stackmap because they are already in the vreg.
We could reduce the overhead by implementing a special "deopt()" function in asm which stores and restores all variables thereby we would not have to manually spill the registers when filling the deopt IC.
Alternatively we could handle it inside llvm by either switching to a stackmap intrinsic which already supports this case or adding it if does not exist...
But I think it's not worth it because deopts should be uncommon...
Before implementing this I thought I would take the "let llvm handle the spilling" approach but I thought that it would introduce so many spills on every call sites that it won't be much faster than explicitly setting the array and it would have had the disadvantage that we don't have a fixed layout which makes future inlining/fast locals() retrieval/etc harder to implement.
I think in the future this approach will also allow inlining of C call sites. The missing piece is just a way to assign a particular IP range to a AST_stmt*. (similar to the EH table mechanism) + changing the source to create a stackmap on the entry of the function which let's us know where the frame_info lives... It maybe possible to just generate a stackmap with the current statement as arg in front and after the call instruction (maybe also add the frame_info or vregs array as arg so that llvm knows it escapes and can't move instructions around).