-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The GC root placement pass 1.0 deserves #21888
Conversation
Sounds really nice. Does this have any implications for temporary buffer reuse? (For example, if the root can be placed in the caller...) I'm thinking of functions along the lines of function mymedian(v)
vc = copy(v)
sort!(vc) # just for illustration, this is not optimal...
inds = linearindices(vc)
vc[(first(inds)+last(inds)) ÷ 2]
end for which, if you call it repeatedly with many small vectors, currently exacts a non-trivial GC cost that can be mitigated by pre-allocating a temporary buffer and passing it in as an argument. |
@@ -2362,13 +2370,15 @@ static void simple_escape_analysis(jl_value_t *expr, bool esc, jl_codectx_t *ctx | |||
// Emit a gc-root slot indicator | |||
static Value *emit_local_root(jl_codectx_t *ctx, jl_varinfo_t *vi) | |||
{ | |||
CallInst *newroot = CallInst::Create(prepare_call(gcroot_func), "", /*InsertBefore*/ctx->ptlsStates); | |||
Instruction *newroot = new AllocaInst(T_prjlvalue, 0, "gcroot", /*InsertBefore*/ctx->ptlsStates); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for performance we will want to do:
#ifndef NDEBUG
newroot->setName("gcroot");
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably.
src/codegen.cpp
Outdated
int slot = 0; | ||
assert(args.size() > 0); | ||
auto lifetime_start = Intrinsic::getDeclaration(jl_Module, Intrinsic::lifetime_start, {T_pprjlvalue}); | ||
builder.CreateCall(lifetime_start, { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LLVM37+ syntax here, but elsewhere you seem to be maintaining LLVM33 support. Seems like the PR should just be consistent about dropping support?
} | ||
|
||
JL_FEAT_REQUIRE(ctx, runtime); | ||
Value *varg1 = boxed(arg1, ctx); | ||
Value *varg2 = boxed(arg2, ctx, false); // potentially unrooted! | ||
Value *varg1 = mark_callee_rooted(boxed(arg1, ctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't callee-rooted (and the creation of varg2
could allocate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine, the pass will still keep it live at any safepoint that it's life across. It just won't put it in a gc root if a safepoint terminates the live interval.
src/codegen.cpp
Outdated
@@ -2693,12 +2720,13 @@ static bool emit_builtin_call(jl_cgval_t *ret, jl_value_t *f, jl_value_t **args, | |||
} | |||
if (jl_subtype(ty, (jl_value_t*)jl_type_type)) { | |||
*ret = emit_expr(args[1], ctx); | |||
Value *rt_ty = boxed(emit_expr(args[2], ctx), ctx); | |||
Value *rt_ty = mark_callee_rooted(boxed(emit_expr(args[2], ctx), ctx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't callee-rooted unless boxed
tells you that it was
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above. The placement pass figures it out. The callee rooted annotation is just to annotate that the only thing you're allowed to do with it is pass it to a function that callee roots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but type-assert also doesn't root this, so it does need to be rooted at the call safepoint by the caller / codegen
src/codegen.cpp
Outdated
retval.V), | ||
retval.gcroot); | ||
builder.CreateStore(box, | ||
retval.gcroot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment here seems odd. seems like this could fit on the one line
@@ -129,12 +130,16 @@ void addOptimizationPasses(PassManager *PM) | |||
if (jl_options.opt_level == 0) { | |||
PM->add(createCFGSimplificationPass()); // Clean up disgusting code | |||
PM->add(createMemCpyOptPass()); // Remove memcpy / form memset | |||
PM->add(createLowerPTLSPass(imaging_mode)); | |||
#if JL_LLVM_VERSION >= 40000 | |||
PM->add(createAlwaysInlinerLegacyPass()); // Respect always_inline | |||
#else | |||
PM->add(createAlwaysInlinerPass()); // Respect always_inline | |||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we actually confident that this can't break your gc-invariants and end up trying to run that pass on external code that it wasn't supposed to try to run on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It only looks at our address spaces. If external code uses our address spaces, it needs to follow our invariants.
src/jitlayers.cpp
Outdated
@@ -489,8 +501,13 @@ JuliaOJIT::JuliaOJIT(TargetMachine &TM) | |||
addOptimizationPasses(&PM); | |||
} | |||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we write this as addOptimizationPasses(&PM, /* -O */0)
, and make the jl_options.opt_level
option to that function explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
doc/src/devdocs/llvm.md
Outdated
``` | ||
The GC root placement pass will treat the jl_roots operand bundle as if it were | ||
a regular operand. However, as a final step, after the gc roots are inserted, | ||
it will drop the operand bundle to avoid confusing codegen. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"avoid confusing machine instruction selection"?
Yes and no. It makes it a lot easier to do that kind of transformation at the LLVM level, as well as making LLVM-level inlining easier. Those two combined could easily achieve what you want in your example. However, by itself this doesn't really do much there. |
doc/src/devdocs/llvm.md
Outdated
|
||
Minimize the number of needed gc roots/stores to them subject to the constraint | ||
that at every safepoint, any live gc-tracked pointer (i.e. is a path after this | ||
point that contains a use of this pointer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, should have is in a gc slot
;).
for the function. As a result, the external rooting must be arranged while the | ||
value is still tracked by the system. I.e. it is not valid to attempt use the | ||
result of this operation to establish a global root - the optimizer may have | ||
already dropped the value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make any sense for pointer_from_objref
to return a pointer in a GC-aware address space? Would that allow LLVM to automatically make code using such pointers safe(r)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the whole point of that function is to escape the GC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This escape semantic would also be required so that we can stack allocate RefValue{bitstype}
in ccall.
src/llvm-late-gc-lowering.cpp
Outdated
|
||
Minimize the number of needed gc roots/stores to them subject to the constraint | ||
that at every safepoint, any live gc-tracked pointer (i.e. is a path after this | ||
point that contains a use of this pointer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same incomplete sentence.
src/llvm-late-gc-lowering.cpp
Outdated
This step performs necessary cleanup before passing the IR to codegen. In | ||
particular, it removes any calls to julia_from_objref intrinsics and | ||
removes the extra operand bundles from ccalls. In the future it could | ||
also strip the addressspace information from all values as this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think "address space" is one word
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the llvm annotation is addrspace
, so I should probably just write that.
Great documentation and comments, btw. This is the standard we should all strive for :) |
8d19b38
to
eb49b3d
Compare
Rebased and backported to 3.9.1. On 3.9.1, it needs to run very early in the pipeline, so the improvements are not nearly as large. So I guess just think that that as an incentive to move to 5.0 post haste. I don't intend to support any prior LLVM version. |
We should definitely move master to LLVM 5 ASAP since we know that causes all sorts of havoc and needs a long time to settle into stability. |
It's also a good thing to do this before we introduce too many other incompatibilities: the package ecosystem is a good test bed for shaking out LLVM bugs, but most people won't update packages to 0.7 until 0.7 is looking imminent. Consequently, the longer we wait, the more packages will be untestable for reasons unrelated to the LLVM update. |
Fixed the 32bit build. The windows build seems to be failing because the LLVM patches I added here aren't being applied. @tkelman how do we rebuild the windows CI LLVM binaries? |
I have to build them by hand, though now that we build llvm as a shared library we could maybe get them from the nightlies (with some fiddling needed for headers?) |
Not sure what's up with the OS X build. Passes locally. I suspect it'll also need an LLVM version bump. |
OS X travis gets its llvm binary from the juliadeps homebrew tap |
@tkelman @staticfloat the patches here have been merged on master. Can we build new copies of LLVM for Win/OSX from the patchset that's currently on master? |
@staticfloat let me know when the bottles are built so I can kick off CI again here. |
Hmmm. Compilation failed:
Is that coming from one of our patches? This is being built with |
|
Ah, I see what's going on. We have further patches we need to include to get rid of the Is there any reason I should get |
That's starting to look much better. Still a couple left to fix, though. I know this is a pain to rebase, but please don't merge too quickly. There's many new commits (and llvm passes) on here since it's been last reviewed. |
src/llvm-propagate-addrspaces.cpp
Outdated
CI->setMetadata(MD.first, MD.second); | ||
CI->setDebugLoc(MI.getDebugLoc()); | ||
#endif | ||
ToInsert.push_back(std::make_pair(CI, &MI)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
src/llvm-propagate-addrspaces.cpp
Outdated
for (const auto &MD : TheMDs) | ||
CI->setMetadata(MD.first, MD.second); | ||
CI->setDebugLoc(MTI.getDebugLoc()); | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
src/llvm-propagate-addrspaces.cpp
Outdated
Value *TheFn = Intrinsic::getDeclaration(MTI.getModule(), MTI.getIntrinsicID(), | ||
{Dest->getType(), Src->getType(), | ||
MTI.getOperand(2)->getType()}); | ||
CallInst *CI = CallInst::Create(TheFn, {Dest, Src, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'll be better to just do MTI->setCalledFunction(TheFn); setArgOperand(0, Dest); setArgOperand(1, Src);
. That way, there's fewer things that need to be moved over to get wrong / outdated.
http://llvm.org/docs/doxygen/html/classllvm_1_1CallInst.html#a61f747edd1ca001427e02e02a320b709
src/llvm-propagate-addrspaces.cpp
Outdated
for (;it != GEP->op_end(); ++it) { | ||
Operands.push_back(*it); | ||
} | ||
NewGEP->mutateType(GetElementPtrInst::getGEPReturnType(GEPTy, CurrentV, Operands)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this clone mutation may be clearer / simpler if expressed as NewGEP->mutateType(PointerType::getUnqual(cast<PointerType>(GEP->getType())->getElementType(), 0))
Design notes are in the devdocs. Algorithmic documentation in code comments.
Details are in the devdocs. This scheme is signfificantly simpler.
Some LLVM passes don't handle addrspacecast too well, so try to minimize addrspace cast transitions where legal according to our invariants.
In the previous iteration of this code, timing would be double counted. This way, the individual passes are correctly registered with the top level manager, so timing works properly. It's a bit hacky but with the legacy pass manager, there's not much else to be done.
Also avoid numbering arguments early in the pipeline. Improves performance on small test cases without safepoints.
I've squashed some of the cleanup commits in the middle, addressed the latest round of feedback and addressed the latest rounds of comments. There shouldn't really be any changes to the generated code, but I'll let CI run through just in case (also in case there's interactions with master). I do plan to merge this as soon as CI is through though. |
GetElementPtrInst *NewGEP = cast<GetElementPtrInst>(GEP->clone()); | ||
ToInsert.push_back(std::make_pair(NewGEP, GEP)); | ||
Type *GEPTy = GEP->getSourceElementType(); | ||
Type *NewRetTy = cast<PointerType>(GEP->getType())->getElementType()->getPointerTo(getValueAddrSpace(CurrentV)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're putting an addrspace on this, there should be one on the BitCast below too, otherwise the BitCast inst would be attempting to change AddrSpace also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice the adddrspace is always 0 (on non-gpu targets anyway), so it's fine, but sure, I'll add the addrspace to the bitcast as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After merging that is. Otherwise, by the time we're through the CI queue again I'll have to rebase yet again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few minor typos etc, can be addressed in either a ci skip commit or later followup
jlcall calling convention. This allows us to retain the SSA-ness of the | ||
uses throughout the optimizer. GC root placement will later lower this call to | ||
the original C ABI. In the code the calling convention number is represented by | ||
the `JLCALL_F_CC` constant. In addition, there ist the `JLCALL_CC` calling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is the
|
||
## GC root placement | ||
|
||
GC root placement is done by an LLVM late in the pass pipeline. Doing GC root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an LLVM pass late in the pipeline
placement this late enables LLVM to make more aggressive optimizations around | ||
code that requires GC roots, as well as allowing us to reduce the number of | ||
required GC roots and GC root store operations (since LLVM doesn't understand | ||
our GC, it wouldn't otherwise know what it is and is not allowed to do with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not allowed to do what?
to always be discardable without altering the semantics of the program. However, | ||
failing to identify a gc-tracked pointer alters the resulting program behavior | ||
dramatically - it'll probably crash or return wrong results. We currently use | ||
three different addressspaces (their numbers are defined in src/codegen_shared.cpp): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
address spaces (used as 2 words in the rest of this paragraph)
|
||
First, only the following addressspace casts are allowed | ||
- 0->{Tracked,Derived,CalleeRooted}: It is allowable to decay an untracked pointer to any of the | ||
other. However, do note that the optimizer has broad license to not root |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
others
|
||
4. GC Root coloring | ||
|
||
Two values which are not simulataneously live at a safepoint can share the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simultaneously
Two values which are not simulataneously live at a safepoint can share the | ||
same slot. This is an important optimization, because otherwise long | ||
functions would have exceptionally large GC slots, reducing performance | ||
and bloating the size of the stack. Assigning values to these slots is, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no comma after is
(this is beneficial for code where the primary path does not have | ||
safepoints, but some other path - e.g. the error path does). However, | ||
if the first safepoint is not dominated by the definition (this can | ||
happen due to the non-ssa corner cases), the store is insert right after |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is inserted
of the algorithm and their operands as uses of those values. It is | ||
important to consider however WHERE the uses of PHI's operands are | ||
located. It is neither at the start of the basic block, because the values | ||
do not dominated the block (so can't really consider them live-in), nor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not dominate
%Bbase, %B] | ||
|
||
We then pretend, for the purposes of numbering that %phi was derived from | ||
%philift. Note that in order to be able to this, we need to be able to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in order to be able to do this,
Will do various cleanups in a followup PR. |
As of the merge of #21888, we no longer support building on LLVM <= 3.9.1. This is essentialyl the mechanical cleanup to rip out all code that we had to support building on older LLVM versions. There may still be some residual support left in places and of course some things can now be cleaned up further, but this should get us started.
As of the merge of #21888, we no longer support building on LLVM <= 3.9.1. This is essentially the mechanical cleanup to rip out all code that we had to support building on older LLVM versions. There may still be some residual support left in places and of course some things can now be cleaned up further, but this should get us started.
As of the merge of #21888, we no longer support building on LLVM <= 3.9.1. This is essentially the mechanical cleanup to rip out all code that we had to support building on older LLVM versions. There may still be some residual support left in places and of course some things can now be cleaned up further, but this should get us started.
Item 1/2 on my 1.0 must have list:
This is a principled rewrite of our GC root placement pass to occur late in the pass pipeline rather than early as well as produce a more optimal placement of GC roots. Additionally, it will allow us to perform much more general transformations at the LLVM stage, because we no longer have to worry about potentially introducing unrooted values, etc. It also allows the LLVM optimizer itself to do significantly more aggressive optimizations in the presence of boxed objects.
Currently this is written again LLVM 5.0 and does use features that are new to that release, as well as the following LLVM patches:
https://reviews.llvm.org/D32593
https://reviews.llvm.org/D33110
https://reviews.llvm.org/D33129
https://reviews.llvm.org/D33179
I do think however that I can backport this to 4.0, albeit inserted closer the start of the pipeline. That wouldn't give quite as much benefit, but it should be equivalent or slightly better to what we have right now.
I have focused on correctness over compile time performance in the implementation here and I know the algorithm is not optimal (it's fair to say it's pretty naive). Additionally, there is a significant number of follow on optimizations, as well as codegen cleanup that is yet to be done. However, this is already a pretty big change, so I'm saving those for after this is merged.
Design notes are in the devdocs, implementation notes as code comments in the placement pass. There's also tests for all the tricky corner cases I found.
Fixes #20981
Fixes #15369
Supersedes #21158