Support execution of code from external abstract interpreters #52964

vchuravy · 2024-01-18T16:20:23Z

Currently external abstract interpreters are succesfully used to power analysis,
compilation for external targets like GPU, but there currently doesn't exist a
mechanism within Julia to execute the output of these abstract interpreters and
the community has developed various work-arounds.

Enzyme.jl uses GPUCompiler.jl + LLVM.jl to spin up it's own external JIT and
uses ccall to call from Julia native to Enzyme controlled land. It furthermore
needs to detect dynamic callsites and replaces them to functions controlled by Enzyme.
CassetteOverlay.jl uses the "cassette" transform to replace all calls f(args...)
to calls to overdub(ctx, f, args...). While this has worked in Cassette.jl for a
long time, it also leads to hard to read backtraces and overly relies on generated functions.

This PR is a combination of two ideas:

The currently active compiler is a dynamically scoped value (implemented as a dedicated task-local for performance)
Use tagged CodeInstances for AbstractInterpreter caching #52233 unifies the cache infrastructure for native and external compilers, thus allows for easy querying inside the runtime and compiler

Concretly this PR introduces abstract type CompilerInstance end that allows for the creation of temporary AbstractInterpreter instances,
it then uses a new built-in call_within to switch between compiler instances, furthermore the compiler instance is also the owner of
the corresponding CodeInstances.

After that it is mostly and exercise of threading the compiler instance through in the right places.

TODO

Go through all uses of explicit jl_nothing as a owner token and use the correct one.
Expose the compiler instance to reflection methods
Add abstract interpreter support for call_within
Concrete evaluation will need to use call_within
Interpreter support? JuliaInterpreter.jl support? Cthulhu support?
Finalizers?

Example: Cassette style tracer

This PR doesn't do anything about how to work with the IR and write compiler plugins,
it also doesn't provide any hooks for compiler instances to modify the LLVM pipeline,
but a below is a prototype of a Cassette type tracer. Where we have a prehook
and a posthook function to execute over the callgraph.

const CC = Core.Compiler

import .CC: SSAValue, GlobalRef

struct Tracer <: CC.AbstractCompiler end
CC.abstract_interpreter(::Tracer, world::UInt) =
    TracerInterp(; world)

struct TracerInterp <: CC.AbstractInterpreter
    world::UInt
    inf_params::CC.InferenceParams
    opt_params::CC.OptimizationParams
    inf_cache::Vector{CC.InferenceResult}
    code_cache::CC.InternalCodeCache
    compiler::Tracer
    function TracerInterp(;
                world::UInt = Base.get_world_counter(),
                compiler::Tracer = Tracer(),
                inf_params::CC.InferenceParams = CC.InferenceParams(),
                opt_params::CC.OptimizationParams = CC.OptimizationParams(),
                inf_cache::Vector{CC.InferenceResult} = CC.InferenceResult[],
                code_cache::CC.InternalCodeCache = CC.InternalCodeCache(compiler))
        return new(world, inf_params, opt_params, inf_cache, code_cache, compiler)
    end
end

CC.InferenceParams(interp::TracerInterp) = interp.inf_params
CC.OptimizationParams(interp::TracerInterp) = interp.opt_params
CC.get_world_counter(interp::TracerInterp) = interp.world
CC.get_inference_cache(interp::TracerInterp) = interp.inf_cache
CC.code_cache(interp::TracerInterp) = CC.WorldView(interp.code_cache, CC.WorldRange(interp.world))
CC.cache_owner(interp::TracerInterp) = interp.compiler

import Core.Compiler: retrieve_code_info, maybe_validate_code
# Replace usage sited of `retrieve_code_info`, OptimizationState is one such, but in all interesting use-cases
# it is derived from an InferenceState. There is a third one in `typeinf_ext` in case the module forbids inference.
function CC.InferenceState(result::CC.InferenceResult, cache_mode::UInt8, interp::TracerInterp)
    world = CC.get_world_counter(interp)
    src = retrieve_code_info(result.linfo, world)
    src === nothing && return nothing
    maybe_validate_code(result.linfo, src, "lowered")
    src = transform(interp, result.linfo, src)
    maybe_validate_code(result.linfo, src, "transformed")
    return CC.InferenceState(result, src, cache_mode, interp)
end

##
# Cassette style early transform
##

# Allows for Cassette Pass transforms

function static_eval(mod, name)
    if Base.isbindingresolved(mod, name) && Base.isdefined(mod, name)
        return getfield(mod, name)
    else
        return nothing
    end
end

function prehook end
function posthook end

function transform(interp, mi, src)
    method = mi.def
    f = static_eval(method.module, method.name)
    ccall(:jl_, Cvoid, (Any,), mi)
    if f === Core._apply
        return src
    end
    if f isa Core.Builtin
        error("Transforming builtin")
    end
    # if f === prehook || f === posthook
    #     return src
    # end
    ci = copy(src)
    transform!(mi, ci)
    ci.ssavaluetypes = length(ci.code)
    # XXX we need to copy flags
    ci.ssaflags = [0x00 for _ in 1:length(ci.code)]
    # ccall(:jl_, Cvoid, (Any,), ci)
    return ci
end

function ir_element(x, code::Vector)
    while isa(x, Core.SSAValue)
        x = code[x.id]
    end
    return x
end

"""
    is_ir_element(x, y, code::Vector)

Return `true` if `x === y` or if `x` is an `SSAValue` such that
`is_ir_element(code[x.id], y, code)` is `true`.
See also: [`replace_match!`](@ref), [`insert_statements!`](@ref)
"""
function is_ir_element(x, y, code::Vector)
    result = false
    while true # break by default
        if x === y #
            result = true
            break
        elseif isa(x, Core.SSAValue)
            x = code[x.id]
        else
            break
        end
    end
    return result
end

"""
    insert_statements!(code::Vector, codelocs::Vector, stmtcount, newstmts)


For every statement `stmt` at position `i` in `code` for which `stmtcount(stmt, i)` returns
an `Int`, remove `stmt`, and in its place, insert the statements returned by
`newstmts(stmt, i)`. If `stmtcount(stmt, i)` returns `nothing`, leave `stmt` alone.

For every insertion, all downstream `SSAValue`s, label indices, etc. are incremented
appropriately according to number of inserted statements.

Proper usage of this function dictates that following properties hold true:

- `code` is expected to be a valid value for the `code` field of a `CodeInfo` object.
- `codelocs` is expected to be a valid value for the `codelocs` field of a `CodeInfo` object.
- `newstmts(stmt, i)` should return a `Vector` of valid IR statements.
- `stmtcount` and `newstmts` must obey `stmtcount(stmt, i) == length(newstmts(stmt, i))` if
    `isa(stmtcount(stmt, i), Int)`.

To gain a mental model for this function's behavior, consider the following scenario. Let's
say our `code` object contains several statements:
code = Any[oldstmt1, oldstmt2, oldstmt3, oldstmt4, oldstmt5, oldstmt6]
codelocs = Int[1, 2, 3, 4, 5, 6]

Let's also say that for our `stmtcount` returns `2` for `stmtcount(oldstmt2, 2)`, returns `3`
for `stmtcount(oldstmt5, 5)`, and returns `nothing` for all other inputs. From this setup, we
can think of `code`/`codelocs` being modified in the following manner:
newstmts2 = newstmts(oldstmt2, 2)
newstmts5 = newstmts(oldstmt5, 5)
code = Any[oldstmt1,
           newstmts2[1], newstmts2[2],
           oldstmt3, oldstmt4,
           newstmts5[1], newstmts5[2], newstmts5[3],
           oldstmt6]
codelocs = Int[1, 2, 2, 3, 4, 5, 5, 5, 6]

See also: [`replace_match!`](@ref), [`is_ir_element`](@ref)
"""
function insert_statements!(code, codelocs, stmtcount, newstmts)
    ssachangemap = fill(0, length(code))
    labelchangemap = fill(0, length(code))
    worklist = Tuple{Int,Int}[]
    for i in 1:length(code)
        stmt = code[i]
        nstmts = stmtcount(stmt, i)
        if nstmts !== nothing
            addedstmts = nstmts - 1
            push!(worklist, (i, addedstmts))
            ssachangemap[i] = addedstmts
            if i < length(code)
                labelchangemap[i + 1] = addedstmts
            end
        end
    end
    Core.Compiler.renumber_ir_elements!(code, ssachangemap, labelchangemap)
    for (i, addedstmts) in worklist
        i += ssachangemap[i] - addedstmts # correct the index for accumulated offsets
        stmts = newstmts(code[i], i)
        @assert length(stmts) == (addedstmts + 1)
        code[i] = stmts[end]
        for j in 1:(length(stmts) - 1) # insert in reverse to maintain the provided ordering
            insert!(code, i, stmts[end - j])
            insert!(codelocs, i, codelocs[i])
        end
    end
end

function transform!(mi, src)
    stmtcount = (x, i) -> begin
        isassign = Base.Meta.isexpr(x, :(=))
        stmt = isassign ? x.args[2] : x
        if Base.Meta.isexpr(stmt, :call)
            return 4
        end
        return nothing
    end
    newstmts = (x, i) -> begin
        callstmt = Base.Meta.isexpr(x, :(=)) ? x.args[2] : x
        isapplycall = is_ir_element(callstmt.args[1], GlobalRef(Core, :_apply), src.code)
        isapplyiteratecall = is_ir_element(callstmt.args[1], GlobalRef(Core, :_apply_iterate), src.code)
        if isapplycall || isapplyiteratecall
            callf = callstmt.args[2]
            callargs = callstmt.args[3:end]
            stmts = Any[
                Expr(:call,
                     GlobalRef(Core, :_call_within), nothing,
                     prehook, callf, callargs...),
                callstmt,
                Expr(:call,
                     GlobalRef(Core, :_call_within), nothing,
                     posthook, SSAValue(i + 1), callf, callargs...),
                Base.Meta.isexpr(x, :(=)) ? Expr(:(=), x.args[1], SSAValue(i + 1)) : SSAValue(i + 1)
            ]
        else
            stmts = Any[
                Expr(:call, GlobalRef(Core, :_call_within), nothing, prehook, callstmt.args...),
                callstmt,
                Expr(:call, GlobalRef(Core, :_call_within), nothing, posthook, SSAValue(i + 1), callstmt.args...),
                Base.Meta.isexpr(x, :(=)) ? Expr(:(=), x.args[1], SSAValue(i + 1)) : SSAValue(i + 1)
            ]
        end
        return stmts
    end
    insert_statements!(src.code, src.codelocs, stmtcount, newstmts)
    return nothing
end

# TODO:
# - anything called from prehook is still instrumented
#   - either we can stop instrumentation / or switch to native (high overhead)
# - Similar problem with invokelatest, we somehow got recursion
# - invokelatest trace sees prehook call -- do we double transform?

struct Call
    parent
    f
    args
    children
end

# TODO: Handle task-safety
const call_tree = ScopedValue{Ref{Call}}()

function prehook(f, args...)
    parent = call_tree[][]
    current = Call(parent, f, args, Call[])
    push!(parent.children, current)
    call_tree[][] = current
end

function posthook(_, f, args...)
    current = call_tree[][]
    call_tree[][] = current.parent
end

function f()
end

function trace(f, args...)
    top = Call(nothing, f, args, Call[])
    @with call_tree => Ref(top) begin
        Base.invoke_within(Tracer(), f, args...)
    end
    return top
end

trace(f)

vchuravy · 2024-01-18T19:28:10Z

One thing I particularly enjoyed about Cassette was that composition is well defined. This proposal ignores composition, compiler instances don't form a stack and there is no expectation that the output of one is meant to be consumed by another.

For Cassette uninferred IR was a viable communication layer, but with compiler instances customization can occur along many levels and we run into the pipeline ordering problem. The hope would be that using compiler instance we could build an actual "compiler plugins" infrastructure that allows for the registration of passes/intrinsics and provides sensible composition, but that seems further away and I do think we need some experimentation with compiler customization first before we tackle that.

1. Introduced a task-local inherited (nee ScopedValue) compiler field 2. Allow compilation and execution of code generated by foreign abstract interpreters A new primitive `call_within` is introduced that switches the compiler instance. The compiler instance is used for cache-lookups, compilation request, and dispatch.

vchuravy · 2024-04-24T19:37:23Z

test/compiler/plugins.jl

+
+# FIXME: Currently doesn't infer and ends in "Skipped call_within since compiler plugin not constant"
+overlay(f, args...) = CustomMethodTables.overlay(CustomMT, f, args...)
+@test_broken overlay(sin, 1.0) == cos(1.0) # Bug in inference, not using the method_table for initial lookup


This is due to jl_lookup_generic not taking into account the custom method table.

This is an alternative mechanism to #56650 that largely achieves the same result, but by hooking into `invoke` rather than a generated function. They are orthogonal mechanisms, and its possible we want both. However, in #56650, both Jameson and Valentin were skeptical of the generated function signature bottleneck. This PR is sort of a hybrid of mechanism in #52964 and what I proposed in #56650 (comment). In particular, this PR: 1. Extends `invoke` to support a CodeInstance in place of its usual `types` argument. 2. Adds a new `typeinf` optimized generic. The semantics of this optimized generic allow the compiler to instead call a companion `typeinf_edge` function, allowing a mid-inference interpreter switch (like #52964), without being forced through a concrete signature bottleneck. However, if calling `typeinf_edge` does not work (e.g. because the compiler version is mismatched), this still has well defined semantics, you just don't get inference support. The additional benefit of the `typeinf` optimized generic is that it lets custom cache owners tell the runtime how to "cure" code instances that have lost their native code. Currently the runtime only knows how to do that for `owner == nothing` CodeInstances (by re-running inference). This extension is not implemented, but the idea is that the runtime would be permitted to call the `typeinf` optimized generic on the dead CodeInstance's `owner` and `def` fields to obtain a cured CodeInstance (or a user-actionable error from the plugin). This PR includes an implementation of `with_new_compiler` from #56650. This PR includes just enough compiler support to make the compiler optimize this to the same code that #56650 produced: ``` julia> @code_typed with_new_compiler(sin, 1.0) CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_get_tls_world_age), UInt64, svec(), 0, :(:ccall)))::UInt64 │ %2 = builtin Core.getfield(args, 1)::Float64 │ %3 = invoke sin(%2::Float64)::Float64 └── return %3 ) => Float64 ``` However, the implementation here is extremely incomplete. I'm putting it up only as a directional sketch to see if people prefer it over #56650. If so, I would prepare a cleaned up version of this PR that has the optimized generics as well as the curing support, but not the full inference integration (which needs a fair bit more work).

This is an alternative mechanism to #56650 that largely achieves the same result, but by hooking into `invoke` rather than a generated function. They are orthogonal mechanisms, and its possible we want both. However, in #56650, both Jameson and Valentin were skeptical of the generated function signature bottleneck. This PR is sort of a hybrid of mechanism in #52964 and what I proposed in #56650 (comment). In particular, this PR: 1. Extends `invoke` to support a CodeInstance in place of its usual `types` argument. 2. Adds a new `typeinf` optimized generic. The semantics of this optimized generic allow the compiler to instead call a companion `typeinf_edge` function, allowing a mid-inference interpreter switch (like #52964), without being forced through a concrete signature bottleneck. However, if calling `typeinf_edge` does not work (e.g. because the compiler version is mismatched), this still has well defined semantics, you just don't get inference support. The additional benefit of the `typeinf` optimized generic is that it lets custom cache owners tell the runtime how to "cure" code instances that have lost their native code. Currently the runtime only knows how to do that for `owner == nothing` CodeInstances (by re-running inference). This extension is not implemented, but the idea is that the runtime would be permitted to call the `typeinf` optimized generic on the dead CodeInstance's `owner` and `def` fields to obtain a cured CodeInstance (or a user-actionable error from the plugin). This PR includes an implementation of `with_new_compiler` from #56650. That said, this PR does not yet include the compiler optimization that implements the semantics of the optimized generic, which will be in a follow up PR.

This is an alternative mechanism to #56650 that largely achieves the same result, but by hooking into `invoke` rather than a generated function. They are orthogonal mechanisms, and its possible we want both. However, in #56650, both Jameson and Valentin were skeptical of the generated function signature bottleneck. This PR is sort of a hybrid of mechanism in #52964 and what I proposed in #56650 (comment). In particular, this PR: 1. Extends `invoke` to support a CodeInstance in place of its usual `types` argument. 2. Adds a new `typeinf` optimized generic. The semantics of this optimized generic allow the compiler to instead call a companion `typeinf_edge` function, allowing a mid-inference interpreter switch (like #52964), without being forced through a concrete signature bottleneck. However, if calling `typeinf_edge` does not work (e.g. because the compiler version is mismatched), this still has well defined semantics, you just don't get inference support. The additional benefit of the `typeinf` optimized generic is that it lets custom cache owners tell the runtime how to "cure" code instances that have lost their native code. Currently the runtime only knows how to do that for `owner == nothing` CodeInstances (by re-running inference). This extension is not implemented, but the idea is that the runtime would be permitted to call the `typeinf` optimized generic on the dead CodeInstance's `owner` and `def` fields to obtain a cured CodeInstance (or a user-actionable error from the plugin). This PR includes an implementation of `with_new_compiler` from #56650. This PR includes just enough compiler support to make the compiler optimize this to the same code that #56650 produced: ``` julia> @code_typed with_new_compiler(sin, 1.0) CodeInfo( 1 ─ $(Expr(:foreigncall, :(:jl_get_tls_world_age), UInt64, svec(), 0, :(:ccall)))::UInt64 │ %2 = builtin Core.getfield(args, 1)::Float64 │ %3 = invoke sin(%2::Float64)::Float64 └── return %3 ) => Float64 ``` However, the implementation here is extremely incomplete. I'm putting it up only as a directional sketch to see if people prefer it over #56650. If so, I would prepare a cleaned up version of this PR that has the optimized generics as well as the curing support, but not the full inference integration (which needs a fair bit more work).

vchuravy added speculative Whether the change will be implemented is speculative compiler:plugins labels Jan 18, 2024

vchuravy force-pushed the vc/tagged_ci branch 2 times, most recently from 408391e to ef35f71 Compare February 7, 2024 21:33

aviatesk force-pushed the vc/tagged_ci branch from ef35f71 to 4f975b0 Compare February 8, 2024 16:16

Base automatically changed from vc/tagged_ci to master February 10, 2024 22:35

vchuravy force-pushed the vc/compiler_instance_v2 branch from 1726c2d to 2ce7b5e Compare February 11, 2024 21:34

vchuravy mentioned this pull request Feb 12, 2024

allow external absint to hold custom data in codeinst.inferred #53300

Merged

vchuravy force-pushed the vc/compiler_instance_v2 branch from 887526f to e825798 Compare February 22, 2024 01:29

vchuravy force-pushed the vc/compiler_instance_v2 branch from 0aaf405 to 4f16213 Compare March 21, 2024 19:35

vchuravy added 6 commits March 21, 2024 17:53

add CustomMethodTables plugin

56ed663

fixup! Scoped execution for foreign AbsInt code

5a7cf19

fixup! add CustomMethodTables plugin

36a0fcf

fixup! add CustomMethodTables plugin

219f278

fixup! add CustomMethodTables plugin

d6503ea

fixup! fixup! Scoped execution for foreign AbsInt code

4bf99ea

vchuravy commented Apr 24, 2024

View reviewed changes

WIP: allow retrieve_code_info to be intercepted

6d9f55d

vchuravy mentioned this pull request May 6, 2024

RFC: Refactor MethodInstance to allow for more general specialization #54373

Closed

fixup! WIP: allow retrieve_code_info to be intercepted

df26fc8

vchuravy mentioned this pull request Jul 16, 2024

[Experiment] Add deferred_with JuliaGPU/GPUCompiler.jl#599

Closed

vchuravy mentioned this pull request Sep 5, 2024

Prototype CompilerPlugins in GPUCompiler 😱 JuliaGPU/GPUCompiler.jl#625

Closed

vchuravy mentioned this pull request Nov 22, 2024

WIP: Allow generated functions to return a CodeInstance #56650

Closed

Keno mentioned this pull request Nov 23, 2024

Extend invoke to accept CodeInstance #56660

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support execution of code from external abstract interpreters #52964

Support execution of code from external abstract interpreters #52964

vchuravy commented Jan 18, 2024 •

edited

Loading

vchuravy commented Jan 18, 2024

vchuravy Apr 24, 2024 •

edited

Loading

Support execution of code from external abstract interpreters #52964

Are you sure you want to change the base?

Support execution of code from external abstract interpreters #52964

Conversation

vchuravy commented Jan 18, 2024 • edited Loading

TODO

Example: Cassette style tracer

vchuravy commented Jan 18, 2024

vchuravy Apr 24, 2024 • edited Loading

Choose a reason for hiding this comment

vchuravy commented Jan 18, 2024 •

edited

Loading

vchuravy Apr 24, 2024 •

edited

Loading