
codegen support for efficient union representations #20593

Merged 9 commits into master from jn/union-codegen on Feb 22, 2017

Conversation

vtjnash
Member

@vtjnash vtjnash commented Feb 13, 2017

This implements the codegen primitives and return-type calling convention to make moving around (generating, storing, inspecting, returning) inferred union types more efficient.

The tagged union is represented as a pair consisting of a memory pointer of unspecified size and a byte describing its type. The memory pointer may also be a jl_value_t* box, which will be reflected in its type tag. For more extensive details, see the commit message.

@vtjnash vtjnash requested a review from JeffBezanson February 13, 2017 04:26
@JeffBezanson
Member

Cool! First request: copy that commit message into the devdocs somewhere :)

@johnmyleswhite
Member

So excited to see this being worked on. Am I right to think there are still performance optimizations that can be made? For now, I'm using the following snippet to check the performance of union types:

function stable(n)
    s = 0.0
    for i in 1:n
        s += sin(1.0)
    end
    return s
end

function unstable(n)
    s = 0
    for i in 1:n
        s += sin(1.0)
    end
    return s
end

@time stable(10_000_000)
@time unstable(10_000_000)

Currently this branch produces very similar behavior as Julia 0.5 did: the union type of s in unstable causes about a 2x latency degradation and allocates a lot of memory.

@andyferris
Member

Very cool, Jameson. I'm guessing from the OP and John's comment that the remaining half of the work is to make unions of two isbits types isbits themselves? Would these be laid out like a variant/switch, or overlapped like a C union?

@vtjnash
Member Author

vtjnash commented Feb 13, 2017

That's a boring example. I can cut down the allocation count, but they don't really cost any time relative to sin. The performance variation seen here is entirely due to a processor bug / feature. If you @profile both, you'll see that >99% of time in all versions is spent in the sin kernel:

julia> @time stable(10_000_000) # comparison
  0.153794 seconds (5 allocations: 176 bytes)
8.414709847530957e6

julia> @time unstable(10_000_000) # altered version of PR
  0.318522 seconds (5 allocations: 176 bytes)
8.414709847530957e6

julia> @time unstable(10_000_000) # master last week
  0.365068 seconds (30.00 M allocations: 457.764 MiB, 3.54% gc time)
8.414709847530957e6

julia> @time unstable(10_000_000) # master today
  0.323274 seconds (10.00 M allocations: 152.588 MiB, 2.94% gc time)
8.414709847530957e6

A slightly more interesting version is:

julia> @noinline sumup(x::Int) = iseven(x) ? x / 2 : x + 3
sumup (generic function with 1 method)

julia> function f(c)
         x::Int = c
         while x != 1
           x = sumup(x)
         end
       end
f (generic function with 1 method)

julia> @time for i = 1:10^7; f(166138751); end # altered version of PR
  3.253064 seconds

julia> @time for i = 1:10^7; f(166138751); end # master
 11.707397 seconds (650.00 M allocations: 9.686 GiB, 2.39% gc time)

Although even here it is very difficult to craft a meaningful benchmark since most of the cost is in using a floating point representation, and the cost of allocation is relatively just noise:

julia> @noinline sumup(x) = (xi = unsafe_trunc(Int, x); iseven(xi) ? div(x, 2) : x + 3)
sumup (generic function with 1 method)

julia> function f(c)
         x = c
         while x != 1
           x = sumup(x)
         end
       end

julia> @time for i = 1:10^7; f(166138751); end
  0.977569 seconds

julia> @time for i = 1:10^7; f(166138751.0); end
  4.740574 seconds

@ararslan ararslan added compiler:codegen Generation of LLVM IR and native code types and dispatch Types, subtyping and method dispatch labels Feb 13, 2017
@johnmyleswhite
Member

That's a boring example. I can cut down the allocation count, but they don't really cost any time relative to sin. The performance variation seen here is entirely due to a processor bug / feature. If you @profile both, you'll see that >99% of time in all versions is spent in the sin kernel:

I'm confused: if 100% of the cost of evaluating both stable and unstable was the cost of repeatedly evaluating sin, wouldn't the two implementations both finish in 0.15 seconds since they both evaluate sin 10,000,000 times and only differ in steps other than the evaluation of sin? My mental model is that the cost should be total_time = time_in_sin + time_other and that the two functions should share time_in_sin, so there's clearly a lot of time spent in other steps, even if those steps aren't allocations as your example shows. What am I missing from that model of the cost of evaluating unstable?

@andyferris
Member

To be honest, I am also a little unclear what those benchmarks are showing...

(also - what is "altered version of PR"? is that work on the isbits case - you say "I can cut down the allocation count"?)

@vtjnash
Member Author

vtjnash commented Feb 14, 2017

My mental model is that the cost should be ...

That cost model only works at the macro scale. In this case, the cost model is basically P(processor detects the loop) + P(register stalls). And yes, that's a probability estimate, not a performance characteristic. There are some factors that are roughly controllable to increase P (fewer branches, a smaller loop, predictable memory accesses, the implementation of register dependencies / size of the out-of-order pipeline), but it's not entirely controllable (only Intel knows the implementation details, and they haven't shared anything, although we know that it differs very widely by processor microarchitecture family and node).

(Yeah, the altered case is some test code I put together that does some inference re-arrangements to prevent allocation in this case in exchange for worse code in the general case. Since it's just inference work, it doesn't really matter for this PR – I could have gotten the same result just by writing out some of the extra branches in those functions manually.)

@andyferris
Member

Interesting.

Are there more generic cases that we can test with this branch? Maybe DataArrays would be a good example of the kind of thing this PR will solve, but it seems to be broken on v0.6.

@vchuravy
Member

On 32bit windows:

julia: /home/travis/build/JuliaLang/julia/src/codegen.cpp:503: jl_cgval_t::jl_cgval_t(const jl_cgval_t&, jl_value_t*, llvm::Value*): Assertion `isboxed || v.typ == typ || tindex' failed.

I could spend a few cycles looking into this if we are pushing this for v0.6.

@andyferris
Member

andyferris commented Feb 14, 2017

Hmm... Is this a regression? My benchmark was a bit strange. Let's call this test "DataArrays lite":

immutable NA; end
Base.:+(::NA, ::Int) = NA()
Base.:+(::NA, ::NA) = NA()
Base.:+(::Int, ::NA) = NA()
n = 10_000_000
v1 = Vector{Union{Int,NA}}(n)
v2 = Vector{Union{Int,NA}}(n)
for i = 1:n
    if rand() < 0.6666
        v1[i] = rand(1:100)
    else
        v1[i] = NA()
    end
    if rand() < 0.6666
        v2[i] = rand(1:100)
    else
        v2[i] = NA()
    end
end
v1 + v2

And the benchmark results:

julia> @time v1 + v2;  # v0.5.0
  0.439152 seconds (10.00 M allocations: 228.874 MB, 24.30% gc time)

julia> @time v1 + v2;  # master
  0.510497 seconds (10.00 M allocations: 305.180 MiB, 23.54% gc time)

julia> @time v1 + v2;  # this PR
  0.747177 seconds (10.00 M allocations: 228.887 MiB, 17.05% gc time)

Does this make sense?

When the isbits stuff is done, I was expecting this to be within a factor of two (edit: 6?) of the Vector{Int} speed (0.031480 seconds (7 allocations: 76.294 MB, 6.53% gc time)).

EDIT: pre-allocating the output array v3:

julia> @time map!(+, v3, v1, v2);   # v0.5.0
  0.321886 seconds (10.00 M allocations: 152.580 MB, 2.38% gc time)

julia> @time map!(+, v3, v1, v2);   # master
  0.336431 seconds (10.00 M allocations: 152.580 MiB, 0.92% gc time)

julia> @time map!(+, v3, v1, v2);   # this PR
  0.394033 seconds (10.00 M allocations: 152.580 MiB, 1.68% gc time)

julia> @time map!(+, v3, v1, v2);   # Vector{Int} case, for comparison
  0.026852 seconds (4 allocations: 160 bytes)

@andyferris
Member

And finally, by using/extending nullables on v0.5:

function Base.:+(i1::Nullable{Int}, i2::Nullable{Int}) 
    if isnull(i1) || isnull(i2)
        return Nullable{Int}()
    else
        return Nullable(get(i1) + get(i2))
    end
end
v1 = Vector{Nullable{Int}}(n)
v2 = Vector{Nullable{Int}}(n)
v3 = Vector{Nullable{Int}}(n)
for i = 1:n
    if rand() < 0.6666
        v1[i] = Nullable(rand(1:100))
    else
        v1[i] = Nullable{Int}()
    end
    if rand() < 0.6666
        v2[i] = Nullable(rand(1:100))
    else
        v2[i] = Nullable{Int}()
    end
end

yields

julia> @time map!(+, v3, v1, v2);
  0.132412 seconds (4 allocations: 160 bytes)

which is about 6 times slower than the Int case and about 3 times faster than this PR currently (I didn't attempt to optimize that kernel for +).

@andreasnoack
Member

Maybe DataArrays would be a good example of the kind of things this PR will solve, but it seems to be broken on v0.6.

You can try this PR JuliaStats/DataArrays.jl#235

@vtjnash
Member Author

vtjnash commented Feb 14, 2017

@vchuravy I already looked at it – the assertion is wrong. I just need to push a fix to avoid calling it there.

@andyferris This PR is only for the implementation of the codegen support. It doesn't implement support for representing unions efficiently in vectors.

@andyferris
Copy link
Member

This PR is only for the implementation of the codegen support. It doesn't implement support for representing unions efficiently in vectors.

Right - I'm getting ahead of you (because this is exciting 😄 )

@ararslan ararslan added this to the 0.6.0 milestone Feb 16, 2017
@ararslan
Member

@nanosoldier runbenchmarks(ALL, vs = ":master")

@tkelman tkelman added the needs docs Documentation for this change is required label Feb 16, 2017
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@vtjnash
Member Author

vtjnash commented Feb 16, 2017

Wow cool, I've never created a 14552.04% slowdown before. I didn't even have to allocate anything!

@tkelman
Contributor

tkelman commented Feb 17, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

src/codegen.cpp Outdated
jl_cgval_t value; // a value, if the var is unboxed or SSA (and thus boxroot == NULL)
Instruction *boxroot; // an address, if the var might be in a jl_value_t** stack slot (marked tbaa_const, if appropriate)
jl_cgval_t value; // a stack slot or constant value
Value *pTIndex; // where the current value is stored
Member Author

this has the wrong comment

@vtjnash
Member Author

vtjnash commented Feb 21, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@JeffBezanson
Member

JeffBezanson commented Feb 21, 2017

Based on a discussion with @vtjnash yesterday, my understanding was that there was a ~10% slowdown in the raytracer benchmark (in a marginal case involving type-unstable code) due to the new calling convention for functions with Union return types. However it looks like the latest commit has fixed that. If so, that's awesome and this LGTM.

@JeffBezanson
Member

Not related to the changes here, but I'm confused by the comment on the tbaa field of cgval_t:

    MDNode *tbaa; // The related tbaa node. Non-NULL iff this is not a pointer.
    bool ispointer() const
    {
        return tbaa != nullptr;
    }

The comment says non-NULL if this is not a pointer, but when it's non-null ispointer returns true! My guess is that the function should be called something like hasaddress instead? ispointer makes me think the value is represented as a jl_value_t*.

@vtjnash
Member Author

vtjnash commented Feb 21, 2017

I think there's just an unintended extra "not" in that sentence. A jl_value_t would have ispointer and isboxed set. It tells you that you can directly ask for the address of the value / field of the value.

with names, via the usual symbol resolution mechanism in the linker.

Note too that ccall functions are also handled separately,
via a manual GOT + PLT.
Contributor

@tkelman tkelman Feb 21, 2017

spell out acronyms

Member Author

standard jargon exists for a reason

Contributor

what, to confuse people? this is documentation, very welcome to have it at all, just asking for it to be made slightly clearer

Member Author

because Global Offset Table is not actually clearer, just longer

Contributor

@tkelman tkelman Feb 21, 2017

yes it is clearer - I can look that up and go find out more about that if it's clear that's what's being referred to, unlike GOT on its own without any explanation, where first the reader needs to figure out what the acronym stands for

Contributor

it's completely standard practice to spell acronyms out the first time you use them, then use the acronym from then on if you must http://blog.apastyle.org/apastyle/abbreviations/#Q1

otherwise the term serves no purpose to anyone who doesn't know what you're referring to

Member

@ararslan ararslan Feb 22, 2017

There are also folks like me who don't have a solid background with this stuff; I didn't know that GOT was Global Offset Table, and I still don't know what PLT is. Big +1 for spelling out acronyms, at least the first time they're used. I think that would help a lot.

Member

@StefanKarpinski StefanKarpinski Feb 22, 2017

@vtjnash: In any writing, you should spell out the first usage, and put the abbreviation in parens. From then on in the same document you may refer to it by abbreviation. This is a quite universal standard in writing, and claiming otherwise is simply wrong. If you're feeling extra helpful to your reader, you can also link to the wikipedia article on the topic from the first (spelled out) usage of the term.

Member

That said, it's better to have this documentation and others can fix up the formatting and writing as long as we know what it means. I would not have known what GOT stood for without asking you, however. PLT I happen to know.

Member

GOT is a pretty standard acronym though... Game of Thrones of course.

`mark_julia_type` (for immediate values) and `mark_julia_slot` (for pointers to values).

The function `convert_julia_type` can transform between any two types.
When it returns and returns an `cgval.typ` set to `typ`.
Contributor

unclear - reword?

primitives to implement union-splitting.

The representation of the tagged-union is as a pair
of < void* union, byte selector >.
Contributor

code highlight ?

It records the one-based depth-first count into the type-union of the
isbits objects inside. An index of zero indicates that the `union*` is
actually a tagged heap-allocated `jl_value_t*`,
and needs to treated as normal for a boxed object rather than as a
Contributor

needs to be treated as normal

tagged union.

The high bit of the selector (`byte & 0x80`) can be tested to determine if the
`void*` if the `void*` is actually a heap-allocated box,
Contributor

duplicate "if the void*"

- Tuples of VecElement types get passed in vector registers.
- Structs get passed on the stack.
- Return values are handle similarly to arguments,
with a size-cutoff at which they will instead by returned via a hidden sret argument.
Contributor

will instead be returned

@tkelman tkelman removed the needs docs Documentation for this change is required label Feb 21, 2017
@JeffBezanson
Member

Thanks for adding the devdocs. That is very helpful.

Note that extern functions are handled separately,
with names, via the usual symbol resolution mechanism in the linker.

Note too that ccall functions are also handled separately,
Member

Perhaps code format ccall, i.e. ccall?

When it returns and returns an `cgval.typ` set to `typ`.
It'll cast the object to the requested representation.
It'll make boxes, allocate stack copies, and compute tagged unions as
needed to perform the request.
Member

@ararslan ararslan Feb 21, 2017

These sentences could be combined, e.g. "It will cast the object to the requested representation, making boxes, allocating stack copies, and computing tagged unions [in the process] as needed."

Member Author

I'm not sure that necessarily improves readability – it just seems like an even trade, imo.

Contributor

if these later "it will" sentences are supposed to be connected to the "when it returns" fragment, that could be made clearer either in sentence structure or markdown formatting

- emit_sizeof
- boxed
- unbox
- specialized cc-ret
Member

Would be nice if you could code format these

Member Author

I'm generally reluctant to code-format every word that also appears in code. In part because it is not really standard grammar, and in part because I try to avoid having every other word be bracketed :P

Member

I just mean the things that only refer to functions or other code-specific thingamabobs that aren't part of normal prose. But if you're happy with it as it is, that's fine with me.

thus avoiding the cost of re-allocating a box,
while maintaining the ability to efficiently handle union-splitting based on the low bits.

It is guaranteed that `byte & 0x7f` is an exact test for the type,
Member

I think either "since" should follow the comma, or the comma should be replaced by a semicolon.

doc/src/index.md Outdated
@@ -72,6 +72,7 @@
* Documentation of Julia's Internals
* [Initialization of the Julia runtime](@ref)
* [Eval of Julia code](@ref)
* [High-level Overview of the Native-Code Generation Process](@ref)
Member

I might put this after the memory layout section, since that and the preceding sections are helpful to understand prior to reading this (or at least it seems that way to me)

starting a document describing the overall code-generator structure
with names, via the usual symbol resolution mechanism in the linker.

Note too that ccall functions are also handled separately,
via a manual GOT + PLT.
Member

@StefanKarpinski StefanKarpinski Feb 22, 2017

@vtjnash: In any writing, you should spell out the first usage, and put the abbreviation in parens. From then on in the same document you may refer to it by abbreviation. This is a quite universal standard in writing, and claiming otherwise is simply wrong. If you're feeling extra helpful to your reader, you can also link to the wikipedia article on the topic from the first (spelled out) usage of the term.

@StefanKarpinski
Member

I've changed my review to "approve" on the premise that these docs are incomplete and when this is all done, @vtjnash will spell out the first instance of each acronym and link to the wikipedia page.

@vtjnash vtjnash merged commit fa167f5 into master Feb 22, 2017
@vtjnash vtjnash deleted the jn/union-codegen branch February 22, 2017 19:45
# High-level Overview of the Native-Code Generation Process


<placeholder>
Member

Are you planning to include something here in this PR or in a subsequent one?

Member

Subsequent one it is 😄

Member Author

Yeah, I put it there so that nobody could complain that the document was immensely incomplete at describing codegen, and just jumps to explain a few small areas. It almost worked :)

@maleadt
Member

maleadt commented Mar 9, 2017

@vtjnash: what needs to happen for that /*TODO: min_align*/1 -- where does the alignment information need to come from?

I'm debugging an issue where a hot `size` call on an array produces a load with alignment 1, which is very costly on GPU:

@inline foo(t) = 1 < 2 ? t[1] : 1   # might seem idiotic, but resembles Base.size
bar(t) = foo(t)
code_llvm(bar, (NTuple{1,Int},))

%2 = load i64, i64* %1, align 1

(funny how this triggers different code paths, but that's in part due to #17880 I guess)

@vtjnash
Member Author

vtjnash commented Mar 9, 2017

Likely need to track where the pointer came from in the jl_cgval_t, or make some conservative estimates like we do everywhere else. I didn't do that already mostly because x86 doesn't care.
