
codegen support for efficient union representations #20593

Merged 9 commits into master from jn/union-codegen on Feb 22, 2017

Conversation

vtjnash
Member

@vtjnash vtjnash commented Feb 13, 2017

This implements the codegen primitives and return-type calling convention to make moving around (generating, storing, inspecting, returning) inferred union types more efficient.

The tagged union is represented as a pair consisting of a memory pointer of unspecified size and a byte describing its type. The memory pointer may also be a jl_value_t* box, which will be reflected in its type tag. For more extensive details, see the commit message.

@vtjnash vtjnash requested a review from JeffBezanson February 13, 2017 04:26
@JeffBezanson
Member

Cool! First request: copy that commit message into the devdocs somewhere :)

@johnmyleswhite
Member

So excited to see this being worked on. Am I right to think there are still performance optimizations that can be made? For now, I'm using the following snippet to check the performance of union types:

function stable(n)
    s = 0.0
    for i in 1:n
        s += sin(1.0)
    end
    return s
end

function unstable(n)
    s = 0
    for i in 1:n
        s += sin(1.0)
    end
    return s
end

@time stable(10_000_000)
@time unstable(10_000_000)

Currently this branch produces very similar behavior as Julia 0.5 did: the union type of s in unstable causes about a 2x latency degradation and allocates a lot of memory.

@andyferris
Member

Very cool, Jameson. I'm guessing from the OP and John's comment that the remaining half of the work is to make unions of two isbits types isbits themselves? Would these be laid out like a variant/switch, or overlapped like a C union?

@vtjnash
Member Author

vtjnash commented Feb 13, 2017

That's a boring example. I can cut down the allocation count, but they don't really cost any time relative to sin. The performance variation seen here is entirely due to a processor bug / feature. If you @profile both, you'll see that >99% of time in all versions is spent in the sin kernel:

julia> @time stable(10_000_000) # comparison
  0.153794 seconds (5 allocations: 176 bytes)
8.414709847530957e6

julia> @time unstable(10_000_000) # altered version of PR
  0.318522 seconds (5 allocations: 176 bytes)
8.414709847530957e6

julia> @time unstable(10_000_000) # master last week
  0.365068 seconds (30.00 M allocations: 457.764 MiB, 3.54% gc time)
8.414709847530957e6

julia> @time unstable(10_000_000) # master today
  0.323274 seconds (10.00 M allocations: 152.588 MiB, 2.94% gc time)
8.414709847530957e6

A slightly more interesting version is:

julia> @noinline sumup(x::Int) = iseven(x) ? x / 2 : x + 3
sumup (generic function with 1 method)

julia> function f(c)
         x::Int = c
         while x != 1
           x = sumup(x)
         end
       end
f (generic function with 1 method)

julia> @time for i = 1:10^7; f(166138751); end # altered version of PR
  3.253064 seconds

julia> @time for i = 1:10^7; f(166138751); end # master
 11.707397 seconds (650.00 M allocations: 9.686 GiB, 2.39% gc time)

Although even here it is very difficult to craft a meaningful benchmark since most of the cost is in using a floating point representation, and the cost of allocation is relatively just noise:

julia> @noinline sumup(x) = (xi = unsafe_trunc(Int, x); iseven(xi) ? div(x, 2) : x + 3)
sumup (generic function with 1 method)

julia> function f(c)
         x = c
         while x != 1
           x = sumup(x)
         end
       end

julia> @time for i = 1:10^7; f(166138751); end
  0.977569 seconds

julia> @time for i = 1:10^7; f(166138751.0); end
  4.740574 seconds

@ararslan ararslan added compiler:codegen Generation of LLVM IR and native code types and dispatch Types, subtyping and method dispatch labels Feb 13, 2017
@johnmyleswhite
Member

That's a boring example. I can cut down the allocation count, but they don't really cost any time relative to sin. The performance variation seen here is entirely due to a processor bug / feature. If you @profile both, you'll see that >99% of time in all versions is spent in the sin kernel:

I'm confused: if 100% of the cost of evaluating both stable and unstable was the cost of repeatedly evaluating sin, wouldn't the two implementations both finish in 0.15 seconds since they both evaluate sin 10,000,000 times and only differ in steps other than the evaluation of sin? My mental model is that the cost should be total_time = time_in_sin + time_other and that the two functions should share time_in_sin, so there's clearly a lot of time spent in other steps, even if those steps aren't allocations as your example shows. What am I missing from that model of the cost of evaluating unstable?

@andyferris
Member

To be honest, I am also a little unclear what those benchmarks are showing...

(also - what is "altered version of PR"? is that work on the isbits case - you say "I can cut down the allocation count"?)

@vtjnash
Member Author

vtjnash commented Feb 14, 2017

My mental model is that the cost should be ...

That cost model only works at the macro scale. In this case, the cost model is basically P(processor detects the loop) + P(register stalls). And yes, that's a probability estimate, not a performance characteristic. There are some factors that are roughly controllable to increase P (fewer branches, a smaller loop, predictable memory accesses, the implementation of register dependencies / size of the out-of-order pipeline), but it's not entirely controllable (only Intel knows the implementation details, and they haven't shared anything, although we know that it differs very widely by processor microarchitecture family and node).

(Yeah, the altered case is some test code I put together that does some inference re-arrangements to prevent allocation in this case in exchange for worse code in the general case. Since it's just inference work, it doesn't really matter for this PR – I could have gotten the same result just by writing out some of the extra branches in those functions manually.)

@andyferris
Member

Interesting.

Are there more generic cases that we can test with this branch? Maybe DataArrays would be a good example of the kind of thing this PR will solve, but it seems to be broken on v0.6.

@vchuravy
Member

On 32bit windows:

julia: /home/travis/build/JuliaLang/julia/src/codegen.cpp:503: jl_cgval_t::jl_cgval_t(const jl_cgval_t&, jl_value_t*, llvm::Value*): Assertion `isboxed || v.typ == typ || tindex' failed.

I could spend a few cycles looking into this if we are pushing this for v0.6.

@andyferris
Member

andyferris commented Feb 14, 2017

Hmm... Is this a regression? My benchmark was a bit strange. Let's call this test "DataArrays lite":

immutable NA; end
Base.:+(::NA, ::Int) = NA()
Base.:+(::NA, ::NA) = NA()
Base.:+(::Int, ::NA) = NA()
n = 10_000_000
v1 = Vector{Union{Int,NA}}(n)
v2 = Vector{Union{Int,NA}}(n)
for i = 1:n
    if rand() < 0.6666
        v1[i] = rand(1:100)
    else
        v1[i] = NA()
    end
    if rand() < 0.6666
        v2[i] = rand(1:100)
    else
        v2[i] = NA()
    end
end
v1 + v2

And the benchmark results:

julia> @time v1 + v2;  # v0.5.0
  0.439152 seconds (10.00 M allocations: 228.874 MB, 24.30% gc time)

julia> @time v1 + v2;  # master
  0.510497 seconds (10.00 M allocations: 305.180 MiB, 23.54% gc time)

julia> @time v1 + v2;  # this PR
  0.747177 seconds (10.00 M allocations: 228.887 MiB, 17.05% gc time)

Does this make sense?

When the isbits stuff is done, I was expecting this to be within a factor of two (edit: 6?) of the Vector{Int} speed (0.031480 seconds (7 allocations: 76.294 MB, 6.53% gc time)).

EDIT: pre-allocating the output array v3:

julia> @time map!(+, v3, v1, v2);   # v0.5.0
  0.321886 seconds (10.00 M allocations: 152.580 MB, 2.38% gc time)

julia> @time map!(+, v3, v1, v2);   # master
  0.336431 seconds (10.00 M allocations: 152.580 MiB, 0.92% gc time)

julia> @time map!(+, v3, v1, v2);   # this PR
  0.394033 seconds (10.00 M allocations: 152.580 MiB, 1.68% gc time)

julia> @time map!(+, v3, v1, v2);   # Vector{Int} case, for comparison
  0.026852 seconds (4 allocations: 160 bytes)

@andyferris
Member

And finally, by using/extending nullables on v0.5:

function Base.:+(i1::Nullable{Int}, i2::Nullable{Int}) 
    if isnull(i1) || isnull(i2)
        return Nullable{Int}()
    else
        return Nullable(get(i1) + get(i2))
    end
end
v1 = Vector{Nullable{Int}}(n)
v2 = Vector{Nullable{Int}}(n)
v3 = Vector{Nullable{Int}}(n)
for i = 1:n
    if rand() < 0.6666
        v1[i] = Nullable(rand(1:100))
    else
        v1[i] = Nullable{Int}()
    end
    if rand() < 0.6666
        v2[i] = Nullable(rand(1:100))
    else
        v2[i] = Nullable{Int}()
    end
end

yields

julia> @time map!(+, v3, v1, v2);
  0.132412 seconds (4 allocations: 160 bytes)

which is about 6 times slower than the Int case and about 3 times faster than this PR currently (I didn't attempt to optimize that kernel for +).

@andreasnoack
Member

Maybe DataArrays would be a good example of the kind of things this PR will solve, but it seems to be broken on v0.6.

You can try this PR JuliaStats/DataArrays.jl#235

@vtjnash
Member Author

vtjnash commented Feb 14, 2017

@vchuravy I already looked at it – the assertion is wrong. I just need to push a fix to avoid calling it there.

@andyferris This PR is only for the implementation of the codegen support. It doesn't implement support for representing unions efficiently in vectors.

@andyferris
Copy link
Member

This PR is only for the implementation of the codegen support. It doesn't implement support for representing unions efficiently in vectors.

Right - I'm getting ahead of you (because this is exciting 😄 )

@ararslan ararslan added this to the 0.6.0 milestone Feb 16, 2017
@ararslan
Member

@nanosoldier runbenchmarks(ALL, vs = ":master")

@tkelman tkelman added the needs docs Documentation for this change is required label Feb 16, 2017
@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@vtjnash
Member Author

vtjnash commented Feb 16, 2017

Wow cool, I've never created a 14552.04% slowdown before. I didn't even have to allocate anything!

@tkelman
Contributor

tkelman commented Feb 17, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

src/codegen.cpp Outdated
jl_cgval_t value; // a value, if the var is unboxed or SSA (and thus boxroot == NULL)
Instruction *boxroot; // an address, if the var might be in a jl_value_t** stack slot (marked tbaa_const, if appropriate)
jl_cgval_t value; // a stack slot or constant value
Value *pTIndex; // where the current value is stored
Member Author

this has the wrong comment

@vtjnash
Member Author

vtjnash commented Feb 21, 2017

@nanosoldier runbenchmarks(ALL, vs = ":master")

@nanosoldier
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@JeffBezanson
Member

JeffBezanson commented Feb 21, 2017

Based on a discussion with @vtjnash yesterday, my understanding was that there was a ~10% slowdown in the raytracer benchmark (in a marginal case involving type-unstable code) due to the new calling convention for functions with Union return types. However it looks like the latest commit has fixed that. If so, that's awesome and this LGTM.

@JeffBezanson
Member

Not related to the changes here, but I'm confused by the comment on the tbaa field of cgval_t:

    MDNode *tbaa; // The related tbaa node. Non-NULL iff this is not a pointer.
    bool ispointer() const
    {
        return tbaa != nullptr;
    }

The comment says non-NULL if this is not a pointer, but when it's non-null ispointer returns true! My guess is that the function should be called something like hasaddress instead? ispointer makes me think the value is represented as a jl_value_t*.

@vtjnash
Member Author

vtjnash commented Feb 21, 2017

I think there's just an unintended extra "not" in that sentence. A jl_value_t would have ispointer and isboxed set. It tells you that you can directly ask for the address of the value / field of the value.

with names, via the usual symbol resolution mechanism in the linker.

Note too that ccall functions are also handled separately,
via a manual GOT + PLT.
Contributor

@tkelman tkelman Feb 21, 2017

spell out acronyms

Member Author

standard jargon exists for a reason

Contributor

what, to confuse people? this is documentation, very welcome to have it at all, just asking for it to be made slightly clearer

Member Author

because Global Offset Table is not actually clearer, just longer

Contributor

@tkelman tkelman Feb 21, 2017

yes it is clearer - I can look that up and go find out more about that if it's clear that's what's being referred to, unlike GOT on its own without any explanation, where first the reader needs to figure out what the acronym stands for

Contributor

it's completely standard practice to spell acronyms out the first time you use them, then use the acronym from then on if you must http://blog.apastyle.org/apastyle/abbreviations/#Q1

otherwise the term serves no purpose to anyone who doesn't know what you're referring to

Member

@ararslan ararslan Feb 22, 2017

There are also folks like me who don't have a solid background with this stuff; I didn't know that GOT was Global Offset Table, and I still don't know what PLT is. Big +1 for spelling out acronyms, at least the first time they're used. I think that would help a lot.

Member

@StefanKarpinski StefanKarpinski Feb 22, 2017

@vtjnash: In any writing, you should spell out the first usage, and put the abbreviation in parens. From then on in the same document you may refer to it by abbreviation. This is a quite universal standard in writing, and claiming otherwise is simply wrong. If you're feeling extra helpful to your reader, you can also link to the wikipedia article on the topic from the first (spelled out) usage of the term.

Member

That said, it's better to have this documentation and others can fix up the formatting and writing as long as we know what it means. I would not have known what GOT stood for without asking you, however. PLT I happen to know.

Member

GOT is a pretty standard acronym though... Game of Thrones of course.

`mark_julia_type` (for immediate values) and `mark_julia_slot` (for pointers to values).

The function `convert_julia_type` can transform between any two types.
When it returns and returns an `cgval.typ` set to `typ`.
Contributor

unclear - reword?

primitives to implement union-splitting.

The representation of the tagged-union is as a pair
of < void* union, byte selector >.
Contributor

code highlight ?

It records the one-based depth-first count into the type-union of the
isbits objects inside. An index of zero indicates that the `union*` is
actually a tagged heap-allocated `jl_value_t*`,
and needs to treated as normal for a boxed object rather than as a
Contributor

needs to be treated as normal

tagged union.

The high bit of the selector (`byte & 0x80`) can be tested to determine if the
`void*` if the `void*` is actually a heap-allocated box,
Contributor

duplicate "if the void*"

- Tuples of VecElement types get passed in vector registers.
- Structs get passed on the stack.
- Return values are handle similarly to arguments,
with a size-cutoff at which they will instead by returned via a hidden sret argument.
Contributor

will instead be returned

@tkelman tkelman removed the needs docs Documentation for this change is required label Feb 21, 2017
@JeffBezanson
Member

Thanks for adding the devdocs. That is very helpful.

Note that extern functions are handled separately,
with names, via the usual symbol resolution mechanism in the linker.

Note too that ccall functions are also handled separately,
Member

Perhaps code format ccall, i.e. ccall?

When it returns and returns an `cgval.typ` set to `typ`.
It'll cast the object to the requested representation.
It'll make boxes, allocate stack copies, and compute tagged unions as
needed to perform the request.
Member

@ararslan ararslan Feb 21, 2017

These sentences could be combined, e.g. "It will cast the object to the requested representation, making boxes, allocating stack copies, and computing tagged unions [in the process] as needed."

Member Author

I'm not sure that necessarily improves readability – it just seems like an even trade, imo.

Contributor

if these later "it will" sentences are supposed to be connected to the "when it returns" fragment, that could be made clearer either in sentence structure or markdown formatting

- emit_sizeof
- boxed
- unbox
- specialized cc-ret
Member

Would be nice if you could code format these

Member Author

I'm generally reluctant to code-format every word that also appears in code. In part because it is not really standard grammar, and in part because I try to avoid having every other word be bracketed :P

Member

I just mean the things that only refer to functions or other code-specific thingamabobs that aren't part of normal prose. But if you're happy with it as it is, that's fine with me.

thus avoiding the cost of re-allocating a box,
while maintaining the ability to efficiently handle union-splitting based on the low bits.

It is guaranteed that `byte & 0x7f` is an exact test for the type,
Member

I think either "since" should follow the comma, or the comma should be replaced by a semicolon.

doc/src/index.md Outdated
@@ -72,6 +72,7 @@
* Documentation of Julia's Internals
* [Initialization of the Julia runtime](@ref)
* [Eval of Julia code](@ref)
* [High-level Overview of the Native-Code Generation Process](@ref)
Member

I might put this after the memory layout section, since that and the preceding sections are helpful to understand prior to reading this (or at least it seems that way to me)

starting a document describing the overall code-generator structure
with names, via the usual symbol resolution mechanism in the linker.

Note too that ccall functions are also handled separately,
via a manual GOT + PLT.
Member

@StefanKarpinski StefanKarpinski Feb 22, 2017

@vtjnash: In any writing, you should spell out the first usage, and put the abbreviation in parens. From then on in the same document you may refer to it by abbreviation. This is a quite universal standard in writing, and claiming otherwise is simply wrong. If you're feeling extra helpful to your reader, you can also link to the wikipedia article on the topic from the first (spelled out) usage of the term.

@StefanKarpinski
Member

I've changed my review to "approve" on the premise that these docs are incomplete and when this is all done, @vtjnash will spell out the first instance of each acronym and link to the wikipedia page.

@vtjnash vtjnash merged commit fa167f5 into master Feb 22, 2017
@vtjnash vtjnash deleted the jn/union-codegen branch February 22, 2017 19:45
# High-level Overview of the Native-Code Generation Process


<placeholder>
Member

Are you planning to include something here in this PR or in a subsequent one?

Member

Subsequent one it is 😄

Member Author

Yeah, I put it there so that nobody could complain that the document was immensely incomplete at describing codegen, and just jumps to explain a few small areas. It almost worked :)

@maleadt
Member

maleadt commented Mar 9, 2017

@vtjnash: what needs to happen for that /*TODO: min_align*/1 -- where does the alignment information need to come from?

I'm debugging an issue where a hot `size` call on an array produces a load with alignment 1, which is very costly on GPU:

@inline foo(t) = 1 < 2 ? t[1] : 1   # might seem idiotic, but resembles Base.size
bar(t) = foo(t)
code_llvm(bar, (NTuple{1,Int},))

%2 = load i64, i64* %1, align 1

(funny how this triggers different code paths, but that's in part due to #17880 I guess)

@vtjnash
Member Author

vtjnash commented Mar 9, 2017

Likely need to track where the pointer came from in the jl_cgval_t, or make some conservative estimates like we do everywhere else. I didn't do that already mostly because x86 doesn't care.
