WIP: Incremental GC #5227

carnaval · 2013-12-23T19:58:02Z

This is work I started when hitting long GC pauses in a soft realtime application. No matter how low you can bring the allocation count down (and the language admittedly makes it easy), there is still a point where you must wait for tens of ms, which is a pain in, e.g., render code at 60 fps.

To be clear this is only incremental marking, but incremental sweeping should be much easier.
This patch:

Adds a new GC bit in the type field of each Julia object, which records whether it is scheduled for marking (the so-called grey bit)
Adds a write barrier to all pointer stores to GC-managed slots (in the C runtime and codegen)
Makes the GC able to run a partial marking phase
Adds a verifier which can run before each sweep to check that we did not miss a write.

Some points :

The write barrier is explicit in C code. Most of it looks like :

a->b = ...;
gc_wb(a, a->b);

This is ugly and could be resolved with a macro (something like gc_store(a, b, ...);) but I was not sold on changing such pervasive syntax.

The current heuristics are quite bad: the collect interval is simply divided by 10 and partial marking works with an increasing timeout (1ms, then 1.3*previous_timeout). I expect some low-hanging fruit for efficiency here, if only because I chose those constants arbitrarily ;)
Performance: nothing is free, so there is a perf hit. I believe it could be brought down, as I was mainly focused on getting it correct until recently. My measurements are from running the full test suite. It takes about 10% more time to run, but the pause times (for the mark phase) go down from max 40ms, 95% under 30ms to max 5ms, 95% under 2ms.
Even if low-pause is not a focus point for the Julia community ATM, the write barrier can be used to make the GC generational. In fact, you could turn this PR into a poor man's generational GC simply by not resetting the marks at the end of a sweep. I did not experiment in this direction yet, but doing it properly would require the GC to support moving, which is its own can of worms.
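As a rough illustration of the grey bit and the explicit `gc_wb(a, a->b)` call described above, here is a minimal sketch in C. It is a toy model, not the code from this patch: the `obj_t` layout, the `GC_MARKED`/`GC_GREY` flags, the fixed-size `mark_queue`, and the choice of a forward barrier (queueing the child) are all assumptions made for the example. Only the name `gc_wb` and its call shape come from the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical header bits; the patch keeps its bits in the type field. */
enum { GC_MARKED = 1, GC_GREY = 2 };

typedef struct obj {
    unsigned gc_bits;
    struct obj *field;          /* a single GC-managed slot, for brevity */
} obj_t;

#define QUEUE_CAP 64
obj_t *mark_queue[QUEUE_CAP];   /* objects scheduled for marking */
size_t mark_queue_len = 0;

/* Schedule an object for marking unless it is already queued. */
void gc_push_grey(obj_t *o)
{
    if (o->gc_bits & GC_GREY)
        return;
    assert(mark_queue_len < QUEUE_CAP);
    o->gc_bits |= GC_GREY;
    mark_queue[mark_queue_len++] = o;
}

/* Write barrier, called right after storing `child` into a slot of
   `parent`. If an already-marked parent now points to an unmarked
   child, a partial marker could miss the child, so it is queued. */
void gc_wb(obj_t *parent, obj_t *child)
{
    if (child != NULL &&
        (parent->gc_bits & GC_MARKED) &&
        !(child->gc_bits & GC_MARKED))
        gc_push_grey(child);
}
```

In this sketch a store into an unmarked parent costs only the flag tests, and the grey bit keeps repeated stores of the same child from growing the queue.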

To enable, build with -DGC_INC. GC_VERIFY enables the verifier and GC_TIME prints pause timings.

Of course, feedback is much appreciated. Cheers !

JeffBezanson · 2013-12-23T20:16:00Z

Wow, this is really impressive. It would be great to have the option to build with a low-pause GC for users who might want that. Having multiple GCs available is really cool.

I'm planning to work on a new default GC as well. It also requires a write barrier, so that work will not be wasted.

What I want to try is taking advantage of the fact that garbage is very likely to have been referenced only from the stack (i.e. temporary values). So I'd set a bit on heap-referenced values, then during a "minor" GC I'd only mark objects currently on the stack, then do a full sweep. That skips tracing all heap objects, which takes most of the time in my experiments. It seems like you'd be an ideal person to evaluate this idea.
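One way to read the idea above, sketched in C under heavy simplification. Everything here is hypothetical: the `obj_t` layout, the `heap_ref` bit, `store_field`, and `minor_gc` are names invented for the example, and a real collector would not keep all objects in one flat array.

```c
#include <assert.h>
#include <stddef.h>

typedef struct obj {
    int marked;        /* is a stack root in the current minor GC */
    int heap_ref;      /* set once the object is stored into a heap slot */
    int live;          /* cleared when the object is swept */
    struct obj *field;
} obj_t;

/* Store barrier: anything stored into a heap object gets flagged. */
void store_field(obj_t *parent, obj_t *child)
{
    assert(parent != NULL);
    parent->field = child;
    if (child != NULL)
        child->heap_ref = 1;
}

/* Minor collection: mark only the stack roots themselves (no tracing),
   then sweep every object. Heap-referenced objects are kept
   conservatively. Returns the number of objects freed. */
size_t minor_gc(obj_t **roots, size_t nroots, obj_t *all, size_t nall)
{
    size_t freed = 0;
    for (size_t i = 0; i < nroots; i++)
        roots[i]->marked = 1;
    for (size_t i = 0; i < nall; i++) {
        if (all[i].live && !all[i].marked && !all[i].heap_ref) {
            all[i].live = 0;
            freed++;
        }
        all[i].marked = 0;   /* reset for the next collection */
    }
    return freed;
}
```

The sketch never clears `heap_ref`, so an object that was once stored into the heap survives every minor collection; reclaiming those would still need a full, tracing collection.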

StefanKarpinski · 2013-12-23T21:39:19Z

Very cool work. I worry about having multiple GCs because each one might not get enough testing to be really reliable. It would be ideal, imo, to have a single standard GC that can meet everyone's needs, perhaps with a little assist from the high-level coding style (e.g. if you have real-time requirements, always pre-allocate temporary arrays). On the other hand, GC might be an area where it's not actually possible to satisfy everyone with one system.

JeffBezanson · 2013-12-23T22:01:39Z

Whether you care about pause times is a pretty hard dividing line in the GC design space. You can't address this by changing how user code is written, because you also call libraries, and our compiler also allocates objects if it wants to. Even if our GC is so good it takes 1ms 99.9% of the time, that still isn't low-latency. The 0.1% ruins it. Limited pause time is a harsh requirement.

@carnaval Is there an argument that this eventually collects all garbage? I believe it might but it's not obvious. For example, say you constantly add new nodes to a large tree, and on rare occasion unreference a big chunk of it. Will the GC always use up its allotted time dealing with the new updates?

StefanKarpinski · 2013-12-23T22:11:30Z

Whether you care about pause times is a pretty hard dividing line in the GC design space.

Yes, that means requirements on the "one GC" would be pretty severe – high enough throughput for general usage and guaranteed low-latency for real-time requirements. Not even sure if that's doable and the alternative would be having a small set (maybe only two) of GCs that can be swapped out.

carnaval · 2013-12-23T22:35:11Z

It should collect all garbage eventually because I increase the timeout if the mark queue is not shrinking, until a cutoff value where I do a full collection.
As I said, those heuristics are pretty bad IMO and this would be the next thing I'm looking into.
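A minimal sketch in C of the escalation just described: grow the budget while the mark queue is not shrinking, and fall back to a full collection past a cutoff. The 1 ms start and the 1.3 factor come from the PR description; the 50 ms cutoff, the `inc_state_t` struct, and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define INITIAL_TIMEOUT_MS 1.0
#define TIMEOUT_GROWTH     1.3
#define FULL_COLLECT_MS    50.0   /* hypothetical cutoff */

typedef struct {
    double timeout_ms;      /* budget for the next partial mark */
    size_t last_queue_len;  /* mark queue length after the previous step */
    int    want_full;       /* set when the next collection should be full */
} inc_state_t;

void inc_state_init(inc_state_t *s)
{
    s->timeout_ms = INITIAL_TIMEOUT_MS;
    s->last_queue_len = 0;
    s->want_full = 0;
}

/* Call after each partial marking step with the current queue length. */
void inc_state_update(inc_state_t *s, size_t queue_len)
{
    assert(s->timeout_ms > 0.0);
    if (queue_len > 0 && queue_len >= s->last_queue_len) {
        /* No progress: give the next step a larger budget. */
        s->timeout_ms *= TIMEOUT_GROWTH;
        if (s->timeout_ms >= FULL_COLLECT_MS)
            s->want_full = 1;
    }
    s->last_queue_len = queue_len;
}
```

Because the budget grows geometrically, a mutator that keeps refilling the queue can only delay the full collection for a bounded number of steps, which is the termination argument asked about above.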

I'd be happy to do some tests on different ways to do quick collections after this. A first step would be having some basic infrastructure to measure differences in GC pause/throughput. ATM I have some basic scripts that need some cleanup, but I believe it would be beneficial to automate this to spot regressions (maybe an entry on codespeed?)

About having multiple GCs: in this PR I duplicated the codepath because it allowed me to trust the verification (which is inline in the marking code, meh), but it could be merged quite easily, and I don't think there should be any penalty in saying that a full collection is an incremental one with an infinite timeout.
As you said, this would not ensure proper testing of the incremental code path, but I believe that simply making the CI server run the testsuite with both modes would be a good first step.

Btw, congrats on the codebase it's really clean and quite nice to dive into.

JeffBezanson · 2013-12-23T22:54:22Z

Thanks.

It would be nice if we could somehow avoid calling clock_now() for every object. Surely that has some overhead.

I assume all the verification stuff is just for debugging? It doesn't all seem to get disabled by GC_VERIFY.

carnaval · 2013-12-23T23:01:38Z

About clock_now, I had a counter at some point to only check once every N objects but didn't see measurable improvements. However a lot of things changed since then so it may very well be worth it now. I will have a look.
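The counter approach mentioned above could look roughly like this in C. This is a sketch, not the patch's marking loop: `mark_some`, the interval of 64, and the injected `clock_fn` (standing in for `clock_now()`) are assumptions, and `counting_clock` is a deterministic stand-in clock included only so the example can be exercised.

```c
#include <assert.h>
#include <stddef.h>

#define CLOCK_CHECK_INTERVAL 64     /* hypothetical N */

typedef double (*clock_fn)(void);   /* stand-in for clock_now() */

/* Demo clock: advances by one tick per call and counts the calls. */
size_t clock_calls = 0;
double counting_clock(void)
{
    return (double)++clock_calls;
}

/* Process up to `n` queued objects, stopping once `deadline` has passed.
   The clock is consulted only once every CLOCK_CHECK_INTERVAL objects.
   Returns the number of objects processed. */
size_t mark_some(size_t n, double deadline, clock_fn now)
{
    size_t done = 0;
    assert(now != NULL);
    while (done < n) {
        /* ... mark one object here ... */
        done++;
        if (done % CLOCK_CHECK_INTERVAL == 0 && now() >= deadline)
            break;
    }
    return done;
}
```

The trade-off is that the deadline can be overrun by the time it takes to mark up to `CLOCK_CHECK_INTERVAL - 1` objects, so the interval bounds how precise the pause limit can be.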

You're right, seems like I forgot to disable the verification macros when GC_VERIFY is not defined.

JeffBezanson · 2013-12-24T03:03:32Z

A couple formatting points to fix:

space between if and ( (if (x) ...)
newline before opening { in function definitions
f(void) for 0-argument function declarations
newline between } and else instead of } else {
if one part of an if..else chain uses { } then all should

StefanKarpinski · 2013-12-24T04:24:17Z

f(void) for 0-argument function declarations

This one isn't actually a style point, but required for correctly declaring a zero-argument C function.

ViralBShah · 2013-12-24T18:21:42Z

This is really impressive, and would enable lots of new applications.

carnaval · 2013-12-26T16:47:39Z

Let me know if I missed anything.

ssfrr · 2014-01-01T19:35:46Z

I'm really excited about bringing the max latency down to enable realtime audio applications, so bringing down the GC delay would be a huge step. Thanks for the work on this!

-s

JeffBezanson · 2014-01-01T21:32:00Z

Yes, I plan to merge this.

carnaval · 2014-01-21T12:53:59Z

I finally got around to spending a little more time on this.
A couple of questions:

Should I make it a runtime flag? It would have a never-triggering barrier in the runtime and no barriers in generated code. I will try to measure whether this has a noticeable impact.
Are CI build times important (i.e. is anyone waiting on them)? If not, is it deemed reasonable to run the test suite twice, with full collections and partials?
In codegen.cpp, I'm wondering if there is a better way, given a varinfo_t, to know if it is stack allocated than to check the IR definition (see is_stack).

@ssfrr I have no experience in realtime audio, do you happen to know the order of magnitude of tolerable pause times ?

@JeffBezanson About the escape bit, if I understood you correctly, it would not be too hard to implement on top of what I have now. We would still have to sweep the whole heap though. The only way around this is to implement moving, and LLVM does not seem really suited for this. In fact, I only see one way to avoid moving an object while in the middle of some computation using its address. It involves reloading every root after every potentially-allocating operation, which does not strike me as cheap and would probably confuse LLVM's optimization passes.

ssfrr · 2014-01-22T06:59:05Z

Realtime audio deadlines are around 10-20ms max if you're processing live sound (~5ms is not uncommon). Typically there's an audio callback that gets called with every new frame of incoming audio, and within that context the application processes the sound and returns a new frame to be output. So given a 20ms deadline, ideally as much of that time as possible can be used for processing (if the system is blocked for 10ms of the 20ms, that only allows 50% CPU utilization).
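The budget arithmetic above can be written out as a small C helper. The function name and signature are made up for the example; the 20 ms deadline and the 40 ms and 5 ms maximum pauses are the figures quoted in this thread.

```c
#include <assert.h>

/* Fraction of each callback period left for processing when the
   runtime may block for `pause_ms` out of every `deadline_ms`. */
double usable_fraction(double deadline_ms, double pause_ms)
{
    assert(deadline_ms > 0.0 && pause_ms >= 0.0);
    if (pause_ms >= deadline_ms)
        return 0.0;    /* the pause alone misses the deadline */
    return (deadline_ms - pause_ms) / deadline_ms;
}
```

With a 20 ms deadline, a 10 ms pause leaves half of the period, a 5 ms pause leaves three quarters, and a 40 ms pause leaves nothing.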

This is the main reason that audio processing software is almost always written in C or C++, with all the memory allocation happening up front and blocking operations like disk access in a separate lower-priority thread.

I think that this is a niche enough use case that I wouldn't expect the GC design to have huge compromises to meet it. It would be nice if it were at least possible to write performant audio processing code in Julia though, even if it required a bit of care on the developer's part. I've got a few audio related projects and have been toying with the idea of using julia to implement the engine.

vtjnash · 2014-01-29T03:34:44Z

What's the status of this? I would like to see it get merged before I start working on #2818, since I expect they will have a fairly high degree of overlap


carnaval · 2014-01-29T07:14:43Z

I'm cautiously confident that the incremental marking is correct (I've been using it exclusively for some time now).
I added escape tracking and page skipping to allow for quick collections. It appears to give a nice speedup while retaining low pause times.
The only remaining long pauses are sweep phases after a long collection.
On a silly example which allocates a lot of temporaries :

silly(N) = (x = 0; for i=1:N x += {1}[1] end; x)
silly(10); @time silly(40000000)

master : time: 6.727313459 seconds (3199998572 bytes allocated)
with GC_TRACK_ESC : 2.445960981 seconds (3199998572 bytes allocated)

I intend to run a more serious benchmark later today (morning here) comparing master and all combinations of (inc_mark, track_esc) on runtime, pause times, max heap size.
Depending on the measurements, it may make sense to merge it enabled by default - and it would help battle-test the code ;-).
The code is a bit messy for now, I'll clean it up and explain a bit what's happening.
Cheers.

carnaval · 2014-01-29T07:20:43Z

@JeffBezanson By the way, is there any reason for the non power of 2 gc page size on 64bit cpu ? Git blame gives a very old commit.
To quickly check if an object is pooled and on what page I had to use 2^n aligned pages.

timholy · 2014-01-29T12:03:38Z

@carnaval, this is really impressive; faster throughput with short pauses sounds almost too good to be true. (I oh-so-vaguely understand generational gc and why it's not impossible, but still...)

Is the Travis error on the Clang build anything to be worried about? Naively it does look like a potential memory corruption problem.

JeffBezanson · 2014-01-29T17:25:51Z

This is awesome. We should plan to merge it at the start of the 0.4 cycle to shake out any bugs.
The reason it's not too good to be true is that the existing gc was far from optimal :-)

timholy · 2014-01-29T17:49:17Z

Had one tried to make it optimal from the outset, Julia would still be stuck with about 4 users (the perfect being the enemy of the good and all that). But it's really great to see this getting attention, it seems very timely given the improvements that are being made in other areas.

ViralBShah · 2014-01-30T01:38:17Z

How about merging it now but only enabled with a switch?

ssfrr · 2014-05-24T00:57:36Z

In a little non-rigorous testing just now this is performing way better than the standard GC with AudioIO.

I made a sloppy audio node that does a buffer allocate/free on every block of audio rather than iterating. With the normal GC I hear frequent periodic dropouts when I have more than 4-5 of the nodes running at once. With the incremental GC I can get 200 of them running with occasional dropouts, and 100 was running with no dropouts (again, in my very quick testing).

Granted I'm operating at pretty large buffer sizes right now, but this definitely seems like a huge win.

There was one compile-time issue: I had to change MAP_ANONYMOUS to MAP_ANON in gc.c. It looks like a Linux/OS X compatibility issue (this is an OS X box), perhaps fixable with just a

#ifndef MAP_ANONYMOUS
    #define MAP_ANONYMOUS MAP_ANON
#endif

StefanKarpinski · 2014-05-24T02:01:09Z

So what's in the way of merging this?

carnaval · 2014-10-05T12:16:14Z

I'm not sure we can expect a speedup on this benchmark, since the generational hypothesis doesn't model the behavior of the program well. If, for example, you throw away those strings quickly instead of storing them in a huge array, you should get an improvement over current results.
In fact, apart from the few temporaries generated by the comprehension, the GC cannot free any memory until the whole array becomes garbage. The optimal behavior would then be something like gc_disable(); build_huge_array(); gc_enable();, which is what the current gc tends toward by growing the collection interval geometrically.
I agree that in those cases the heuristic should be smarter and stop trying to do young collections as often (which it does, but not aggressively enough).

This effect is made much worse by the fact that, even if the string array gets promoted to old gen, it must be remarked fully every time because we are stuffing it with young objects. The way around this is to use a card table alongside (or as a replacement of) the remembered set to allow partial invalidation of large arrays. This will make the write barrier more complicated because given a pointer it's not straightforward (i.e. O(1) for a very small value of 1 :-) ) to know how it is stored in memory.
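For readers unfamiliar with card tables, here is a minimal C sketch of the idea: the barrier dirties a small fixed-size region ("card") instead of remembering the whole array, so only dirty cards need rescanning. The toy heap, the 512-byte card size, and all names are assumptions for the example; it deliberately sidesteps the hard part mentioned above, namely finding out how a given pointer is stored in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CARD_LG2   9                       /* hypothetical: 512-byte cards */
#define HEAP_BYTES (1u << 16)              /* toy 64 KiB heap */
#define N_CARDS    (HEAP_BYTES >> CARD_LG2)

unsigned char heap[HEAP_BYTES];
unsigned char card_table[N_CARDS];         /* 1 = card may hold an old-to-young pointer */

/* Barrier: record that the slot at `slot` (inside `heap`) was written. */
void card_mark(const void *slot)
{
    size_t off = (size_t)((const unsigned char *)slot - heap);
    assert(off < HEAP_BYTES);
    card_table[off >> CARD_LG2] = 1;
}

/* Count the dirty cards overlapping [start, start + len). */
size_t dirty_cards_in(const void *start, size_t len)
{
    size_t off = (size_t)((const unsigned char *)start - heap);
    size_t n = 0;
    assert(len > 0 && off + len <= HEAP_BYTES);
    for (size_t c = off >> CARD_LG2; c <= (off + len - 1) >> CARD_LG2; c++)
        n += card_table[c];
    return n;
}

/* Reset all cards, e.g. after a collection has rescanned them. */
void cards_clear(void)
{
    memset(card_table, 0, sizeof card_table);
}
```

In this toy setup, writing two young pointers into an old 8 KiB array dirties at most two of its sixteen cards, so a young collection rescans about 1 KiB instead of the whole array.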

That being said I'm having a look right now to see if there are ways to help this specific case.
I also agree that the perf test suite should be extended, as that's mostly what I'm basing myself upon to check for regressions since I don't have that much of a julia corpus to test things on.

JeffBezanson · 2014-10-05T21:34:52Z

Good point: the benchmark should be repeated with the new objects being garbage.

jakebolewski · 2014-10-06T01:34:50Z

That makes sense, thanks for the detailed explanation. You are right that the new GC is much better under the conditions you outlined (although both tests represent extremes of "normal" allocation behavior).

Julia/julia-dev-33 [master] » ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+944 (2014-10-04 23:57 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit c5f136a (1 day old master)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataFrames

julia> function test(N)
       for _ in 1:N
           reverse(utf16("this is a test"))
       end
       end
test (generic function with 1 method)

julia> @time test(5_000_000)
elapsed time: 1.410968703 seconds (800865016 bytes allocated, 35.24% gc time)

julia> le = @time open(deserialize, "/Users/jacobbolewski/Julia/benchmarks/labevents.jls");
elapsed time: 9.167030852 seconds (1326598072 bytes allocated, 5.80% gc time)

julia> @time test(5_000_000)
elapsed time: 2.39665995 seconds (800000080 bytes allocated, 58.28% gc time)

julia> @time test(5_000_000)
elapsed time: 6.629903708 seconds (800000080 bytes allocated, 87.44% gc time)

julia> @time test(5_000_000)
elapsed time: 6.533877549 seconds (800000080 bytes allocated, 87.42% gc time)

julia> @time test(5_000_000)
elapsed time: 6.931448025 seconds (800000080 bytes allocated, 88.15% gc time)

julia> @time test(5_000_000)
elapsed time: 6.193674193 seconds (800000080 bytes allocated, 86.91% gc time)

This PR

Julia/julia-dev-33 [newgc] » ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+525 (2014-09-28 22:51 UTC)
 _/ |\__'_|_|_|\__'_|  |  newgc/8c54091 (fork: 13 commits, 27 days)
|__/                   |  x86_64-apple-darwin13.4.0

julia> using DataFrames

julia> function test(N)
       for _ in 1:N
           reverse(utf16("this is a test"))
       end
       end
test (generic function with 1 method)

julia> @time test(5_000_000)
elapsed time: 1.03702919 seconds (763 MB allocated, 9.41% gc time)

julia> le = @time open(deserialize, "/Users/jacobbolewski/Julia/benchmarks/labevents.jls");
elapsed time: 9.971573486 seconds (1265 MB allocated, 6.24% gc time)

julia> @time test(5_000_000)
elapsed time: 2.389991338 seconds (762 MB allocated, 53.33% gc time)

julia> @time test(5_000_000)
elapsed time: 1.740331145 seconds (762 MB allocated, 43.69% gc time)

julia> @time test(5_000_000)
elapsed time: 1.314867743 seconds (762 MB allocated, 29.68% gc time)

julia> @time test(5_000_000)
elapsed time: 1.404829552 seconds (762 MB allocated, 32.67% gc time)

julia> @time test(5_000_000)
elapsed time: 1.397260517 seconds (762 MB allocated, 33.38% gc time)

JeffBezanson · 2014-10-06T01:59:25Z

Excellent!

JeffBezanson · 2014-10-14T18:49:49Z

@carnaval I just added you to the contributors team. You can move this branch to the main repository, under a name like ob/gengc, and reopen this PR from there. That will allow all of us to easily contribute tweaks leading up to merging this.

carnaval · 2014-10-16T11:28:07Z

Thanks ! I'll try and do this without breaking everything. We all know what comes with great power ... :-)

ihnorton · 2014-10-16T11:55:25Z

Just don't push to master by accident, and especially don't force push to master.


vtjnash · 2015-03-11T03:49:13Z

src/gc.c

+}
+
+
+static inline int gc_setmark(void *o, int sz, int mark_mode)


in some cases, I think this is being called with a sz value that doesn't include the jl_value_t *type tag (e.g. when it comes from jl_datatype_size)

vtjnash added a commit that referenced this pull request Jan 29, 2014

Add code formatting guidelines

25684dd

Taken from Jeff's comments on a recent pull request: #5227 (comment)

timholy mentioned this pull request Mar 4, 2014

Slow broadcast addition of matrices JuliaLang/LinearAlgebra.jl#89

Closed

carnaval mentioned this pull request Mar 9, 2014

add missing root #6085

Merged

JeffBezanson mentioned this pull request Apr 1, 2014

unexpected allocation in this code #6357

Closed

stevengj mentioned this pull request Apr 21, 2014

types as C-structs #2818

Closed

5 tasks

ViralBShah assigned JeffBezanson Apr 23, 2014

jiahao force-pushed the master branch from 2ef98c5 to 0388647 Compare October 5, 2014 00:57

jiahao force-pushed the master branch from 6c7c7e3 to 1a4c02f Compare October 11, 2014 22:06

carnaval mentioned this pull request Oct 16, 2014

Generational behavior for the garbage collector #8699

Merged

carnaval closed this Oct 16, 2014

carnaval added 17 commits October 16, 2014 15:29

Add incremental GC & write barrier.

9187c4e

This also includes various allocation changes which should improve performances. There is also a start of generational behavior for <2k objects. This broke the heuristics in the process, still pretty much a WIP.

All wb are now backward for quick collections, big objects can be tra…

894fc4b

…nsients, bugfixes... Still in a pretty broken state (at the very least incremental codepath isn't working) @trrousse :-)

working version of promotion at sweep

a6bd839

fix darwin build & start to cleanup

755581c

Fix a bug where a soon-to-be-promoted object would escape the write b…

f0a78a7

…arrier. A bit more cleanup too. Also some missing write barrier in new code.

repair timing & memory stats

52520f4

count external memory alloc again

611a471

add peak resident memory to perf tests, slight heuristic adjustment, …

10a32ff

…no there yet.

Prevent module level assignments from clobbering up the remembered se…

977fa6c

…t. A bit more tweaking of the collection heuristics. We are now faster/less memory hungry on almost every benchmark of the micro, kernel & shootout suite.

Slight allocation optimizations. Also another cleanup round.

5c9ec6c

More cleanup

9d4568d

Avoid a silly 1-cycle latency to old object remarking after a full co…

8dd9c91

…llection. Slight cleanup. Address some of Jeff's comments.

cleanup some defines. remove useless geptr instruction. repair GC_FIN…

0419d5a

…AL_STATS.

yet another round of cleanups + some additional comments.

c238705

add some more timing output. improve a bit the page allocator.

e47f5e1

oups

1a0b706

remove gc_inc, add more comments

7447ccb

vtjnash reviewed Mar 11, 2015