
Track first free atbs for multiple block sizes instead of just 1 #2614

Merged
merged 2 commits on Feb 12, 2020

Conversation

tannewt
Member

@tannewt tannewt commented Feb 12, 2020

This speeds up code paths that allocate only blocks larger than 1.

I believe imports will be faster because long lived allocations now always update the last free atb by assuming allocations are contiguous on the long lived end of the heap.
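
For readers not familiar with py/gc.c: a minimal, hypothetical sketch of the idea, with illustrative names and a plain boolean bitmap standing in for the real allocation table bytes (ATBs). Instead of one shared "first free" cursor, the allocator keeps one cursor per block-size class, so allocations of one size don't rewind or stall the search position for other sizes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_BLOCKS 64
#define NUM_CURSORS 4 /* size classes for 1-, 2-, 3- and 4-block requests */

static bool used[HEAP_BLOCKS];
static size_t first_free[NUM_CURSORS]; /* search start per size class */

/* Allocate n contiguous blocks (1 <= n <= NUM_CURSORS); returns the
 * start index, or -1 when no run is found. */
static int alloc_blocks(size_t n) {
    assert(n >= 1 && n <= NUM_CURSORS);
    size_t c = n - 1, run = 0;
    for (size_t i = first_free[c]; i < HEAP_BLOCKS; i++) {
        if (used[i]) { run = 0; continue; }
        if (++run == n) {
            size_t s = i + 1 - n;
            for (size_t j = s; j <= i; j++) used[j] = true;
            /* No free run of this size exists before i + 1 any more, so
             * this class's cursor can skip ahead permanently. */
            first_free[c] = i + 1;
            /* Other cursors that pointed into the consumed run hop past it. */
            for (size_t k = 0; k < NUM_CURSORS; k++) {
                if (first_free[k] >= s && first_free[k] <= i) {
                    first_free[k] = i + 1;
                }
            }
            return (int)s;
        }
    }
    return -1;
}

static void free_blocks(size_t start, size_t n) {
    assert(n >= 1 && n <= NUM_CURSORS);
    for (size_t j = start; j < start + n; j++) used[j] = false;
    /* Rewind only this size class's cursor: freeing a single block does
     * not disturb the multi-block cursors. Free space this strands for
     * other classes would be found again when a real collector sweeps
     * and rebuilds the cursors. */
    size_t c = n - 1;
    if (start < first_free[c]) first_free[c] = start;
}
```

In this toy model, freeing a single block rewinds only the 1-block cursor, so a following 2-block allocation resumes where the last multi-block search left off instead of rescanning from the hole.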

@jimmo @dpgeorge I heard you tried this as well but found cases where it was slower. What were they? I've split our py/ patch out into a commit but think it's still too different to apply smoothly in MP since we also have the long-lived stuff.

This speeds up cases where no single block allocations happen while
Python code is allocating other block sizes.
@tannewt tannewt added this to the 5.0.0 milestone Feb 12, 2020
@tannewt tannewt requested a review from dhalbert February 12, 2020 01:11
Collaborator

@dhalbert dhalbert left a comment


This looks good! I tested a CPB fetching acceleration in a tight loop for 15 minutes with no issues. I'll hold off on merging in case others have further comments.

@jimmo

jimmo commented Feb 12, 2020

hi @tannewt -- I looked into this on MicroPython and tried something very similar. There are two main things to solve:

  • Long runs of >1-block allocations mean the pointer never advances (e.g. we see this when allocating the complex number type on boards that support double-precision hardware floating point, where the complex type ends up taking two blocks).
  • Patterns of allocs paired with frees/reallocs cause the pointer to reset backwards, but the next allocation doesn't fit in the freed space, so it has to search all the way forward again. As long as you have enough buckets to cover the different sizes, this PR will fix that.
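
The second pattern is essentially the linked issue about a with statement in a loop taking cyclically increasing time. A hypothetical single-cursor model (illustrative names only, not the real py/gc.c code) makes the growing search cost visible:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define HEAP 4096
static bool used[HEAP];
static size_t cursor;     /* the single shared "first free" hint */
static size_t scan_steps; /* blocks examined, to expose search cost */

static int alloc1(size_t n) {
    size_t run = 0;
    for (size_t i = cursor; i < HEAP; i++) {
        scan_steps++;
        if (used[i]) { run = 0; continue; }
        if (++run == n) {
            size_t s = i + 1 - n;
            for (size_t j = s; j <= i; j++) used[j] = true;
            if (cursor == s) cursor = i + 1; /* only a front allocation advances it */
            return (int)s;
        }
    }
    return -1;
}

static void free1(size_t s, size_t n) {
    for (size_t j = s; j < s + n; j++) used[j] = false;
    if (s < cursor) cursor = s; /* the free pulls the shared cursor back */
}

static void reset_heap(void) {
    memset(used, 0, sizeof(used));
    cursor = 0;
    scan_steps = 0;
}

/* Each iteration: a short-lived 2-block temp plus a 1-block survivor.
 * Freeing the temp rewinds the cursor, the next temp refills the hole,
 * and the 1-block search walks one more used block every iteration.
 * Returns the number of blocks scanned by the 1-block allocation in
 * the k-th iteration. */
static size_t one_block_cost_at(size_t k) {
    size_t cost = 0;
    for (size_t i = 1; i <= k; i++) {
        int tmp = alloc1(2);
        size_t before = scan_steps;
        alloc1(1);
        cost = scan_steps - before;
        free1((size_t)tmp, 2);
    }
    return cost;
}
```

In this toy model one_block_cost_at(k) comes out to k, so total work over a loop is quadratic. Tracking a separate cursor per block size, as this PR does, means freeing the 2-block temp rewinds only the 2-block cursor, and the 1-block search no longer restarts behind the pile of survivors.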

I implemented something very similar for MicroPython, but the fix caused a performance regression, and I wasn't sure the underlying issue was common enough to warrant the hit. So I went looking for alternative fixes instead :)

I don't currently have access to the computer where I was doing these measurements, but I patched this PR into micropython/master (see https://github.com/jimmo/micropython/tree/gc-multi-atb ) and ran the performance tests on it, with similar results to what I saw with my changes:

$ ./run-perfbench.py -t ~/mpy/perf/pyb.master ~/mpy/perf/pyb.multi 
diff of microsecond times (lower is better)
N=100 M=100              /home/jimmo/mpy/perf/pyb.master -> /home/jimmo/mpy/perf/pyb.multi         diff      diff% (error%)
bm_chaos.py               166264.50 ->  170207.62 :   +3943.12 =  +2.372% (+/-0.01%)
bm_fannkuch.py             79847.12 ->   82201.12 :   +2354.00 =  +2.948% (+/-0.00%)
bm_fft.py                 311653.12 ->  318420.50 :   +6767.38 =  +2.171% (+/-0.00%)
bm_float.py                51526.62 ->   53355.75 :   +1829.13 =  +3.550% (+/-0.00%)
bm_hexiom.py              287545.25 ->  286712.62 :    -832.63 =  -0.290% (+/-0.00%)
bm_nqueens.py             237672.38 ->  247577.00 :   +9904.62 =  +4.167% (+/-0.00%)
bm_pidigits.py            101768.50 ->  100396.88 :   -1371.62 =  -1.348% (+/-0.22%)
misc_aes.py                88932.25 ->   88190.75 :    -741.50 =  -0.834% (+/-0.01%)
misc_mandel.py            132150.50 ->  139368.12 :   +7217.62 =  +5.462% (+/-0.00%)
misc_pystone.py           154903.00 ->  154500.00 :    -403.00 =  -0.260% (+/-0.00%)
misc_raytrace.py          163106.62 ->  169317.62 :   +6211.00 =  +3.808% (+/-0.01%)

(This is on pybv11).

Note that these performance tests may or may not be representative of real programs, and it's worth noting that some do in fact get faster. Especially if you're seeing this issue in code using your standard libraries (e.g. sensor reading in a loop), then merging this sounds like easily the right thing to do.

For completeness, some other things I investigated:

  • Some heuristics that trade a little bit of fragmentation for speed (e.g. keep track of how many allocations have happened since the last pointer move, and advance the pointer every N).
  • Similar to the above: keep track of the total number of free blocks found during the search, and advance the pointer if that number is less than N (i.e. if you were searching for a two-block alloc but only saw some small number of free single blocks along the way, then you should advance the pointer).
  • Reducing the number of buckets, including just having a "single" and a "multi" pointer. Doesn't improve much compared to more buckets.
  • Keeping a stack of movements of the pointer. This helps a lot with the second issue, but it's hard to size the stack correctly, and still has the perf hit.
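
The second heuristic can be sketched like this (hypothetical, illustrative names, not MicroPython code). The search counts the free blocks it skips; if fewer than a threshold's worth of usable blocks lie behind the chosen run, advancing the single shared pointer past it wastes at most that many blocks until the next sweep:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP 64
#define FREE_SEEN_THRESHOLD 4

static bool used[HEAP];
static size_t cursor; /* the single shared "first free" hint */

static int alloc_with_heuristic(size_t n) {
    size_t run = 0, free_seen = 0;
    for (size_t i = cursor; i < HEAP; i++) {
        if (used[i]) { run = 0; continue; }
        free_seen++;
        if (++run == n) {
            size_t s = i + 1 - n;
            for (size_t j = s; j <= i; j++) used[j] = true;
            /* free_seen - n = free blocks skipped before the chosen run;
             * if few enough, give up on them and advance the pointer. */
            if (free_seen - n < FREE_SEEN_THRESHOLD) cursor = i + 1;
            return (int)s;
        }
    }
    return -1;
}
```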

@tannewt
Member Author

tannewt commented Feb 12, 2020

Thanks @jimmo! I bet the perf tests that improved made fewer 1-block allocations than the others. I may play around with them later to see.

@dhalbert Are you ok merging this? I think it'll generally be an improvement, and a drastic one in the pathological case.

@tannewt tannewt linked an issue Feb 12, 2020 that may be closed by this pull request
@dhalbert
Collaborator

I will merge! I think it's a real-life improvement for a number of cases.


Successfully merging this pull request may close these issues.

with statement in loop takes cyclically increasing time