Vec's reallocation strategy needs settling #29931
Comments
cc @briansmith |
cc me |
I guess you mean s/usable_capacity/malloc_usable_size/.
I think that's a "nice to have" but not very important. Is real-world code doing
I disagree with this. |
That’s a hack to work around the lack of proper stable API, not how it should be. |
Regardless of what we want to do now, we need to remove the "guarantee" in the nightly docs about capacity, which locks us in. We have no reason to promise all kinds of things about Vec internals. |
CC @glandium |
Maybe instead of actually using usable_size (which has a non-zero cost as you mentioned), the better thing to do would be to just round up to the nearest power-of-two (in bytes) for small allocations and to the nearest MiB for large ones, like mozillas nsTArray does. I have also looked a bit at the code, and at least VecDeque rounds up to the nearest pot (in elements), probably because it does a Btw, just grepping through the rustc code for |
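A minimal sketch of the rounding idea suggested above, assuming a hypothetical 1 MiB cutoff between "small" and "large" requests (the comment does not give an exact threshold, and the function name is made up for illustration):

```rust
/// Hypothetical cutoff between "small" and "large" requests.
const MIB: usize = 1 << 20;

/// Rounds a requested byte size up: to the next power of two below one MiB,
/// and to the next multiple of one MiB at or above it.
/// Returns `None` if the rounded size would overflow `usize`.
fn round_up_request(bytes: usize) -> Option<usize> {
    if bytes < MIB {
        bytes.checked_next_power_of_two()
    } else {
        // MIB is a power of two, so masking clears the low bits.
        bytes.checked_add(MIB - 1).map(|b| b & !(MIB - 1))
    }
}

fn main() {
    for request in [100, 4096, 4097, MIB + 1] {
        println!("{} -> {:?}", request, round_up_request(request));
    }
}
```

No allocator query is needed here: the rounding is pure integer arithmetic, which is the appeal compared to calling into the allocator on every growth.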
VecDeque needs to maintain its power of two capacity in order to support efficient modular arithmetic (just a bit mask). HashMap does the same. |
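To illustrate the bit-mask point, here is a small sketch (the helper name `wrap_index` is made up, this is not VecDeque's actual code): when the capacity is a power of two, wrapping an index needs a single AND instead of a division.

```rust
/// Wraps a logical index into a ring buffer whose capacity is a power of two.
fn wrap_index(index: usize, capacity: usize) -> usize {
    debug_assert!(capacity.is_power_of_two());
    index & (capacity - 1)
}

fn main() {
    let capacity = 8;
    // The mask gives the same result as `index % capacity`.
    for index in [0usize, 7, 8, 13, 64] {
        assert_eq!(wrap_index(index, capacity), index % capacity);
    }
    println!("13 wraps to {}", wrap_index(13, capacity));
}
```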
FWIW, one can use constant-division tricks to do (relatively) efficient modular arithmetic, if the bit pattern is fixed (e.g. if the capacities are guaranteed to be either (I believe
for some constant |
Oh and they both keep some slack capacity for similar perf reasons. Hashmap will only use 91% of its cap because it keeps access times down. VecDeque keeps an empty slot so that the "two indices" representation can differentiate between the empty and full state without us maintaining some flag that constantly needs to be checked/maintained. I would also like to slightly dissent on the point of Vec being a good malloc. It's an excellent memory-safe and exception-safe interface for acquiring, initializing, deinitializing, and releasing an allocation. It's a bad choice for lower-level uses (all the other collections), of course. |
It's important to note that jemalloc isn't the only allocation that exhibits this behavior (having some slack), actually many do. I assumed we had the trio: reserve, reserve_exact and shrink_to_fit; exactly in order to allow reserve to be highly optimized and do smart reallocation, like in this case. |
I think you've got the wrong larsberg. I believe you want @larsbergstrom |
The system allocator may give more memory than requested, usually rounding up to the next power-of-two. This PR uses a similar logic inside RawVec to be able to make use of this excess memory. NB: this is *not* used for `with_capacity` and `shrink_to_fit` pending discussion in rust-lang#29931
Hi, I am learning Rust, and when I run the attached programs with the size parameter set to 100 million Points, the Rust vector takes much more memory. The program generated by the Rust compiler (-O) takes 4.5 GB of RAM, whereas the one generated by clang++ (-O) takes 3 GB. I also tried a Swift program, which took around 3 GB (same as C++). The time command output for Rust and C++ respectively is: [Rust 1.5] real 0m25.189s [C++ compiled using clang++ with -O flag] real 0m21.816s Queries:
|
@kajalsinha You might want to specify the capacity when you create the vector (see |
The memory consumption remains 1.5 times higher even when using with_capacity |
@kajalsinha your C++ program is using long, Rust is using f64, right? According to google, a long may be 32 bits, I expect f64 is 64 bits. If this is the case, you should be careful with any comparisons, integers are usually faster than floating point even for same size. Caveat: I was never a C++ programmer, and google is occasionally wrong. |
The C/C++ standards define long as a number with no less capacity than an int and no more capacity than a long long. That's it. On Mac OS X (which @kajalsinha is using, based on the console paste), it's one word long: 32 bits on 32-bit OSX and 64 bits on 64-bit OSX. For a fair comparison with Rust, you should use an The fact that you're using integers in C and floating point numbers in Rust also accounts for the performance difference. |
What is the point of even having reserve_exact(), if it provides no additional guarantees over reserve()? |
@Storyyeller |
@ticki what does that mean if it doesn't mean that the capacity can be greater than requested? I believe that |
To make bookkeeping of the memory smoother, most allocators have some fixed sizes you can allocate from. If the given value is not one of these, it will round up to the nearest fitting size. |
Right, but that doesn't mean that "
|
@ollie27 No, the point is that the capacity is updated with the (uncanonicalised) memory returned by the allocator. |
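For reference, a short sketch of what the stable `Vec` API lets a caller rely on for these methods. It only asserts lower bounds, since the exact capacity may be larger than requested; the helper name is made up.

```rust
/// Returns the capacities observed after `reserve`, `reserve_exact`,
/// and `reserve` followed by `shrink_to_fit`, for a vector of `len` bytes.
fn capacities(len: usize, additional: usize) -> (usize, usize, usize) {
    let mut a = vec![0u8; len];
    a.reserve(additional); // may over-allocate to amortize future growth

    let mut b = vec![0u8; len];
    b.reserve_exact(additional); // asks for no more than needed, if possible

    let mut c = vec![0u8; len];
    c.reserve(additional);
    c.shrink_to_fit(); // drops as much excess capacity as possible

    (a.capacity(), b.capacity(), c.capacity())
}

fn main() {
    let (reserved, exact, shrunk) = capacities(3, 10);
    // Only these lower bounds are safe to rely on.
    assert!(reserved >= 13);
    assert!(exact >= 13);
    assert!(shrunk >= 3);
    println!("reserve: {reserved}, reserve_exact: {exact}, shrink_to_fit: {shrunk}");
}
```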
@gnzlbg The optimal growth factor is probably the golden ratio |
Optimal in what sense?
Note that one property of a good growth factor is that you are able to reuse some previously freed memory.
With 1.45 this happens after 3 reallocations, with 1.5 after 4, and in this PR we have something slightly higher than 1.5. If you don't reallocate 4 times a growth factor of 1.5 doesn't allow you to reuse any memory.
The golden ratio is indeed optimal in a particular sense, but the most efficient growth ratio is always going to be application dependent. I (and many others) think 3 / 2 is a good trade off between easy to compute (no floating point arithmetic necessary), low number of reallocations necessary to be profitable, and close enough to the golden ratio.
Still, it can't be the best for all applications.
On Fri 10. Nov 2017 at 19:20, Mamy Ratsimbazafy wrote:
@gnzlbg The optimal growth factor is probably the golden ratio Φ (1.618...) as discussed here facebook/folly#543 and in depth in this blog post <https://crntaylor.wordpress.com/2011/07/15/optimal-memory-reallocation-and-the-golden-ratio/>.
|
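The reuse counts quoted above (3 reallocations for 1.45, 4 for 1.5) can be reproduced with a small simulation. This is a sketch under an idealized assumption, which later comments in this thread dispute for real allocators: all previously freed buffers are contiguous and can be merged, while the live buffer must stay allocated during the copy. The function name is made up.

```rust
/// Returns how many reallocations must already have happened before the
/// combined size of all freed buffers can hold the next buffer.
/// Sizes are in units of the initial allocation.
fn reallocations_before_reuse(factor: f64, max_steps: usize) -> Option<usize> {
    let mut freed = 0.0_f64; // total size of buffers released so far
    let mut current = 1.0_f64; // size of the live buffer
    for done in 0..max_steps {
        let next = current * factor;
        if freed >= next {
            return Some(done);
        }
        // Growing frees the current buffer once its contents are copied.
        freed += current;
        current = next;
    }
    None
}

fn main() {
    for factor in [1.45, 1.5, 2.0] {
        println!("{factor}: {:?}", reallocations_before_reuse(factor, 64));
    }
}
```

With a factor of 2 the freed total is always one unit short of the live buffer, so it can never hold the next, twice-as-large buffer, and the function returns `None`.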
I strongly advocate for a doubling strategy, as opposed to 1.5x, or the golden ratio, or any other number. As I understand them, the arguments in favour of smaller values are based on a mistaken assumption. Let's assume we use doubling and start at 8 bytes, and then go 16, 32, 64, 128, 256, 512, 1024. For the 8..512 allocations we will have used 1016 bytes, and so we can't reuse those bytes for the 1024 bytes. With a 1.5x growth factor we would theoretically be able to reuse and combine multiple previous freed allocations to satisfy subsequent requests. But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent, and there will be no chance of combining them for such reuse. Other allocators may behave differently, but it would be a very unusual allocator that would lay out such sequences contiguously. So with that mistaken argument out of the way, we can see the arguments in favour of 2x: it's super simple and it's super fast to compute. |
Which is why the proposed strategy doubles the memory up to the page size [*], and then switches to 1.5x.
A better growth-factor can 1) reduce the number of jemalloc calls, 2) reduce the time spent in these calls, and 3) reduce memory usage. These all can have a measurable performance impact. But whether computing the growth factor takes 1 cycle or 20 is irrelevant compared to the cost of a single jemalloc call: [*] Which is arguably too simple. We could probably do better than this by rounding to the next power of two, so that instead of copying the memory on growth within the same memory pool (such that the older memory can't be reused) jemalloc would instead copy the memory into the next pool freeing the old memory. |
Above the page size, the assumption is still bogus, at least in the case of jemalloc: |
Yes, this is why the proposed strategy only uses 1.5x for medium sized vectors and switches the strategy again for "large" vectors (
Yes,
I think that we might be able to do better than doubling for "large" vectors by tweaking the growth factor, but not much better. The PR switches back to doubling here because this is what we are currently doing. Is there any literature or recommendations from jemalloc / malloc about this? Anyway, my plan for large vectors is not to tweak the growth factor by choosing between 1.5 and 2. The main problem with "very large" vectors is the cost of allocating new memory, copying memory (they are very large), and freeing the old memory. On any of the major 64-bit platforms (Linux, Windows, macOS, ...) we can reduce the cost of these operations to ~0 for very large vectors by, e.g., once a vector becomes "very large" allocating 4 GB of virtual memory (e.g. using I think that this will have a much larger impact on performance than tweaking the growth factor. |
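A rough sketch of the tiered strategy described in the last few comments: doubling below the page size, 1.5x for medium sizes, doubling again for large ones. The constants `MIN_BYTES`, `PAGE_SIZE` and `LARGE_THRESHOLD` are hypothetical placeholders, since the thread does not state the exact cutoffs, and this is not the code from the PR.

```rust
// Hypothetical values, chosen only for illustration.
const MIN_BYTES: usize = 8;
const PAGE_SIZE: usize = 4096;
const LARGE_THRESHOLD: usize = 4 * 1024 * 1024;

/// Picks the next buffer size in bytes for a buffer of `current` bytes.
fn next_capacity_bytes(current: usize) -> usize {
    if current == 0 {
        return MIN_BYTES;
    }
    if current < PAGE_SIZE || current >= LARGE_THRESHOLD {
        // Tiny and large buffers: double.
        current.saturating_mul(2)
    } else {
        // Medium buffers: grow by 1.5x using integer arithmetic only.
        current.saturating_add(current / 2)
    }
}

fn main() {
    let mut size = 0;
    for _ in 0..14 {
        size = next_capacity_bytes(size);
        println!("{size}");
    }
}
```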
Actually, C++ vectors always need an actual malloc/copy/free sequence, because
There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages). Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways. |
I don't follow the point you are trying to make. Yes, C++ vectors need to do this, but Rust's
Maybe we are talking past each other and you meant something different, but: Suppose you have a contiguous chunk of the address space that is free (before and after this chunk the memory is not free), and that fits the initial content of a vector plus 4x vector regrowths. Now suppose that after the 4th regrowth, the first part of the buffer is still free. The difference between a growth-factor of 1.5x and 2x is that the vector with a growth-factor of 1.5x can grow a fifth time inside that buffer while the one with a growth-factor of 2x never can. This is because, for a growth-factor of 1.5, the size of the previously freed allocations in the buffer is exactly 1.5x the size of the 4th vector, while for a growth-factor of 2x, the size of the previously-freed allocations is always smaller than 2x the size of the 4th vector.
So if the memory following the current buffer is not free, anything better than the worst theoretically-possible growth-factor allows you to grow the vector one extra time before having to go look for a larger buffer. But even if the memory following the current buffer is free, reusing the previously-freed memory reduces fragmentation, and can potentially reduce the number of jemalloc calls as well (if the free size after the 4th vector is returned as an Or were you talking about a completely different situation? |
Now that we are moving to the System allocator by default, I think it makes sense to at least take a look at what the C++ vector implementations do w.r.t the growth factor on the different platforms, since it could make sense that the platforms tune their growth factors for their allocators:
libc++
System implementation on MacOSX and some *BSDs, code:
// Precondition: __new_size > capacity()
template <class _Tp, class _Allocator>
inline _LIBCPP_INLINE_VISIBILITY
typename vector<_Tp, _Allocator>::size_type
vector<_Tp, _Allocator>::__recommend(size_type __new_size) const
{
    const size_type __ms = max_size();
    if (__new_size > __ms)
        this->__throw_length_error();
    const size_type __cap = capacity();
    if (__cap >= __ms / 2)
        return __ms;
    return _VSTD::max<size_type>(2*__cap, __new_size);
}
The growth factor is 2. Note that on *BSDs jemalloc is the default system allocator, although FBVector is tuned for jemalloc and does a couple of things differently as mentioned above (2 for tiny and large vectors, and 1.5 for the cases in which the allocation hits jemalloc pools).
libstdc++
System implementation on Linux. The code is here:
// Called by _M_fill_insert, _M_insert_aux etc.
size_type
_M_check_len(size_type __n, const char* __s) const
{
    if (max_size() - size() < __n)
        __throw_length_error(__N(__s));
    const size_type __len = size() + std::max(size(), __n);
    return (__len < size() || __len > max_size()) ? max_size() : __len;
}
The growth factor is 2 (size + size).
MSVC
Windows. I couldn't find the code online anywhere, but all internet sources point to a factor of 1.5. This factor is also used by FBVector, Boost, etc.
Ok so what is going on here? Well, Windows is the outlier, and that makes sense! Windows does not have overcommit, so all the memory On the other hand, for Linux and MacOSX, it might well be that Hell, if one wouldn't be able to disable overcommit in these systems we could just have allocations of large enough vector on 64-bit systems just jump to reserving XGb of RAM and continue with a 2x factor after that. That basically allows realloc to grow the vector in place without any So IMO, if we are going to change this, I'd say change this to whatever each system does. That is, 1.5 on Windows, and 2x on Linux and MacOSX. For targets that have |
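For readers more at home in Rust, the two C++ capacity computations quoted above can be restated roughly as follows. This is a sketch, not code from either standard library: the function names are made up, `max_size` is passed in explicitly, and `None` stands in for the C++ `length_error` exception.

```rust
/// libc++-style: max(2 * cap, new_size), saturating at `max_size`.
fn recommend_libcxx(cap: usize, new_size: usize, max_size: usize) -> Option<usize> {
    if new_size > max_size {
        return None; // C++ throws length_error here
    }
    if cap >= max_size / 2 {
        return Some(max_size);
    }
    // `2 * cap` cannot overflow because cap < max_size / 2.
    Some(std::cmp::max(2 * cap, new_size))
}

/// libstdc++-style: size + max(size, n), clamped to `max_size`.
fn check_len_libstdcxx(size: usize, n: usize, max_size: usize) -> Option<usize> {
    if max_size.saturating_sub(size) < n {
        return None; // C++ throws length_error here
    }
    let len = size.wrapping_add(std::cmp::max(size, n));
    // `len < size` detects wrap-around, as in the original.
    Some(if len < size || len > max_size { max_size } else { len })
}

fn main() {
    println!("{:?}", recommend_libcxx(8, 9, 1000));
    println!("{:?}", check_len_libstdcxx(8, 1, 1000));
}
```

Both double in the common case of pushing a single element onto a full vector. They differ in what they base the doubling on: capacity for libc++, size for libstdc++.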
I think these are two unrelated things. Windows might not allow overcommit, but virtual memory is still virtual. According to this article there have been some substantial improvements to how memory is managed in Windows 8.1 and Windows Server 2012 R2, both released in 2013. The comment by STL:
Was made on Aug 30 2014 and the "last time" mentioned could well be several years before that, so likely prior to the improvements. |
The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration. The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off. Compare with FBVector's initial capacity of I think it's as important not to overestimate the amount of thought put into old systems code as not to underestimate it. I might be wrong but I have an impression that it's pretty much impossible to push any changes to those policies such as vector's growth factor, hash table load factor, etc. in C++ standard library implementations because some users may depend on the existing behaviour and the priority is avoiding regressions rather than improving the average. That's probably why companies like Facebook and Google, albeit having significant power in GCC / Clang development, come along with their independent implementations. Rust still has the luxury (but probably not for long) to reconsider the old defaults and I think it should take advantage of it. That is, I think Rust should pick up the FBVector's strategy (or something similar) on all platforms. |
This doesn't really matter much if you always extend your vector to the
If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account. |
If it starts with something like FBVector's |
It's maybe a bit counterintuitive. That is, a Maybe it is a bit clearer with an example, let's suppose
let mut v = Vec::with_capacity(8);
// requests 8 * 13 = 104 bytes, allocator bin: 128 bytes
// 128 / 13 = 9
assert_eq!(v.capacity(), 9);
v.extend(0..10); // push 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 9 = 18 => 234 bytes
// allocator bin: 256 bytes => 256 / 13 = 19
assert_eq!(v.capacity(), 19);
v.extend(0..10); // push another 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 19 = 38 => 494 bytes
// allocator bin: 512 bytes => 512 / 13 = 39
assert_eq!(v.capacity(), 39);
// etc.
So that is what I meant with, if your Your allocator might not have size-classes, might lack an accurate But even if you don't know anything about your allocator, a power-of-two size class assumption makes sense. Many allocators use power-of-two size classes, and many types have a power-of-two size due to alignment requirements and fit it perfectly. If your vector starts with one such element and grows, it will do so perfectly, and if it doesn't fit a bin perfectly, it will always grow into the next bin (something that a 1.5x growth-factor doesn't give you). A growth factor of 2 is also dirt cheap to compute. If you have an accurate |
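The arithmetic in the example above can be checked with a small model. It assumes a hypothetical allocator whose bins are exactly the powers of two and a `Vec` that adopts the whole bin as capacity; today's `Vec` does not do this, so the asserts in the example describe the proposal, not current behaviour. The helper name is made up.

```rust
/// Capacity, in elements, that a request for `requested_elems` elements of
/// `elem_size` bytes would yield if the allocator rounded the byte size up
/// to the next power of two and the vector used all of it.
/// `elem_size` must be non-zero.
fn capacity_in_bin(requested_elems: usize, elem_size: usize) -> usize {
    let requested_bytes = requested_elems * elem_size;
    let bin_bytes = requested_bytes.next_power_of_two();
    bin_bytes / elem_size
}

fn main() {
    let elem_size = 13; // element size from the example above
    let mut cap = capacity_in_bin(8, elem_size);
    println!("initial capacity: {cap}");
    for _ in 0..2 {
        // Doubling the element count always lands in the next bin.
        cap = capacity_in_bin(2 * cap, elem_size);
        println!("after growth: {cap}");
    }
}
```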
I should have clarified that I pointed to the lack of |
One case that would benefit if |
The "use a 1.5 growth strategy" suggestion comes up a lot. However:
Folly's size growth code seems to be similar, today, to what it was, with no exceptional difference. The last fix was to a typo. I almost find that suspicious, as the past 7 years have certainly yielded a plethora of research on allocations and their efficiency, such as mimalloc. |
I found this from https://users.rust-lang.org/t/reserve-truly-exact-capacity-for-vec/109751/19 . Some parts of this discussion are saying things about making |
What actually should be done is that |
Background
Currently, Vec's documentation frequently contains the caveat that "more space may be reserved than requested". This is primarily in response to the fact that jemalloc (or any other allocator) can actually reserve more space than you requested because it relies on fixed size-classes to more efficiently dole out memory. (see the table here)
Jemalloc itself exposes a `malloc_usable_size` function which can be used to determine how much capacity was actually allocated for a pointer, as well as `nallocx` which can be used to determine how much capacity will be allocated for a given size and alignment. Vec can in principle query one of these methods and update its capacity field with the result.
The question at hand is: is it ever worth it to check `usable_capacity`, and is it ever undesirable?
This issue was kicked off by Firefox devs who have experience doing exactly this optimization, and claim it is profitable. Facebook has also claimed excellent dividends in making its allocation strategy more jemalloc friendly, but to my knowledge does not actually query usable_capacity.
Currently our allocation strategy is almost completely naive. One can see the source here, which boils down to:
To the best of my knowledge, this is all the capacity logic for FBVector, which boils down to:
The only major deviation from Rust today being a 1.5 growth factor for "moderate" sized allocations. Note that there is a path in their small_vector type that queries malloc_usable_size.
Unfortunately I've been unable to find good information on what Firefox does here. The best I could find is this discussion which demonstrates big wins -- 6.5% fewer calls to malloc and 11% less memory usage. However the effort seems to peter out when it is asserted that this is obsoleted by Gecko just using power-of-two capacities. mxr and dxr seem to suggest it's only used for statistics.
Hopefully Firefox and Facebook devs can chime in on any experiences.
Going Forward
I have several outstanding concerns before we push further on this matter.
Rust is not C++, so I am partially suspicious of any benchmarks that demonstrate value in C++. In particular, the proliferation of `size_hint` and `extend` could completely blast away any need to be clever with capacities for most programs. I would also expect a large Rust application to frob the allocator less in general just because we default to move semantics, and don't have implicit assignment/copy constructors. Also, Rust programs are much more "reckless" with just passing around pointers into buffers because the borrow checker will always catch misuse. Slices in particular make this incredibly ergonomic. We need Rust-specific benchmarks. I'm sure the Hyper, Servo, Glium, Serde, and Rust devs all have some interesting workloads to look at.
Rust's Vec is basically the community's `malloc`/`calloc`, because actual malloc is unstable and unsafe. We explicitly support extracting the Vec's pointer and taking control of the allocation. In that regard I believe it is desirable for it to be maximally well-behaved and performant for "good" cases (knowing exactly the capacity you want, having no use for extra capacity). There's a certain value in requesting a certain capacity, and getting the capacity. Both `nallocx` and `malloc_usable_size` are virtual function calls with non-trivial logic, and may have unacceptable overhead for responsible users of Vec.
Note that anything we do must enable `Vec` and `Box<[T]>` to reliably roundtrip through each other without reallocating in certain cases. If I invoke `shrink_to_fit` or `with_capacity`, it ought to not reallocate when I try to convert to a `Box<[T]>`. As far as I can tell, this should be possible to uphold even when using `malloc_usable_size` because jemalloc is "fuzzy" and only requires that the given size is somewhere between the requested one and the usable one.
Anything we do must also work with allocators that aren't jemalloc. This may be as simple as setting `usable_size = requested_size` for every allocator that isn't jemalloc.
CC @pnkfelix @erickt @reem @seanmonstar @tomaka @SimonSapin @larsberg @pcwalton @Swatinem @nnethercote
CC #29848 #29847 #27627
CC @rust-lang/libs
EDIT(workingjubilee, 2023-05-06): Originally one of the links touched on the master branch of folly instead of a permalink to a specific commit. Linkrot claimed the one to jemalloc's docs. These have been updated with a permalink and a Wayback Machine link, respectively.
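As a concrete illustration of the `Vec` / `Box<[T]>` round-trip requirement from the list above, here is a small sketch using only stable std methods (the helper name is made up). It checks lengths and capacities only; whether the conversion avoids a reallocation depends on the capacity matching the length, which is exactly what the issue wants to keep reliable.

```rust
/// Builds a full vector of `n` elements, converts it to a boxed slice and
/// back, and returns the resulting (len, capacity).
fn round_trip(n: usize) -> (usize, usize) {
    let mut v: Vec<u32> = Vec::with_capacity(n);
    v.extend(0..n as u32);

    // `into_boxed_slice` drops any excess capacity, like `shrink_to_fit`.
    let boxed: Box<[u32]> = v.into_boxed_slice();

    // `into_vec` reuses the box's allocation; capacity equals the length.
    let back: Vec<u32> = boxed.into_vec();
    (back.len(), back.capacity())
}

fn main() {
    let (len, cap) = round_trip(4);
    println!("len = {len}, capacity = {cap}");
}
```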