-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FlatMap, Flatten appear to optimize badly #87411
Comments
@adrian17 would it be possible to license those benchmarks in a way that's compatible with the rustc licenses so they can be included in the std benchmarks? |
Looking at the last few benchmarks, it would appear that the main issue is that flatten doesn't vectorize well: I didn't check out the other ones, but there may be 2 issues, one where flatten doesn't vectorize well, and the other where collect cannot preallocate the correct amount. As for the original change, imho converting everything into an iterator probably is not more idiomatic and is likely less efficient especially when already working with slices, because it's relying on propagation of traits in the iterators which may not all be implemented, and also the compiler to optimize through more abstract levels. That being said, the perf regression is surprising, and worth digging into. Edit: When looking at |
actually, this is expected. https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb |
At least in the array case the flatten version can be made faster than the collect loops
The nested data structure is a much tougher nut to crack since a good size hint isn't available and internal iteration doesn't have much of an advantage due to reallocation checks needed inside the loop. It basically uses the fallback code path and previous attempts to improve it also lead to compile time regressions. |
My bad, forgot to attach the MIT license :) Done. |
This will be addressed by #87431
These are just bad practice. It's a large collection of many tiny vecs on the heap which means the CPU spends more time chasing pointers and checking the dynamic length than moving data. The flatten performance is not ideal but throwing additional compiler flags at the problem shrinks the difference to a point where I don't think it's worth to optimize it. Picking better data structures is a much lower-hanging fruit.
This one is the most difficult one. In principle this is quite optimizable, 5x speedups should be easily possible, maybe 20x in some scenarios. But it might need wider changes than just a specialization or implementing another method. Changing I'm not sure how big the appetite for more optimization traits is among the libs team. @adrian17 does this represent a real use or is it merely an extrapolation from the initial report? |
Indeed, most of these cases I added artificially for wider comparison. That said, I've seen cases of iteration looking like |
implement fold() on array::IntoIter to improve flatten().collect() perf With rust-lang#87168 flattening `array::IntoIter`s is now `TrustedLen`, the `FromIterator` implementation for `Vec` has a specialization for `TrustedLen` iterators which uses internal iteration. This implements one of the main internal iteration methods on `array::Into` to optimize the combination of those two features. This should address the main issue in rust-lang#87411 ``` # old test vec::bench_flat_map_collect ... bench: 2,244,024 ns/iter (+/- 18,903) # new test vec::bench_flat_map_collect ... bench: 172,863 ns/iter (+/- 2,141) ```
@adrian17 next nightly should have the fix, can you test it? |
Yeah, it looks good now, thank you! I'm still feeling like
Could you please link where these discussions were having place? |
Great, then I'll close this one.
If it becomes an issue in real code that'll be worth a separate issue. Currently it's a theoretical concern and may not be worth the necessary optimization work.
I asked on zulip in a discussion relating to other iterator optimizations and was pointed to this comment rust-lang/rfcs#3058 |
Context: I recently saw a perf regression caused by the following change:
The change was reasonable, with assumption that this would be more idiomatic and guarantee that the output Vec is preallocated. Unfortunately, this made the function several times slower. Recently-merged #87168 helped here slightly, but the generated code is still much slower.
The same can also be observed for other code using
flat_map
orflatten
, like a simple iteration over the iterator, and regardless of whether the flattened type has known size (array) or not.I made an example benchmark in repo https://github.com/adrian17/flat_map_perf , with the following results on my machine (with today's nightly rustc):
The text was updated successfully, but these errors were encountered: