Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This fixes the parallel workload imbalance mentioned in my PSE/Taiko codebase review.
Currently if we take 40 items divided into 12 threads (AMD Ryzen 7800X, Apple M2 Pro or Intel i5-12600) the partitioning will lead to 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 7 = 3*11 + 7 = 40. The “remainder” thread will have 2.33x more work to do.
This can be quite extreme for example if we have 351 items to split on 32 cores, 351/32 = 10.96, rounded to 10 with integer division. 10*31 = 310, the last core needs to process 41 items, 4.1x more than the others.
This ensures whatever the number of items and the number of threads, the workload varies by at most 1.
This probably explains why not all cores are used while benchmarking (benchmarks TODO).
Unsafe
Note: I'm not familiar with Rust standard library and parallelism development since October 2016 (my last foray into writing Rust programs).
I've looked into:
Note of the chunking routines provides balanced partitioning of a range.
chunks_mut
array_chunks
and requires a separateremainder
callas_chunks
I might very well have missed something though.