
[Discussion] Is it possible to impl IntoIterator for ParallelIterator? #1179

Closed
ZhennanWu opened this issue Jul 7, 2024 · 3 comments

ZhennanWu commented Jul 7, 2024

This is a realistic need that surfaced during development of EPGI, a parallel GUI library built on rayon. I'll explain the motivation step by step.

Motivation

Variable parallelism in GUI framework

EPGI uses rayon in various pipeline stages, and the most common usage is to visit a node's children in parallel. However, child count varies hugely across the node tree, and it is known that different rayon parallelism primitives (or simply non-rayon serial code) have vastly different performance characteristics with respect to child count.

For widgets with dynamic children, this calls for a runtime selection mechanism for parallelism. A Column widget in a page layout may have only two children but be nested 100 levels deep, while another Column rendering a data-science visualization may have no nesting but 10k children instead. Both cases are computationally heavy and need their own optimizations.

Limitation of runtime selection mechanism API designs

Given a Vec, a straightforward runtime selection mechanism can be implemented as:

    fn par_for_each_vec<T: Send, F: Fn(T) + Send + Sync>(threadpool: &ThreadPool, mut vec: Vec<T>, f: F) {
        match vec.len() {
            0 => {}
            1 => f(vec.remove(0)),
            2..=16 => {
                threadpool.scope(|s| {
                    let f_ref = &f;
                    for elem in vec {
                        s.spawn(move |_| f_ref(elem));
                    }
                });
            }
            _ => threadpool.install(|| vec.into_par_iter().for_each(f)),
        }
    }

We can duplicate this for for_each/map/unzip and for Vec/&Vec/&mut Vec. No problem for now.

However, a problem surfaces once we get to zip and MultiZip. An example:

Suppose we have a widget that can have a huge number of children, and its implementation needs rayon's MultiZip to zip three &Vecs. Now we want to introduce a runtime selection version:

    fn par_zip3_ref_for_each_vec<
        T1: Sync,
        T2: Sync,
        T3: Sync,
        F: Fn(&T1, &T2, &T3) + Send + Sync,
    >(
        threadpool: &ThreadPool,
        vec1: &Vec<T1>,
        vec2: &Vec<T2>,
        vec3: &Vec<T3>,
        f: F,
    ) {
        let len = std::cmp::min(std::cmp::min(vec1.len(), vec2.len()), vec3.len());
        match len {
            0 => {}
            1 => f(&vec1[0], &vec2[0], &vec3[0]),
            2..=16 => {
                threadpool.scope(|s| {
                    let f_ref = &f;
                    for ((elem1, elem2), elem3) in std::iter::zip(std::iter::zip(vec1, vec2), vec3)
                    {
                        s.spawn(move |_| f_ref(elem1, elem2, elem3));
                    }
                });
            }
            _ => threadpool.install(|| {
                (vec1, vec2, vec3)
                    .into_par_iter()
                    .for_each(|(elem1, elem2, elem3)| f(elem1, elem2, elem3))
            }),
        }
    }

Seems fine. But this only handles the (&Vec, &Vec, &Vec) case. What about (Vec, &mut Vec, &Vec)? We get a combinatorial explosion! And that's just for a zip-3!

As a legitimate real-world example, here are two pieces of code in EPGI that are waiting to be parallelized. Both use different versions of zip-3.
https://github.com/ZhennanWu/epgi/blob/850ed7556bb8d619b430f337a0c90da05de72de8/epgi-common/src/basic/stack.rs#L216-L219

https://github.com/ZhennanWu/epgi/blob/850ed7556bb8d619b430f337a0c90da05de72de8/epgi-common/src/basic/stack.rs#L238-L241

Switching between Iterator and ParallelIterator

One way to circumvent this issue is to somehow merge the capabilities of serial and parallel iterators. Requiring IntoIterator as a supertrait of IntoParallelIterator seems implausible, since rayon's IntoParallelIterator is implemented for far more primitive types than std's IntoIterator, and we can't force implementations into std. However, I think it is quite possible to make ParallelIterator: IntoIterator.

Then we could write (pseudocode):

    let it = data.into_par_iter();
    if len < 16 {
        threadpool.scope(|s| {
            it.into_iter().for_each(|elem| s.spawn(move |_| f(elem)))
        })
    } else {
        threadpool.install(|| it.for_each(f));
    }

I wonder if this change is possible, or whether there are other practical concerns regarding the design of parallel iterators.

(Note: of course, the above examples can be worked around by introducing intermediate Vecs, but that would defeat the purpose of tuning performance in the first place. This is an experimental project, so I want to see how far it can go w.r.t. zero-cost abstraction.)

Member

cuviper commented Jul 8, 2024

So to be clear, you are not asking for a parallel iterator with serialized output, like #210 or #858. You just want to avoid parallel overhead for smaller inputs?

IMO, the difference between rayon_spawn and rayon_par_iter at the low end is not significant enough to worry about, at least as far as your linked graphs show. I would just use par_iter for all the parallel work.

However, I think it is quite possible to make ParallelIterator: IntoIterator.

It would certainly be a breaking change.

For IndexedParallelIterator, you can get fully serialized behavior by calling .with_min_len(usize::MAX). This lowers to a regular Iterator internally once it gets to Producer::into_iter, but you won't be interacting at this level to use any other Iterator APIs.

You can tune that however you like. Maybe you'll set a known serial/parallel threshold like .with_min_len(4); this would have the effect that it won't split at all up to len 7, and then len 8 will only split once, etc. Or you could make it more dynamic, like:

    indexed_par_iter
        .with_min_len(if len < 16 {
            usize::MAX // no parallelism
        } else {
            1 // full parallelism
        })
        .for_each(f);

... or some gradient therein, as you find makes sense for your workload.

To some extent, you can write your zip variations based on I: IndexedParallelIterator, as long as you also use F: Fn(I::Item) without trying to expand the zipped tuple as separate arguments. Even apart from rayon, Rust doesn't have a good way to deal with that otherwise, lacking support in generics for variadics and different mutability.

Author

ZhennanWu commented Jul 8, 2024

IMO, the difference between rayon_spawn and rayon_par_iter at the low end is not significant enough to worry about,

This would be true for most workloads, but a GUI library with a virtual-DOM layer is another story. The most notorious case is perhaps Google's Flutter: in realistic Flutter apps you will frequently see widget trees hundreds of levels deep. If at each node (and there are numerous nodes throughout the tree) it spins up the whole par_iter subdivision saga, that would be a problem.

But to be fair, I currently don't have a test case to verify this theory, so it may not be that impactful in the end.

you can get fully serialized behavior

This is not entirely what I want. The other reason to get a serialized iterator (besides switching to fully serialized behavior) is to switch to rayon::spawn, which spawns tasks from a serial loop.

The reason to use spawn is that a GUI widget tree is inherently very unbalanced, and parallel iterators are known to cause blocking when their tasks are unbalanced (#1054). So par_iter is not very suitable for most of our cases, but we would still like to use it for the worst child counts. Hence the switching.

    .with_min_len(if len < 16 {
        usize::MAX // no parallelism
    } else {
        1 // full parallelism
    })

Brilliant! I'll certainly employ this idea in my other projects.

Member

cuviper commented Jul 8, 2024

If at each node (and there are numerous nodes throughout the tree) it spins up the whole par_iter subdivision saga, that would be a problem.

I think the par_iter "saga" is not actually as heavy as you imagine, while spawn requires an allocation (HeapJob) for every single call, as well as the counter overhead shown in #1054.

Thanks for the reminder about those completion issues though. We should find a way to make that opt-in for_each_spawn idea work in a "clean" way, or at least "not too dirty"...

For a deep widget tree, I wonder if it's really worth parallelizing the deep/smaller parts at all? It might be fine to let inner parts run sequentially while the outer parts are divided in parallel. This will mostly happen naturally anyway with both par_iter nesting and scope+spawn, until work-stealing gets to the inner stuff.
