
Added parallel_map_for_each #9

Closed
wants to merge 1 commit into from

Conversation

safinaskar

Sometimes one wants to invoke parallel_map while borrowing local data, but doesn't want to mess with scoped threads. So I created this function. It allows one to call map(...).for_each(...) and to borrow data without creating a scoped thread explicitly. Here it is:

    /// Run `m` on items in parallel via scoped threads, then run `f` on each
    /// result on the calling thread, in input order.
    fn parallel_map_for_each<Map, ForEach, O>(
        self,
        m: Map,
        f: ForEach,
    ) -> Result<(), Box<dyn Any + Send + 'static>>
    where
        Self: Sized,
        Self: Iterator,
        Map: Send + Clone,
        Self::Item: Send,
        Map: FnMut(Self::Item) -> O,
        O: Send,
        ForEach: FnMut(O),
    {
        // The scope joins all worker threads before returning, so `m` and
        // `f` may borrow local (non-'static) data.
        crossbeam::thread::scope(move |s| {
            self.parallel_map_scoped(s, m).for_each(f);
        })
    }

Here is a somewhat similar request: #7 .

Such a function may be useful when you process something in parallel and then write the data to a file; that is exactly the case I faced. Personally, I am comfortable calling crossbeam::thread::scope explicitly, but I still think there may be people who don't want to mess with scoped threads.
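For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what "borrowing local data under a scope" means. It uses std::thread::scope (stable since Rust 1.63) in place of crossbeam::thread::scope and a hypothetical `square_all` helper instead of pariter's API; only the borrowing pattern, not the library, is being illustrated:

```rust
use std::thread;

// Hypothetical helper for illustration: square every input in parallel
// while *borrowing* the input slice, then hand results back in order.
fn square_all(inputs: &[u64]) -> Vec<u64> {
    let mut outputs = Vec::new();
    // std::thread::scope plays the same role here as
    // crossbeam::thread::scope in the snippet above: workers may borrow
    // `inputs` because the scope joins them before returning.
    thread::scope(|s| {
        let handles: Vec<_> = inputs
            .chunks(2)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|x| x * x).collect::<Vec<u64>>())
            })
            .collect();
        // Joining in spawn order keeps results in input order.
        for h in handles {
            outputs.extend(h.join().unwrap());
        }
    });
    outputs
}

fn main() {
    let data: Vec<u64> = vec![1, 2, 3, 4, 5];
    let squares = square_all(&data);
    println!("{:?}", squares); // [1, 4, 9, 16, 25]
}
```

No 'static bound is needed anywhere: the scope guarantees the worker threads cannot outlive `data`.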

@dpc
Owner

dpc commented Jul 31, 2023

I have mixed feelings about it. I understand the idea of helping users confused by borrowing and lifetimes, but ...

First, it's unclear why this wouldn't be self.parallel_map_scoped(s, m).parallel_for_each(f); or even a specialized version that would avoid the sorting overhead altogether (by doing each map and for-each together in a thread).

parallel_for_each is actually missing... hmm... I guess it doesn't make sense to have parallel_map_for_each, as it could be done with just parallel_for_each.

Second... it does increase the API surface, making it (very slightly) harder to find other things and to maintain, and it sets a precedent: why not parallel_filter_for_each, etc.?

So after a little debate with myself, I don't think this should land. But thanks for the contribution.

If you can think of any way to improve the documentation to help users find and figure out the scoped version, that would be great.
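The "specialized version" idea mentioned above can be sketched with only std scoped threads (hypothetical `fused_squares` name; not pariter's API). Each worker runs the map step and then immediately runs the for-each step itself, so no result is ever shipped back to the caller for reordering:

```rust
use std::sync::Mutex;
use std::thread;

// Hypothetical sketch of a fused map + for-each: workers perform both
// steps themselves, so nothing is collected and re-sorted centrally.
// The price is that output order is NOT preserved.
fn fused_squares(inputs: &[u64]) -> Vec<u64> {
    let sink = Mutex::new(Vec::new());
    thread::scope(|s| {
        for chunk in inputs.chunks(2) {
            let sink = &sink;
            s.spawn(move || {
                for x in chunk {
                    let y = x * x;                // the "map" step
                    sink.lock().unwrap().push(y); // the "for each" step
                }
            });
        }
    });
    let mut out = sink.into_inner().unwrap();
    out.sort(); // sorted here only to make the demo deterministic
    out
}

fn main() {
    println!("{:?}", fused_squares(&[1, 2, 3, 4]));
}
```

This is the trade-off dpc is pointing at: skipping the reordering step saves work, but the for-each closure then has to tolerate arbitrary result order (or synchronize, as the Mutex does here).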

@safinaskar
Author

Thanks for the answer! Okay, I'm closing this PR, and I will delete the branch.

Also, I created my own analog of rdedup and ran a benchmark comparing my deduplicator to yours and to many others. Mine turned out to be the fastest by a large margin, though I'm cheating by using fixed-size chunking. See borgbackup/borg#7674 . Unfortunately, rdedup does not perform so well even compared to CDC-based deduplicators ( borgbackup/borg#7674 (comment) ). As you can see, desync beats rdedup in speed by a large margin in the load operation.

Also, while creating my chunker (azwyon) I created my own analog of pariter (I was not aware pariter existed): rayon-rs/rayon#1071 . It is based on rayon. Later I converted azwyon to pariter, and now it performs 4% slower (but this is a minor issue).

@safinaskar safinaskar closed this Jul 31, 2023
@safinaskar safinaskar deleted the map_for_each branch August 16, 2023 18:54