
Enable async processing for SDF on Spark runner #23852 #24837

Merged: 13 commits into apache:master, Feb 2, 2023

Conversation

JozoVilcek
Contributor

Enables SparkRunner to process SDF functions which can generate large output (such as ParquetIO) without the need to fit the output into memory.
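The core idea can be illustrated with a minimal standalone sketch (all names here are hypothetical, not the actual runner code): a producer thread stands in for the DoFn emitting a large SDF output, and a consumer drains a bounded queue, so the full output never has to be buffered in memory at once:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Main {
  public static void main(String[] args) throws Exception {
    // Bounded queue: the producer blocks once 16 elements are buffered,
    // so the full output never has to fit into memory at once.
    BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
    final int total = 100_000;
    ExecutorService executor = Executors.newSingleThreadExecutor();
    // Producer: stands in for the DoFn emitting a large SDF output.
    executor.submit(() -> {
      try {
        for (int i = 0; i < total; i++) {
          queue.put(i); // blocks while the queue is full
        }
        queue.put(-1); // sentinel signalling end of output
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    // Consumer: stands in for the output iterator handed to Spark.
    long sum = 0;
    long count = 0;
    for (int v = queue.take(); v != -1; v = queue.take()) {
      sum += v;
      count++;
    }
    executor.shutdown();
    System.out.println(count + " " + sum);
  }
}
```

The consumer sees all 100,000 elements even though at most 16 are ever buffered at a time.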


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@JozoVilcek
Contributor Author

R: @je-ik
R: @aromanenko-dev
R: @mosche

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@mosche
Member

mosche commented Dec 30, 2022

Thanks @JozoVilcek, that's awesome. I won't get to this today, but I'll have a look early next week!

Contributor

@je-ik left a comment


I think the approach is correct overall, there are a few questions and mostly, we need to test this against complete @ValidatesRunner suite. Plus it would be good to test the iterators independently using unit tests.

Member

@mosche left a comment


Thanks so much for your contribution @JozoVilcek!
I've added a few more comments. Note, I've also pointed out a few things that are not issues of this PR, but existed before. Feel free to ignore those if you prefer.

I'm still scratching my head a bit about naming of the different components (lol, as always 🙈). But to be fair I also can't propose anything that feels more intuitive ...

@@ -180,14 +186,14 @@ public TimerInternals timerInternals() {
DoFnRunnerWithMetrics<InputT, OutputT> doFnRunnerWithMetrics =
new DoFnRunnerWithMetrics<>(stepName, doFnRunner, metricsAccum);

return new SparkProcessContext<>(
SparkProcessContext<Object, InputT, OutputT> ctx =
new SparkProcessContext<>(
Member


Just throwing it out here as an option to consider ... given there are already a lot of moving pieces involved and the context has just become a container without functionality, everything in there except for the runner could already be passed to the processor when initializing it (same for the input iterator). The only thing that then has to be passed to process is the runner itself.

Contributor Author


It actually cannot be put into the constructor, because the processor is responsible for providing the correct instance of the output manager, which is needed for doFnRunner construction. Therefore I chose to wrap these into a container and pass it to the processor as a context object.

Member


because processor is responsible for providing correct instance of output manager which is needed doFnRunner construction

That's more or less what I meant; you can put everything into the constructor of the processor except the runner itself. E.g. process could look like this then:

processor.process(iter, doFnRunnerWithMetrics)

or even as below if you pass the input iterator into the constructor as well.

processor.process(doFnRunnerWithMetrics)
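The construction order being discussed can be sketched roughly as follows; all names are hypothetical and only illustrate the shape of the suggested API, where the processor is built first and owns the output manager, the runner is built against that output manager, and only the runner is passed to process():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class Main {
  // Hypothetical stand-in for Beam's output manager interface.
  interface OutputManager {
    void output(String value);
  }

  // The processor is constructed first and owns the output manager.
  static class Processor {
    private final List<String> collected = new ArrayList<>();
    private final OutputManager outputManager = collected::add;

    OutputManager getOutputManager() {
      return outputManager;
    }

    // Only the runner is passed to process().
    List<String> process(Consumer<OutputManager> doFnRunner) {
      doFnRunner.accept(outputManager);
      return collected;
    }
  }

  public static void main(String[] args) {
    Processor processor = new Processor();
    // The DoFn runner is built against the processor's output manager...
    Consumer<OutputManager> doFnRunnerWithMetrics = om -> om.output("hello");
    // ...and is then the only argument to process().
    System.out.println(processor.process(doFnRunnerWithMetrics));
  }
}
```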

Member


@JozoVilcek Please feel free to discard or ignore! This code here is quickly hacked together and just meant to demonstrate what I had in mind above. SparkInputDataProcessor was replaced by SparkOutputManager, no SparkProcessContext and output iterator needed anymore.

@JozoVilcek
Contributor Author

@je-ik, @mosche I hope I have addressed all suggested changes except unit tests for the iterators (on my todo list for later). ValidatesRunner did pass for me locally.

@mosche
Member

mosche commented Jan 5, 2023

Thanks so much @JozoVilcek. I'm off tomorrow, but I'll have a look on Monday.

@mosche
Member

mosche commented Jan 5, 2023

Run Java PreCommit

@mosche
Member

mosche commented Jan 5, 2023

Run Spark ValidatesRunner

Member

@mosche left a comment


@JozoVilcek Thanks so much for addressing all the comments, even the existing tech debt I pointed out 🎉 A couple of small things are left, please have a look.
@je-ik Anything else left from your side?

@@ -80,6 +77,7 @@
private final boolean stateful;
private final DoFnSchemaInformation doFnSchemaInformation;
private final Map<String, PCollectionView<?>> sideInputMapping;
private final boolean useBoundedOutput;
Member


nit, maybe even useBoundedConcurrentOutput or useBoundedParallelOutput?

Contributor Author


Addressed


if (outputProducerTask == null) {
outputProducerTask =
executorService.submit(
Member


Please move this initialization code into a private helper startOutputProducerTask() or similar for better readability.
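A minimal sketch of the suggested extraction, with hypothetical names (not the PR's actual code): the lazy null-check stays at the call site, while the submission details move into a named helper for readability:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Main {
  private final ExecutorService executorService = Executors.newSingleThreadExecutor();
  private Future<?> outputProducerTask;

  // Lazy start stays with the caller; the submission details live in a
  // named helper, as suggested in the review comment.
  void ensureOutputProducerStarted() {
    if (outputProducerTask == null) {
      outputProducerTask = startOutputProducerTask();
    }
  }

  private Future<?> startOutputProducerTask() {
    return executorService.submit(() -> System.out.println("producing"));
  }

  public static void main(String[] args) throws Exception {
    Main m = new Main();
    m.ensureOutputProducerStarted();
    m.outputProducerTask.get(); // wait for the task to finish
    m.executorService.shutdown();
  }
}
```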

Contributor Author


Addressed

@mosche
Member

mosche commented Jan 10, 2023

@JozoVilcek also note, there's one conflict with the master branch

@JozoVilcek
Contributor Author

Hi @mosche, just quick feedback: I appreciate all the comments and suggestions. I do plan to address them all by the beginning of next week. Apologies for the delay due to some other priorities on my side.

@mosche
Member

mosche commented Jan 18, 2023

Thanks, no worries @JozoVilcek :)

@JozoVilcek JozoVilcek force-pushed the spark-async-support-for-multido-fn branch from e6523ea to 3a52456 Compare January 31, 2023 11:35
@JozoVilcek
Contributor Author

Run Java PreCommit

@JozoVilcek
Contributor Author

Run Portable_Python PreCommit

@JozoVilcek
Contributor Author

Run Spark ValidatesRunner

@JozoVilcek
Contributor Author

Run Spark ValidatesRunner

@JozoVilcek JozoVilcek requested review from mosche and je-ik and removed request for mosche and je-ik January 31, 2023 15:40
@mosche
Member

mosche commented Feb 1, 2023

Run Portable_Python PreCommit

Member

@mosche left a comment


LGTM, thanks a lot @JozoVilcek 🎉

Just two final optional suggestions:

  • How about naming the experiment as done in MultiDoFnFunction, so
    use_bounded_concurrent_output_for_sdf instead of use_bounded_output_for_sdf?
  • And how about mentioning the experiment as improvement in CHANGES.md with instructions how to enable?

But I'm also happy to merge as is, let me know.
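For illustration, a hedged sketch of how such an experiment flag is typically checked; the helper below is hypothetical (not Beam's actual ExperimentalOptions API), and only the experiment name use_bounded_concurrent_output_for_sdf comes from this thread:

```java
import java.util.Arrays;
import java.util.List;

public class Main {
  // Hypothetical helper, not Beam's actual ExperimentalOptions API.
  static boolean hasExperiment(List<String> experiments, String name) {
    return experiments != null && experiments.contains(name);
  }

  public static void main(String[] args) {
    // What a user would pass as --experiments=... on the pipeline options.
    List<String> experiments =
        Arrays.asList("use_bounded_concurrent_output_for_sdf");
    System.out.println(
        hasExperiment(experiments, "use_bounded_concurrent_output_for_sdf"));
  }
}
```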

@mosche
Member

mosche commented Feb 2, 2023

Run Spark ValidatesRunner

@mosche mosche merged commit 01aa470 into apache:master Feb 2, 2023
@mosche
Member

mosche commented Feb 2, 2023

Thanks so much @JozoVilcek, merged 🎉

@JozoVilcek JozoVilcek deleted the spark-async-support-for-multido-fn branch February 2, 2023 11:40