Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-10976] Bundle finalization: E2E support #17045

Merged
merged 7 commits into from
Mar 30, 2022

Conversation

damccorm
Copy link
Contributor

@damccorm damccorm commented Mar 8, 2022

This is a continuation of the effort to add Bundle Finalization started in #16980

Summary of Overall Changes

Bundle finalization enables a DoFn to perform side effects after a runner has acknowledged that it has durably persisted the output. Right now, Java and Python support bundle finalization by allowing a user to register a callback function which is invoked when the runner acknowledges that it has persisted all output, but Go does not have any such support. This is part of a larger change to add support to the Go Sdk as outlined in this design doc.

Summary of Changes in this PR

I completed most of the execution changes in the last pr, leaving this PR to handle the user experience and plumbing the user's bundle finalization parameter through to the execution layer.

Additional testing done

On top of the units added, I also was able to run an E2E example on Dataflow (FWIW, only Dataflow currently has bundle finalization support). In that example, I hijacked the wordcount example and added a bundleFinalizer to write a file to persistent storage for each line that had at least 3 words (chosen pretty randomly to minimize the chances of collisions). I'll omit the whole sample since its long, but it produced a bunch of files like this:

image

This indeed ran after the other data was persisted.

I decided not to add an integration test because of the complexity involved. Because only the dataflow runner supports finalization, any integration test would require finding some way to (a) modify dataflow state and query that state (probably not a great option since it requires modifying a devs personal GCP account, and its not obvious what the best thing to do actually would be without knowing more about their config), (b) creating some sort of local endpoint for dataflow to talk back to (technically feasible, but definitely non-trivial - would also add significant complexity outside of what we're actually testing), or (c) using some 3rd party to communicate between the 2 (non-ideal since it adds an extra dependency that isn't part of what's being tested just for this scenario). I'm definitely open to doing this, but at the moment with the information I have it doesn't feel worth it.

Next Steps

After this, I'll update the documentation here to include an example https://github.com/apache/beam/blob/6438626c059c19ff9ca32cd834d0aa62253e531b/website/www/site/content/en/documentation/programming-guide.md#127-bundle-finalization-bundle-finalization


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@github-actions github-actions bot added the go label Mar 8, 2022
@codecov
Copy link

codecov bot commented Mar 8, 2022

Codecov Report

Merging #17045 (7fd583e) into master (95a5c26) will increase coverage by 0.11%.
The diff coverage is 15.44%.

@@            Coverage Diff             @@
##           master   #17045      +/-   ##
==========================================
+ Coverage   73.75%   73.86%   +0.11%     
==========================================
  Files         667      672       +5     
  Lines       87406    87679     +273     
==========================================
+ Hits        64468    64766     +298     
+ Misses      21831    21792      -39     
- Partials     1107     1121      +14     
Flag Coverage Δ
go 49.56% <15.44%> (+0.63%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
sdks/go/pkg/beam/core/runtime/exec/fn_arity.go 8.00% <0.00%> (-0.89%) ⬇️
sdks/go/pkg/beam/core/runtime/graphx/serialize.go 26.27% <0.00%> (-0.17%) ⬇️
sdks/go/pkg/beam/core/util/reflectx/calls.go 0.00% <0.00%> (ø)
sdks/go/pkg/beam/forward.go 40.00% <ø> (ø)
sdks/go/pkg/beam/core/runtime/graphx/translate.go 28.81% <25.00%> (-0.02%) ⬇️
sdks/go/pkg/beam/core/funcx/fn.go 57.54% <57.14%> (-0.04%) ⬇️
sdks/go/pkg/beam/core/runtime/exec/fn.go 69.38% <100.00%> (+0.12%) ⬆️
sdks/go/pkg/beam/core/typex/class.go 85.49% <100.00%> (+0.11%) ⬆️
.../python/apache_beam/transforms/periodicsequence.py 96.72% <0.00%> (-1.64%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 95a5c26...7fd583e. Read the comment docs.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 8, 2022

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @youngoli for label go.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@github-actions
Copy link
Contributor

Reminder, please take a look at this pr: @youngoli

@aaltay aaltay requested a review from youngoli March 18, 2022 00:58
@aaltay
Copy link
Member

aaltay commented Mar 18, 2022

@youngoli - would you be able to review this change?
@damccorm - do you know why there is a large drop in code coverage diff?

@damccorm
Copy link
Contributor Author

do you know why there is a large drop in code coverage diff?

@aaltay I think its because out of 306 non-test lines changed, 188 are generated file changes. In particular, reflectx/calls.go is going from 0% coverage to 0% coverage, but its contributing 164 new untested lines. So unless I manually test the generated files (and I'd argue we should be testing the generation logic instead), the diff is guaranteed to drop. Looking at the per-file diff, I don't feel particularly bad about any of the changes.

@aaltay
Copy link
Member

aaltay commented Mar 18, 2022

do you know why there is a large drop in code coverage diff?

@aaltay I think its because out of 306 non-test lines changed, 188 are generated file changes. In particular, reflectx/calls.go is going from 0% coverage to 0% coverage, but its contributing 164 new untested lines. So unless I manually test the generated files (and I'd argue we should be testing the generation logic instead), the diff is guaranteed to drop. Looking at the per-file diff, I don't feel particularly bad about any of the changes.

I agree with you.

I guess it would make sense to not count generated files toward code coverage. (Neither a high priority, nor a blocker for this PR.)

Copy link
Contributor

@lostluck lostluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is delightfully complete, other than the manual change to a generated file.

#16961 is revamping all the proto generation, which will make things a bit easier for you if you're willing to wait, as it's almost ready to merge.

https://github.com/apache/beam/pull/16961/files#diff-ba774bfc499567af51f1469311fe607b0a41b7e8e57967aa43b53f59c3b4cb0f

I'll also note that the additional bump to the max "optimized" reflectx.Func types is not strictly required. Any user code that's taking in that many parameters is likely doing a large amount of work per element, dominating the reflection overhead. However, no reason not to keep the change as is presently.

Type_KV Type_Special = 11
Type_COGBK Type_Special = 13
Type_WINDOWEDVALUE Type_Special = 14
Type_BUNDLEFINALIZATION Type_Special = 23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never manually update Generated Proto files:

You'll need to add the type to here, and regen the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof - not sure how I missed that this was generated.

I regenerated, if it is convenient to wait for that change to go in that's fine though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lostluck thoughts on taking this now? It will cause a merge conflict, but its inevitable that one of our prs will run into that and this one is ready to go (I think, obviously feel free to review/disagree), and that one has been hanging around for a bit. Would be nice to doc/finish this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with adding it now, I was merely hoping the other PR would get in. Merge conflicts happen, and it's not a hard merge, especially since the PR simplifies the generation step anyway.

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck for label go.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@github-actions
Copy link
Contributor

R: @youngoli for final approval

Copy link
Contributor

@youngoli youngoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, with one small nit that I don't think is worth blocking on. Let me know what you think of that suggestion and I can merge after.

Edit: I'll merge after Robert's comment is addressed too. I didn't notice it in my review but it does need to be addressed.

sdks/go/pkg/beam/core/funcx/fn.go Outdated Show resolved Hide resolved
@lostluck
Copy link
Contributor

lostluck commented Mar 19, 2022 via email

Copy link
Contributor

@lostluck lostluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Feel free to push back on the sort request. We should add it before the 2.39.0 cut at least, and adding it in this PR avoids a potential gap.

Let me know and then I'll merge.

Type_KV Type_Special = 11
Type_COGBK Type_Special = 13
Type_WINDOWEDVALUE Type_Special = 14
Type_BUNDLEFINALIZATION Type_Special = 23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with adding it now, I was merely hoping the other PR would get in. Merge conflicts happen, and it's not a hard merge, especially since the PR simplifies the generation step anyway.

Copy link
Contributor

@lostluck lostluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Merging! Thanks!

@lostluck lostluck merged commit 9d706e3 into apache:master Mar 30, 2022
@damccorm damccorm deleted the users/damccorm/b-finalization branch March 31, 2022 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants