[Feature Request]: Allow one to bound the size of output shards when writing to files. #22129

Closed
robertwb opened this issue Jul 1, 2022 · 0 comments · Fixed by #22130

Contributor

robertwb commented Jul 1, 2022

What would you like to happen?

By default, Beam automatically chooses the number of shards to write, but in some cases the resulting shards are significantly larger or smaller than desired. Users can set the number of shards manually, but this often carries a significant performance cost because it forces all data to be reshuffled, and it also requires guessing the total output size to pick the right number of shards.

While creating too many small shards may be difficult to rectify (distinct bundles cannot write to the same shard), writing multiple shards from a single bundle is a straightforward way to enforce an upper bound on shard size.
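To make the idea concrete, here is a minimal sketch in plain Python of a writer that rolls over to a new shard whenever the current one would exceed a byte limit. It is only an illustration of the technique, not Beam's sink API: `RollingShardWriter`, `write_bundle`, `max_bytes_per_shard`, and the file naming scheme are all hypothetical names assumed for this example.

```python
import os


class RollingShardWriter:
    """Hypothetical helper: writes one bundle's records, starting a new
    shard file whenever the current shard would exceed the byte limit."""

    def __init__(self, directory, bundle_id, max_bytes_per_shard):
        self.directory = directory
        self.bundle_id = bundle_id
        self.max_bytes_per_shard = max_bytes_per_shard
        self.shard_paths = []
        self._file = None
        self._bytes_in_shard = 0

    def _open_next_shard(self):
        # Close the current shard (if any) and start a fresh one.
        self.close()
        name = f"bundle-{self.bundle_id}-shard-{len(self.shard_paths):05d}.txt"
        path = os.path.join(self.directory, name)
        self._file = open(path, "wb")
        self._bytes_in_shard = 0
        self.shard_paths.append(path)

    def write(self, record):
        line = record + b"\n"
        would_overflow = (
            self._bytes_in_shard > 0
            and self._bytes_in_shard + len(line) > self.max_bytes_per_shard
        )
        # Roll over when no shard is open yet or the next record would not fit.
        if self._file is None or would_overflow:
            self._open_next_shard()
        self._file.write(line)
        self._bytes_in_shard += len(line)

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


def write_bundle(records, directory, bundle_id, max_bytes_per_shard):
    """Writes all records of one bundle and returns the shard paths created."""
    writer = RollingShardWriter(directory, bundle_id, max_bytes_per_shard)
    for record in records:
        writer.write(record)
    writer.close()
    return writer.shard_paths
```

In this sketch no reshuffle is needed, since each bundle only ever splits its own output. A single record larger than the limit still produces one oversized shard, so the bound holds only for records that individually fit.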

Issue Priority

Priority: 2

Issue Component

Component: io-py-common
