-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[data] Update num_rows_per_file to min_rows_per_file #49978
Conversation
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
from ray.data._internal.util import ( | ||
AllToAllAPI, | ||
ConsumptionAPI, | ||
_validate_rows_per_file_args, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why _
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a weak indicator for privacy -
Nice! Seems more clear with this change. |
@property | ||
def num_rows_per_write(self) -> Optional[int]: | ||
def min_rows_per_write(self) -> Optional[int]: | ||
"""The target number of rows to pass to each :meth:`~ray.data.Datasink.write` call. | ||
|
||
If ``None``, Ray Data passes a system-chosen number of rows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Datasink
is a developer API. I doubt anyone is overriding this method, but there is a chance this could break third-party implementations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense... but do keep in mind:
Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this'll probably break Lance's implementation again (we just broke it recently #47942 (comment)): https://github.com/lancedb/lance/blob/9c08d2d1ca2ff72bd1c2ea4a84400a7b655b2172/python/python/lance/ray/sink.py#L234-L236.
Maybe let's maintain backwards compabaility? cc @raulchen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, the semantics for their parameter there is literally the opposite of ours (ours is min, theirs is max)
Aside from that -- perhaps we can keep this method interface around and log a warning for deprecation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, now I think about it -- this doesn't actually break the implementation right? * In terms of behavior, their expected behavior was different than what we provided
- In terms of actual code correctness, the new state will be as if they didn't implement min_rows_per_write, which will then return None.
So they'll basically be overwriting a function that doesn't get used, but won't break stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
I trust your judgement on how we handle (or don't handle) the API change
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Anson Qian <anson627@gmail.com>
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Anson Qian <anson627@gmail.com>
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes #45393 ## Related issue number <!-- For example: "Closes #1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Why are these changes needed? num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least". This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this. Closes ray-project#45393 ## Related issue number <!-- For example: "Closes ray-project#1234" --> ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
Why are these changes needed?
num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least".
This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this.
Closes #45393
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.