Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Update num_rows_per_file to min_rows_per_file #49978

Merged
merged 3 commits into from
Jan 23, 2025

Conversation

richardliaw
Copy link
Contributor

@richardliaw richardliaw commented Jan 20, 2025

Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double check whether or not this is the best way to expose this.

Closes #45393

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
from ray.data._internal.util import (
AllToAllAPI,
ConsumptionAPI,
_validate_rows_per_file_args,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why _ prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gvspraveen
Copy link
Contributor

Nice! Seems more clear with this change.

Comment on lines 105 to 109
@property
def num_rows_per_write(self) -> Optional[int]:
def min_rows_per_write(self) -> Optional[int]:
"""The target number of rows to pass to each :meth:`~ray.data.Datasink.write` call.

If ``None``, Ray Data passes a system-chosen number of rows.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datasink is a developer API. I doubt anyone is overriding this method, but there is a chance this could break third-party implementations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense... but do keep in mind:

Developer APIs are lower-level methods explicitly exposed to advanced Ray users and library developers. Their interfaces may change across minor Ray releases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this'll probably break Lance's implementation again (we just broke it recently #47942 (comment)): https://github.com/lancedb/lance/blob/9c08d2d1ca2ff72bd1c2ea4a84400a7b655b2172/python/python/lance/ray/sink.py#L234-L236.

Maybe let's maintain backwards compabaility? cc @raulchen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the semantics for their parameter there is literally the opposite of ours (ours is min, theirs is max)

Aside from that -- perhaps we can keep this method interface around and log a warning for deprecation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, now I think about it -- this doesn't actually break the implementation right? * In terms of behavior, their expected behavior was different than what we provided

  • In terms of actual code correctness, the new state will be as if they didn't implement min_rows_per_write, which will then return None.

So they'll basically be overwriting a function that doesn't get used, but won't break stuff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

I trust your judgement on how we handle (or don't handle) the API change

@richardliaw richardliaw marked this pull request as ready for review January 22, 2025 23:45
@richardliaw richardliaw requested a review from a team as a code owner January 22, 2025 23:45
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
@richardliaw richardliaw added data Ray Data-related issues go add ONLY when ready to merge, run all tests labels Jan 22, 2025
@richardliaw richardliaw merged commit 82274f2 into ray-project:master Jan 23, 2025
6 checks passed
anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.

Closes ray-project#45393

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Anson Qian <anson627@gmail.com>
anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.

Closes ray-project#45393

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Anson Qian <anson627@gmail.com>
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.


Closes #45393 

## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025
## Why are these changes needed?

num_rows_per_file makes it seem like there is "exact" semantics, whereas
the actual underlying implementation is more like "at least".

This updates the parameter name to reflect that, but we should double
check whether or not this is the best way to expose this.

Closes ray-project#45393

## Related issue number

<!-- For example: "Closes ray-project#1234" -->

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Puyuan Yao <williamyao034@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data Ray Data-related issues go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data] num_rows_per_file doesn't work with small values
3 participants