[C++][Parquet] support passing a RowRange to RecordBatchReader #38865
Comments
I raised this issue on the PR but I think we should discuss in more detail:
helpers can be provided to construct initial row ranges from a set of row groups if needed, but it avoids having to track them separately.
I'm refining the design doc: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM. Hopefully we can reach a consensus on the API before implementation. @emkornfield
@wgtmac thanks, the doc doesn't appear to allow commenting yet (we can try to iterate there)
Oops. Sorry about that. I have changed the doc to allow comments. Could you confirm that? @emkornfield
@wgtmac Yes, left some comments on the doc
Describe the enhancement requested
Currently the GetRecordBatchReader API accepts row_group_indices and column_indices. It would be nice to extend the API to accept one more parameter: row_ranges, indicating the subset of rows to be retrieved. With the provided row_ranges, RecordBatchReader can skip unnecessary pages (by comparing the row_ranges against the page index, when one exists) as well as unwanted rows within the pages it does read.
API clients can query the page index or other kinds of index (e.g. an external secondary index) to construct the row_ranges.
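To illustrate the page-skipping idea, here is a minimal sketch of how requested row ranges might be compared against per-page first-row indexes (the kind of information the Parquet offset index exposes). The `RowRange` struct and the `PagesToRead` function are hypothetical names made up for this example; they are not the API proposed in the design doc or anything that exists in Arrow today. The sketch assumes ranges are inclusive, sorted, and non-overlapping within a single row group.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only: a closed interval [from, to]
// of row indexes within one row group.
struct RowRange {
  int64_t from;
  int64_t to;  // inclusive
};

// Returns the indexes of the pages that overlap at least one requested range.
// page_first_rows[i] is the index of the first row of page i (ascending),
// and num_rows is the total number of rows in the row group.
// Assumes `ranges` is sorted by `from` and contains no overlapping entries.
std::vector<std::size_t> PagesToRead(const std::vector<int64_t>& page_first_rows,
                                     int64_t num_rows,
                                     const std::vector<RowRange>& ranges) {
  std::vector<std::size_t> pages;
  std::size_t r = 0;
  for (std::size_t p = 0; p < page_first_rows.size(); ++p) {
    const int64_t page_begin = page_first_rows[p];
    // The last row of a page is one before the next page's first row.
    const int64_t page_end =
        (p + 1 < page_first_rows.size() ? page_first_rows[p + 1] : num_rows) - 1;
    // Drop ranges that end before this page starts.
    while (r < ranges.size() && ranges[r].to < page_begin) ++r;
    if (r == ranges.size()) break;  // no remaining ranges, skip the rest
    if (ranges[r].from <= page_end) pages.push_back(p);
  }
  return pages;
}
```

For example, with pages starting at rows 0, 100, 200 and 300 in a 400-row row group, requesting the ranges [50, 60] and [250, 310] selects pages 0, 2 and 3, so page 1 never needs to be read or decoded. Unwanted rows inside the selected pages would still have to be skipped by the reader.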
Component(s)
C++