Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Parquet] support passing a RowRange to RecordBatchReader #38865

Open
binmahone opened this issue Nov 23, 2023 · 5 comments · May be fixed by #39608 or #39731
Open

[C++][Parquet] support passing a RowRange to RecordBatchReader #38865

binmahone opened this issue Nov 23, 2023 · 5 comments · May be fixed by #39608 or #39731

Comments

@binmahone
Copy link

binmahone commented Nov 23, 2023

Describe the enhancement requested

Currently GetRecordBatchReader API accepts row_group_indices and column_indices. It would be nice to extend the API to accept one more parameter: A row_ranges indicating a subset of rows to be retrieved. With the provided row_ranges, RecordBatchReader can skip unnecessary pages (by comparing the row_ranges with the might-exist page index) as well as unwanted rows.

  • original:
  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
                                       const std::vector<int>& column_indices,
                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
  • proposal:
  ::arrow::Status GetRecordBatchReader(
      const std::vector<int>& row_group_indices, const std::vector<int>& column_indices,
      const std::shared_ptr<std::map<int, RowRangesPtr>>& row_ranges_map,  # a row_ranges per Row Group
      std::shared_ptr<::arrow::RecordBatchReader>* out);

API clients can query page index or other kinds of index (e.g. external secondary index) to construct the row_ranges.

Component(s)

C++

@emkornfield
Copy link
Contributor

I raised this issue on the PR but I think we should discuss in more detail:

  1. API and implementation for RowRanges representation (there is already a similar to issue to this in [C++][Parquet] Using BMI to implement filter pushdown #37559). Part of the discussion is what functionality we want to support and complexity trade-offs, the API in the PR has a large set of manipulation functions).
  2. API exposed. I'd suggest instead of the one proposed we go for something like:
::arrow::Result<std::unique_ptr<RecordBatchReader>> GetRecordBatchReader(const RowRanges& rows_to_return,
const std::vector<int>& column_indices);

helpers can be provided to construct initial row ranges from a set of row groups if needed, but it avoids having to track them separately.

@wgtmac
Copy link
Member

wgtmac commented Nov 27, 2023

I'm refining the design doc: https://docs.google.com/document/d/1SeVcYudu6uD9rb9zRAnlLGgdauutaNZlAaS0gVzjkgM. Hopefully we can make a consensus on the API before implementation. @emkornfield

@emkornfield
Copy link
Contributor

@wgtmac thanks, the doc doesn't appear to allow commenting yet (we can try to iterate there)

@wgtmac
Copy link
Member

wgtmac commented Nov 28, 2023

Oops. Sorry about that. I have changed the doc to allow comment. Could you confirm that? @emkornfield

@emkornfield
Copy link
Contributor

@wgtmac Yes, left some comments on the doc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment