Deprecate read_page_locations() and simplify offset index in ParquetMetaData #6095

Merged 2 commits into apache:53.0.0-dev on Jul 23, 2024

Conversation

@etseidl (Contributor) commented Jul 19, 2024

Targets the 53.0.0-dev branch.

Which issue does this PR close?

Related to #5022.

Rationale for this change

This PR simplifies the offset index handling in ParquetMetaData.

What changes are included in this PR?

This deprecates parquet::file::page_index::index_reader::read_pages_locations() in favor of read_offset_index(), and makes the necessary changes throughout the rest of the Parquet code base to make use of the new OffsetIndexMetaData structure.

Redefines ParquetOffsetIndex to be Vec<Vec<OffsetIndexMetaData>>, and removes unencoded_byte_array_data_bytes from ParquetMetaData as it is now included in the redefined ParquetOffsetIndex.
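To make the shape change concrete, here is a minimal sketch of how the two old per-column structures fold into the new one. The types below are simplified stand-ins written for this illustration, not the parquet crate's actual definitions, and `OldUnencodedBytes` and `merge_offset_index` are hypothetical names; the exact shape of the old `unencoded_byte_array_data_bytes` side table is an assumption.

```rust
// Stand-in types for illustration only; the real definitions live in the
// parquet crate and may differ in detail.
#[derive(Debug, Clone, PartialEq)]
pub struct PageLocation {
    pub offset: i64,
    pub compressed_page_size: i32,
    pub first_row_index: i64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct OffsetIndexMetaData {
    pub page_locations: Vec<PageLocation>,
    pub unencoded_byte_array_data_bytes: Option<Vec<i64>>,
}

// Old shape: [row_group][column][page].
pub type OldOffsetIndex = Vec<Vec<Vec<PageLocation>>>;
// Assumed shape of the old side table: [row_group][column] -> per-page byte counts.
pub type OldUnencodedBytes = Vec<Vec<Option<Vec<i64>>>>;
// New shape: [row_group][column].
pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;

/// Hypothetical helper: pair each column's page locations with its optional
/// unencoded byte counts. `zip` stops at the shorter input, so both
/// structures are expected to have matching dimensions.
pub fn merge_offset_index(
    locations: OldOffsetIndex,
    sizes: OldUnencodedBytes,
) -> ParquetOffsetIndex {
    locations
        .into_iter()
        .zip(sizes)
        .map(|(rg_locations, rg_sizes)| {
            rg_locations
                .into_iter()
                .zip(rg_sizes)
                .map(|(page_locations, unencoded_byte_array_data_bytes)| {
                    OffsetIndexMetaData {
                        page_locations,
                        unencoded_byte_array_data_bytes,
                    }
                })
                .collect()
        })
        .collect()
}
```

With this layout, a caller that previously consulted two parallel structures reads both pieces of information from a single `OffsetIndexMetaData` value per column chunk.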

Are there any user-facing changes?

Yes, this is a breaking change.

@github-actions bot added the parquet (Changes to the parquet crate) label Jul 19, 2024
@alamb added the api-change (Changes to the arrow API) and next-major-release (the PR has API changes and is waiting on the next major version) labels Jul 21, 2024
@alamb (Contributor) left a comment

Thank you @etseidl -- I think this is an improvement. We'll get this parquet metadata stuff sorted out soon

- pub type ParquetOffsetIndex = Vec<Vec<Vec<PageLocation>>>;
+ /// For example `offset_index[2][3]` holds the [`OffsetIndexMetaData`] for
+ /// the fourth column in the third row group of the parquet file.
+ pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;
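
As a rough illustration of the indexing convention in the doc comment above (row group first, then column), the sketch below looks up one column chunk and lists its page offsets. The structs are pared-down stand-ins rather than the parquet crate's types, and `page_offsets` is a hypothetical helper introduced only for this example.

```rust
// Pared-down stand-ins; the real parquet types carry more fields.
pub struct PageLocation {
    pub offset: i64,
    pub compressed_page_size: i32,
    pub first_row_index: i64,
}

pub struct OffsetIndexMetaData {
    pub page_locations: Vec<PageLocation>,
}

pub type ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>;

/// Hypothetical helper: byte offsets of every page in the given column chunk.
/// Uses `get` rather than `[]` so an out-of-range row group or column
/// yields `None` instead of panicking.
pub fn page_offsets(
    index: &ParquetOffsetIndex,
    row_group: usize,
    column: usize,
) -> Option<Vec<i64>> {
    let column_index = index.get(row_group)?.get(column)?;
    Some(column_index.page_locations.iter().map(|p| p.offset).collect())
}
```

Calling `page_offsets(&index, 2, 3)` corresponds to `offset_index[2][3]`, that is, the fourth column of the third row group.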

👍

@alamb merged commit 3bc9987 into apache:53.0.0-dev Jul 23, 2024
16 checks passed
alamb added a commit that referenced this pull request Jul 26, 2024
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer`  that easily accidentally copies data (#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy

* Make display of interval types more pretty (#6006)

* improve display for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.

* Update snafu (#5930)

* Update Parquet thrift generated structures (#6045)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* Revert "Revert "Write Bloom filters between row groups instead of the end  (#…" (#5933)

This reverts commit 22e0b44.

* Revert "Update snafu (#5930)" (#6069)

This reverts commit 756b1fb.

* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)

* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove repeated codes to make the codes more concise. (#6080)

* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
alamb added a commit that referenced this pull request Jul 26, 2024
…aData` (#6105)

* deprecate read_page_locations

* add level histograms to metadata

* add to_thrift() to OffsetIndexMetaData

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* move valid test into ColumnIndexBuilder::append_histograms

* move update_histogram() inside ColumnMetrics

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* Implement LevelHistograms as a struct

* formatting

* fix error in docs

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@etseidl deleted the deprecate_read_page_locations branch July 26, 2024 20:20
alamb added a commit that referenced this pull request Aug 2, 2024
…the footer (#6117)

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* no longer write inline column metadata

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add some more documentation

* remove write_metadata from PageWriter

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
alamb added a commit that referenced this pull request Aug 6, 2024
* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* Upgrade protobuf definitions to flightsql 17.0 (#6133)

* Update FlightSql.proto to version 17.0

Adds new message CommandStatementIngest and removes `experimental` from
other messages.

* Regenerate flight sql protocol

This upgrades the file to version 17.0 of the protobuf definition.

* Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata`

* fix loading in test by etseidl

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* add rough equivalence test

* one more check

* make clippy happy

* separate tests that require arrow into a separate module

* add histograms to to_thrift()

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Douglas Anderson <djanderson@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@live.com>