Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for level histograms added in PARQUET-2261 to ParquetMetaData #6105

Merged
merged 27 commits into from
Jul 26, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 23, 2024

Which issue does this PR close?

Closes #5022.

Rationale for this change

Final piece of the new SizeStatistics. Supercedes #5486.

What changes are included in this PR?

Adds repetition and definition level histograms to ParquetMetaData and to the PageIndex struct (representation of the thrift ColumnIndex).

Are there any user-facing changes?

Yes.

BugenZhao and others added 17 commits July 16, 2024 18:05
)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
…ally copies data (apache#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy
* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.
* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays
* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…MetaData` (apache#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData
@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 23, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- I think this look really nice

The only thing I could think that we might want to explore is using a rust struct rather than a Option<Vec<i64> so it could be easier to use / document. I'll make a PR to explore what that looks like

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

The clippy failure is likely due to the new rust version which was released today. I fixed it on main. Also since we have made a 52.2.0 release candidate I think we can move 53.0.0 development off of 53.0.0-dev to main. I will create a PR to do so

@etseidl
Copy link
Contributor Author

etseidl commented Jul 25, 2024

The only thing I could think that we might want to explore is using a rust struct rather than a Option<Vec<i64> so it could be easier to use / document. I'll make a PR to explore what that looks like

❤️

Yes, while I was reviewing my changes I was thinking the same thing. A histogram class that supports updates, unions, and concatenation.

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Update for anyone following along: #6126 is the merge of the dev branch to main

@etseidl
Copy link
Contributor Author

etseidl commented Jul 25, 2024

Merge went ok...once #6126 merges I'll rebase this on master.

@alamb alamb deleted the branch apache:master July 26, 2024 10:11
@alamb alamb closed this Jul 26, 2024
@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

I merged the 53 dev branch ~and that seems to have closed this PR` -- any chance you can retarget main?

Update: I restored the branch

@alamb alamb reopened this Jul 26, 2024
@alamb alamb changed the base branch from 53.0.0-dev to master July 26, 2024 11:16
@alamb alamb changed the base branch from master to 53.0.0-dev July 26, 2024 11:17
@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Hi @etseidl -- could you please retarget this PR at the main branch so I can merge it in?

I also made etseidl#1 for your consideration (that make the level histogram a struct). We could also merge this PR as is and consider that as a follow on.

Thank you for bearing with me here.

@etseidl etseidl changed the base branch from 53.0.0-dev to master July 26, 2024 15:15
@etseidl
Copy link
Contributor Author

etseidl commented Jul 26, 2024

Hi @etseidl -- could you please retarget this PR at the main branch so I can merge it in?

done

I also made etseidl#1 for your consideration (that make the level histogram a struct). We could also merge this PR as is and consider that as a follow on.

@alamb I too created a LevelHistogram struct last night...yours is much more elegant 😄. I'm game to merge your PR into my branch before merging this into master. Or you can submit yours as a follow on PR. Whichever is easier for you. Nevermind, your changes were too nice to resist, so I merged them in.

Thank you for bearing with me here.

Thank you for all the help and for the free Rust schooling 😄

Comment on lines +1229 to +1232
if let Some(ref rep_lvl_hist) = repetition_level_histogram {
let hist = self.repetition_level_histograms.get_or_insert(Vec::new());
hist.reserve(rep_lvl_hist.len());
hist.extend(rep_lvl_hist.values());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb should we also replace the vectors in ColumnIndexBuilder with LevelHistogram? We could add an append() method to LevelHistogram to reduce some code duplication here.

Same question for PageIndex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth with this -- I think I was thinking this was building the thrift structure so we should use the thrift representation (Vec<i64>) but I may have gotten that wrong as I find the naming quite confusing

How about we explore doing it as a follow on PR?

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Let's do this!

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Thank you for all the help and for the free Rust schooling 😄

That is half the fun of working on these projects -- I see a lot of cool things while reviewing / working on PRs

@alamb alamb merged commit b06ffce into apache:master Jul 26, 2024
16 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Thanks again @etseidl -- I merged this to main and let's keep working to improve things there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tracking for Parquet size statistics
7 participants