Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust,py): add additional control to write_parquet::statistics parameter #16575

Merged
merged 1 commit into from
Jun 3, 2024

Conversation

coastalwhite
Copy link
Collaborator

Adds additional control over which statistics are written into Parquet files through the write_parquet parameter statistics.

It is now possible to specify "full" to also attempt to add the distinct_count statistic (currently only added for Booleans). It is also possible to give a dict[str, bool] to specify individual statistics min, max, distinct_count and null_count.

Fixes #16441

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature rust Related to Rust Polars labels May 29, 2024
@coastalwhite coastalwhite changed the title feat(rust,py): add additional control to `write_parquetstatistics param feat(rust,py): add additional control to write_parquet::statistics parameter May 29, 2024
@coastalwhite coastalwhite force-pushed the write-statistics branch 3 times, most recently from 8c38cfa to 7ef6f45 Compare May 29, 2024 13:57
@coastalwhite
Copy link
Collaborator Author

I think the CI is having a small problem.

Copy link

codecov bot commented May 29, 2024

Codecov Report

Attention: Patch coverage is 63.92573% with 136 lines in your changes are missing coverage. Please review.

Project coverage is 81.46%. Comparing base (0eb8384) to head (a8b7b7c).
Report is 48 commits behind head on main.

Files Patch % Lines
.../polars-parquet/src/arrow/write/fixed_len_bytes.rs 50.00% 44 Missing ⚠️
crates/polars-parquet/src/arrow/write/mod.rs 58.33% 35 Missing ⚠️
...tes/polars-parquet/src/arrow/write/binary/basic.rs 0.00% 24 Missing ⚠️
...rates/polars-parquet/src/arrow/write/dictionary.rs 20.00% 24 Missing ⚠️
py-polars/src/conversion/mod.rs 81.25% 3 Missing ⚠️
crates/polars-compute/src/min_max/scalar.rs 90.90% 2 Missing ⚠️
...es/polars-parquet/src/arrow/write/binary/nested.rs 0.00% 2 Missing ⚠️
crates/polars-compute/src/distinct_count.rs 87.50% 1 Missing ⚠️
crates/polars-parquet/src/arrow/write/sink.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16575      +/-   ##
==========================================
- Coverage   81.49%   81.46%   -0.03%     
==========================================
  Files        1414     1416       +2     
  Lines      185561   186837    +1276     
  Branches     2997     3021      +24     
==========================================
+ Hits       151219   152213     +994     
- Misses      33826    34091     +265     
- Partials      516      533      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@coastalwhite coastalwhite force-pushed the write-statistics branch 2 times, most recently from 07f969a to 5a335d0 Compare May 31, 2024 14:39
@coastalwhite coastalwhite force-pushed the write-statistics branch 2 times, most recently from a2c1c04 to 0d12846 Compare June 3, 2024 08:18
Copy link

codspeed-hq bot commented Jun 3, 2024

CodSpeed Performance Report

Merging #16575 will not alter performance

Comparing coastalwhite:write-statistics (a8b7b7c) with main (534b655)

Summary

✅ 37 untouched benchmarks

Adds additional control over which statistics are written into Parquet files
through the `write_parquet` parameter `statistics`.

It is now possible to specify `"full"` to also attempt to add the
`distinct_count` statistic (currently only added for `Booleans`). It is also
possible to give a `dict[str, bool]` to specify individual statistics `min`,
`max`, `distinct_count` and `null_count`.

Fixes pola-rs#16441
@coastalwhite coastalwhite self-assigned this Jun 3, 2024
.flatten()
.min_by(|x, y| ord_binary(x, y))
.map(|x| x.to_vec()),
max_value: options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we depend on polars-compute, can we use the kernels directly here? There are SIMD :)

.map(|x| x.to_vec())
})
.flatten(),
min_value: options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

.flatten()
.min_by(|x, y| ord_binary(x, y))
.map(|x| x.to_vec()),
max_value: options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

.map(|x| x.to_vec()),
max_value: options
.max_value
.then(|| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we got a kernel for this one. 🤔

@ritchie46 ritchie46 merged commit 9223b10 into pola-rs:main Jun 3, 2024
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change DataFrame.write_parquet(write_statistics) to a more granular type
2 participants