parquet: Add support for row group pruning on FixedSizeBinary #9646

Merged 2 commits into apache:main on Mar 19, 2024

Conversation

@progval (Contributor) commented Mar 16, 2024

Which issue does this PR close?

Closes #9645.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

It should be invisible other than a performance improvement.

@github-actions bot added the core (Core DataFusion crate) label on Mar 16, 2024
@progval marked this pull request as ready for review on March 19, 2024 at 16:46
@alamb (Contributor) left a comment

Thank you @progval -- as always this is looking quite good. However, I have a suggestion about data validation / error checking that I think is worth discussing prior to merging.

Some(DataType::Decimal128(precision, scale)) => {
    Some(ScalarValue::Decimal128(
        Some(from_bytes_to_i128(s.$bytes_func())),
        *precision,
        *scale,
    ))
}
Some(DataType::FixedSizeBinary(size)) => {
    Some(ScalarValue::FixedSizeBinary(
        *size,
        Some(s.$bytes_func().to_vec()),
    ))
}
@alamb (Contributor) commented Mar 19, 2024

Should we verify that the resulting bytes have the correct size? That is, verify that *size is the same as s.$bytes_func().to_vec().len()?

If not, I suggest we ignore the error (but not use the statistics value), the same way as is done for non-UTF-8 values in string statistics. Perhaps we can at least log when this happens with debug!.

Since this data can come from any arbitrary Parquet writer, I think it is good practice to validate what is in there.
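
To make the suggestion concrete, here is a minimal standalone sketch (not the PR's actual macro code) of the length check with a NULL fallback, assuming the ScalarValue::FixedSizeBinary(i32, Option<Vec<u8>>) variant from datafusion_common and the log crate:

use datafusion_common::ScalarValue;

/// Hypothetical helper: build a FixedSizeBinary statistics value only when
/// the raw bytes match the declared width; otherwise fall back to NULL so
/// range analysis ignores the statistic instead of erroring.
fn fixed_size_binary_stat(size: i32, bytes: &[u8]) -> ScalarValue {
    if bytes.len() == size as usize {
        ScalarValue::FixedSizeBinary(size, Some(bytes.to_vec()))
    } else {
        // Statistics written by an arbitrary Parquet writer may be malformed;
        // log at debug level and treat the value as NULL.
        log::debug!(
            "FixedSizeBinary statistics value has {} bytes, expected {}",
            bytes.len(),
            size
        );
        ScalarValue::FixedSizeBinary(size, None)
    }
}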

@progval (Contributor, Author) replied

Oh sure, good catch.

The code for strings is:

let s = std::str::from_utf8(s.$bytes_func())
    .map(|s| s.to_string())
    .ok();
Some(ScalarValue::Utf8(s))

which looks like it's treated as a NULL, rather than ignored as a statistics value. Should I do the same here?

@alamb (Contributor) replied

Yes, I think converting to NULL is the correct behavior. In this case that results in ignoring the value in terms of range analysis. More details are here if you are curious:

https://github.com/apache/arrow-datafusion/blob/8074ca1e758470319699a562074290906003b312/datafusion/core/src/physical_optimizer/pruning.rs#L85-L87
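
For intuition, a purely hypothetical illustration (not DataFusion's API) of why a NULL statistic simply disables pruning for that row group: without known bounds the predicate cannot prove that no rows match, so the row group is kept.

// Hypothetical sketch: prune only when both bounds are known and exclude the literal.
fn can_prune(min: Option<i64>, max: Option<i64>, literal: i64) -> bool {
    match (min, max) {
        (Some(min), Some(max)) => literal < min || literal > max,
        // NULL statistics: nothing can be proven, so keep the row group.
        _ => false,
    }
}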

@progval (Contributor, Author) replied

I see, done. I also added a log statement for strings, for consistency.
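
As a hedged sketch (not the merged code) of what that could look like for the string path, again assuming the log crate:

use datafusion_common::ScalarValue;

/// Hypothetical helper mirroring the string statistics path: invalid UTF-8
/// becomes a NULL Utf8 scalar, with a debug-level log for visibility.
fn utf8_stat(bytes: &[u8]) -> ScalarValue {
    let s = std::str::from_utf8(bytes).map(|s| s.to_string()).ok();
    if s.is_none() {
        log::debug!("string statistics value is not valid UTF-8; ignoring it");
    }
    ScalarValue::Utf8(s)
}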

@alamb (Contributor) left a comment

LGTM -- thank you @progval

@alamb merged commit ad8d552 into apache:main on Mar 19, 2024
23 checks passed
Labels: core (Core DataFusion crate)

Successfully merging this pull request may close these issues:

Add support for Statistics and Bloom filters on FixedSizeBinary columns in Parquet tables (#9645)