
Update to arrow/parquet 11.0 #2048

Merged — 3 commits merged Mar 22, 2022
Conversation

alamb (Contributor) commented Mar 21, 2022

Update datafusion to latest arrow and parquet release to unblock things like #1990 from @yjshen

@github-actions bot added the ballista and datafusion (Changes in the datafusion crate) labels Mar 21, 2022
alamb (Contributor, Author) commented Mar 21, 2022

I think this will need some of the changes from #1990 to datafusion/src/physical_plan/file_format/parquet.rs in order to compile. I'll try to get around to it later this week if no one else beats me to it.

yjshen (Member) commented Mar 21, 2022

Yes, I've introduced a parquet row group filtering API change in parquet 11. I can port that part from #1990 to your branch.
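For readers unfamiliar with the shape of per-row-group filtering, here is a minimal, self-contained sketch of the idea. The `RowGroupMetaData` struct and its `num_rows` field are simplified stand-ins for illustration, not the actual parquet crate types:

```rust
// Simplified stand-in for parquet's RowGroupMetaData (hypothetical field).
struct RowGroupMetaData {
    num_rows: i64,
}

// A row-group predicate decides, per group, whether that group should be read.
type ReadGroupPredicate = Box<dyn FnMut(&RowGroupMetaData, usize) -> bool>;

// Apply the predicate to each row group, keeping the indices of selected groups.
fn filter_row_groups(groups: &[RowGroupMetaData], mut pred: ReadGroupPredicate) -> Vec<usize> {
    let mut kept = Vec::new();
    for (i, g) in groups.iter().enumerate() {
        if pred(g, i) {
            kept.push(i);
        }
    }
    kept
}

fn main() {
    let groups = vec![
        RowGroupMetaData { num_rows: 100 },
        RowGroupMetaData { num_rows: 0 },
        RowGroupMetaData { num_rows: 50 },
    ];
    // Skip empty row groups.
    let kept = filter_row_groups(&groups, Box::new(|g, _i| g.num_rows > 0));
    println!("{:?}", kept); // prints [0, 2]
}
```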

@@ -262,7 +262,7 @@ async fn prune_int32_scalar_fun() {
     println!("{}", output.description());
     // This should prune out groups with error, because there is not col to
     // prune the row groups.
-    assert_eq!(output.predicate_evaluation_errors(), Some(1));
+    assert_eq!(output.predicate_evaluation_errors(), Some(4));
yjshen (Member) commented Mar 21, 2022:

We are evaluating the filter for each row group now. I think it's an expected change for the number of evaluation errors.
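In other words, assuming the test file has four row groups (which the new expected value Some(4) suggests), a predicate that fails to evaluate is now counted once per row group rather than once per file. A toy model of the accounting:

```rust
// Toy model: a predicate that always errors (e.g. it references a missing
// column) produces one evaluation error each time it is invoked.
fn evaluation_errors(row_groups: usize, evaluate_per_group: bool) -> usize {
    if evaluate_per_group {
        row_groups // new behaviour: the predicate runs once per row group
    } else {
        1 // old behaviour: the predicate ran once for the whole file
    }
}

fn main() {
    assert_eq!(evaluation_errors(4, false), 1); // matches the old Some(1)
    assert_eq!(evaluation_errors(4, true), 4); // matches the new Some(4)
    println!("ok");
}
```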

alamb (Contributor, Author) replied:

yeah, I agree.

alamb (Contributor, Author) left a review comment:

The changes to the parquet reader look good to me @yjshen -- thank you. I can't really approve my own PR, but I'll leave this one up until tomorrow to see if there is any more feedback.

Otherwise we can merge it in

🚀

        row_group_metadata,
        parquet_schema,
    };
    let predicate_values = pruning_predicate.prune(&pruning_stats);
alamb (Contributor, Author) commented:
There is probably some overhead here from calling prune once per row group rather than once per file, but I think it will be OK and we can optimize it further in the future if it shows up in traces.
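To make the trade-off concrete, here is a toy comparison of the two shapes; the `GroupStats` type and the min/max containment predicate are hypothetical stand-ins for DataFusion's PruningStatistics and PruningPredicate. Both forms produce the same answer; the difference is that the per-group form pays any fixed per-call setup cost once per row group:

```rust
// Hypothetical per-row-group min/max statistics for a single column.
struct GroupStats {
    min: i32,
    max: i32,
}

// Columnar style: one call sees every row group and returns one Vec<bool>.
fn prune_per_file(stats: &[GroupStats], value: i32) -> Vec<bool> {
    // One pass over all the statistics; any setup cost is paid once.
    stats.iter().map(|s| s.min <= value && value <= s.max).collect()
}

// Per-group style: the same predicate invoked separately for each row group.
fn prune_per_group(stats: &[GroupStats], value: i32) -> Vec<bool> {
    stats
        .iter()
        .map(|s| {
            // Any fixed setup cost inside the call is paid once per group here.
            prune_per_file(std::slice::from_ref(s), value)[0]
        })
        .collect()
}

fn main() {
    let stats = vec![
        GroupStats { min: 0, max: 10 },
        GroupStats { min: 20, max: 30 },
    ];
    assert_eq!(prune_per_file(&stats, 5), prune_per_group(&stats, 5));
    println!("{:?}", prune_per_file(&stats, 5)); // prints [true, false]
}
```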

Contributor commented:

Yeah... I just stumbled across this whilst updating #1617 - in IOx we found the prune method had non-trivial overheads when run in a non-columnar fashion as this is doing. Admittedly that was likely with more containers than there are likely to be row groups in a file.

I do wonder if we need to take a step back from extending the parquet arrow-rs interface and take a more holistic look at what the desired end-state should be. I worry a bit that we're painting ourselves into a corner; I'll see if I can get my thoughts into some tickets.

yjshen (Member) commented Mar 22, 2022:
How about we change ReadOptions like:

pub struct ReadOptions {
    predicates: Vec<Box<dyn Fn(&[RowGroupMetaData]) -> Vec<bool>>>,
}

tustvold (Contributor) commented Mar 22, 2022:

That would definitely be one option, but I'm not sure why it needs to be lazy. SerializedFileReader already exposes the ParquetMetadata which in turn exposes the [RowGroupMetaData]. Why wouldn't the caller just specify the row groups to scan, much like it specifies the column indexes for a projection? Would this not be both simpler and more flexible?
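A sketch of that eager alternative, with illustrative types rather than the parquet crate's real API: the caller reads the metadata, prunes up front, and hands the reader a plain list of row-group indices, exactly as it passes column indices for a projection:

```rust
// Illustrative file metadata: number of rows in each row group.
struct FileMetadata {
    row_group_rows: Vec<i64>,
}

// Illustrative scan request: explicit selections, no callbacks.
struct Scan {
    row_groups: Vec<usize>, // which row groups to read
    columns: Vec<usize>,    // column projection, same idea
}

fn main() {
    let meta = FileMetadata { row_group_rows: vec![100, 0, 50] };

    // The caller prunes eagerly from the already-available metadata...
    let row_groups: Vec<usize> = meta
        .row_group_rows
        .iter()
        .enumerate()
        .filter(|(_, n)| **n > 0)
        .map(|(i, _)| i)
        .collect();

    // ...and hands the reader a plain list of indices.
    let scan = Scan { row_groups, columns: vec![0, 2] };
    println!("{:?}", scan.row_groups); // prints [0, 2]
}
```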


$self
    .row_group_metadata
    .column(column_index)
    .statistics()
Contributor commented:

👍

Labels: datafusion (Changes in the datafusion crate)
Projects: none yet
4 participants