Extract parquet statistics to its own module, add tests #8294

Merged
merged 21 commits into from
Nov 29, 2023
d187e36
Extract parquet statistics to its own module, add tests
alamb Nov 21, 2023
7512c8b
Merge remote-tracking branch 'apache/main' into alamb/extract_parquet…
alamb Nov 27, 2023
d4e660a
Update datafusion/core/src/datasource/physical_plan/parquet/statistic…
alamb Nov 27, 2023
fd2aebc
rename enum
alamb Nov 27, 2023
96a42f9
Merge branch 'alamb/extract_parquet_statistics' of github.com:alamb/a…
alamb Nov 27, 2023
a128a20
Improve API
alamb Nov 27, 2023
b4009c2
Add test for reading struct array statistics
alamb Nov 27, 2023
ef79c42
Add test for column after statistics
alamb Nov 27, 2023
9b914db
improve tests
alamb Nov 27, 2023
b95dea9
simplify
alamb Nov 27, 2023
cd3c042
clippy
alamb Nov 27, 2023
ab95453
Update datafusion/core/src/datasource/physical_plan/parquet/statistic…
alamb Nov 27, 2023
0235a9e
Update datafusion/core/src/datasource/physical_plan/parquet/statistic…
alamb Nov 27, 2023
a601fbf
Add test showing incorrect statistics
alamb Nov 27, 2023
5c55302
Merge remote-tracking branch 'upstream/main' into alamb/extract_parqu…
tustvold Nov 28, 2023
06b5201
Rework statistics
tustvold Nov 28, 2023
641142b
Merge pull request #16 from tustvold/tustvold/extract_parquet_statistics
alamb Nov 28, 2023
e5cd8cf
Fix clippy
alamb Nov 28, 2023
b1666c2
Merge remote-tracking branch 'apache/main' into alamb/extract_parquet…
alamb Nov 28, 2023
7022691
Update documentation and make it clear the statistics are not publica…
alamb Nov 28, 2023
a5e235a
Add link to upstream arrow ticket
alamb Nov 28, 2023
24 changes: 2 additions & 22 deletions datafusion/core/src/datasource/physical_plan/parquet/mod.rs
@@ -66,6 +66,7 @@ mod metrics;
 pub mod page_filter;
 mod row_filter;
 mod row_groups;
+mod statistics;

 pub use metrics::ParquetFileMetrics;

@@ -506,6 +507,7 @@ impl FileOpener for ParquetOpener {
 let file_metadata = builder.metadata().clone();
 let predicate = pruning_predicate.as_ref().map(|p| p.as_ref());
 let mut row_groups = row_groups::prune_row_groups_by_statistics(
+    builder.parquet_schema(),
     file_metadata.row_groups(),
     file_range,
     predicate,
@@ -718,28 +720,6 @@ pub async fn plan_to_parquet(
     Ok(())
 }

-// Copy from the arrow-rs
-// https://github.com/apache/arrow-rs/blob/733b7e7fd1e8c43a404c3ce40ecf741d493c21b4/parquet/src/arrow/buffer/bit_util.rs#L55
-// Convert the byte slice to fixed length byte array with the length of 16
-fn sign_extend_be(b: &[u8]) -> [u8; 16] {
-    assert!(b.len() <= 16, "Array too large, expected less than 16");
-    let is_negative = (b[0] & 128u8) == 128u8;
-    let mut result = if is_negative { [255u8; 16] } else { [0u8; 16] };
-    for (d, s) in result.iter_mut().skip(16 - b.len()).zip(b) {
-        *d = *s;
-    }
-    result
-}
-
-// Convert the bytes array to i128.
-// The endian of the input bytes array must be big-endian.
-pub(crate) fn from_bytes_to_i128(b: &[u8]) -> i128 {
-    // The bytes array are from parquet file and must be the big-endian.
-    // The endian is defined by parquet format, and the reference document
-    // https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L66
-    i128::from_be_bytes(sign_extend_be(b))
-}
-
 // Convert parquet column schema to arrow data type, and just consider the
 // decimal data type.
 pub(crate) fn parquet_to_arrow_decimal_type(
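
For readers following the move of `from_bytes_to_i128` into the new `statistics` module, here is a minimal standalone sketch of what the two removed helpers compute. It mirrors the logic shown in the hunk above but is not the code that landed in the new module; the `main` function and the sample byte values are assumptions chosen purely for illustration. Like the removed helper, it indexes `b[0]` and so would panic on an empty slice.

```rust
/// Sign-extend a big-endian two's-complement byte slice to 16 bytes.
/// Mirrors the helper removed from mod.rs; panics on an empty slice.
fn sign_extend_be(b: &[u8]) -> [u8; 16] {
    assert!(b.len() <= 16, "Array too large, expected at most 16 bytes");
    // The top bit of the first (most significant) byte is the sign bit.
    let is_negative = (b[0] & 0x80) == 0x80;
    // Pre-fill with 0xFF for negative values and 0x00 otherwise, so the
    // padding bytes preserve the value's sign.
    let mut result = if is_negative { [0xFFu8; 16] } else { [0u8; 16] };
    // Copy the input into the low-order (rightmost) bytes.
    result[16 - b.len()..].copy_from_slice(b);
    result
}

/// Interpret a big-endian byte slice of up to 16 bytes as an i128.
fn from_bytes_to_i128(b: &[u8]) -> i128 {
    i128::from_be_bytes(sign_extend_be(b))
}

fn main() {
    // Illustrative inputs only (not taken from the PR).
    println!("{}", from_bytes_to_i128(&[0x01, 0x00])); // 256
    println!("{}", from_bytes_to_i128(&[0xFF])); // -1
    println!("{}", from_bytes_to_i128(&[0xFF, 0x85])); // -123
}
```

The sign extension matters because Parquet can store decimal values in fewer than 16 bytes; padding a negative value with zero bytes would turn it into a large positive number.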
@@ -39,9 +39,8 @@ use parquet::{
 };
 use std::sync::Arc;

-use crate::datasource::physical_plan::parquet::{
-    from_bytes_to_i128, parquet_to_arrow_decimal_type,
-};
+use crate::datasource::physical_plan::parquet::parquet_to_arrow_decimal_type;
+use crate::datasource::physical_plan::parquet::statistics::from_bytes_to_i128;
 use crate::physical_optimizer::pruning::{PruningPredicate, PruningStatistics};

 use super::metrics::ParquetFileMetrics;