Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ListingTable to use StatisticsConverter, remove redundant statistics extraction code #10924

Closed
wants to merge 5 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jun 14, 2024

Which issue does this PR close?

Closes #10923

Rationale for this change

The code that extracted statistics for ListingTable with parquet files:

  1. Was incomplete and did not handle all types (like strings)
  2. Did the same thing (extracted Row Group statistics from parquet files) but used different code than the ParquetExec
  3. Was inefficient (accumulated by creating single row record batches for each row group)

I also personally also found the previous code somewhat hard to follow

See #10923 for more details

What changes are included in this PR?

Update Parquet listing table to use the new StatisticsConverter API to create DataFusion statistics

Are these changes tested?

Yes, they are covered by existing tests

Are there any user-facing changes?

ListingFiles will now have more accurate statistics

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 14, 2024
ParquetStatistics::Int64(s) if DataType::Int64 == *fields[i].data_type() => {
if let Some(max_value) = &mut max_values[i] {
max_value
.update_batch(&[Arc::new(Int64Array::from_value(*s.max(), 1))])
Copy link
Contributor Author

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, even though update_batch is called, it first creates a single row array

}
}
let file_field = file_schema.field(file_idx);
let Some(converter) = StatisticsConverter::try_new(
Copy link
Contributor Author

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code now uses the well tested StatisticsConverter to extract statistics from the parquet file with the correct type of array in a single call

// Note in ASCII lower case is greater than upper case
assert_eq!(
c1_stats.max_value,
Precision::Exact(ScalarValue::from("bar"))
Copy link
Contributor Author

@alamb alamb Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strings were previously not handled, but are now properly handled by StatisticsConverter

datafusion/sqllogictest/test_files/explain.slt Outdated Show resolved Hide resolved
@@ -344,6 +344,24 @@ impl ColumnStatistics {
distinct_count: Precision::Absent,
}
}

/// Set the null count for the column
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions make creating ColumnStatitistic more ergonomic

@alamb alamb changed the title Update ListingTable to use StatisticsConverter Update ListingTable to use StatisticsConverter, remove redundant statistics extraction code Jun 17, 2024
@alamb alamb marked this pull request as ready for review June 17, 2024 16:25
@alamb
Copy link
Contributor Author

alamb commented Jun 24, 2024

Thank you for the review @xinlifoobar

@Ted-Jiang I wonder if you might be interested in reviewing this PR as well?

Update: let's work with #11068 (I didn't realize there was another PR)

Thanks @xinlifoobar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update ListingTable to use StatisticsConverter
2 participants