Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File pruning does not occur on partition columns #1175

Closed
Blajda opened this issue Feb 24, 2023 · 1 comment · Fixed by #1179
Closed

File pruning does not occur on partition columns #1175

Blajda opened this issue Feb 24, 2023 · 1 comment · Fixed by #1179
Labels
bug Something isn't working

Comments

@Blajda
Copy link
Collaborator

Blajda commented Feb 24, 2023

Environment

Delta-rs version: Main

Binding: rust


Bug

What happened:
When using Datafusion, statistics from the delta log are used to prune which files are scanned.
Delta logs do not contain statistics for partition columns in the regular stats columns but store the partition values in a separate key.

What you expected to happen:
Partition information should be used while pruning.

How to reproduce it:

This test which currently fails demonstrates that the scan visits all files for a partition that does not exist in the table.
The implementation also needs to consider null partitions are handled correctly.

#[tokio::test]
async fn test_datafusion_prune_partitioned() -> Result<()> {
    use datafusion::prelude::*;
    // Validate that partition information in include in table statistics
    let table = deltalake::open_table("./test/data/delta-2.2.0-partitioned-types")
        .await
        .unwrap();
    let statistics = table.datafusion_table_statistics();


    assert_eq!(statistics.num_rows, Some(3),);
    assert_eq!(statistics.total_byte_size, Some(452 * 3));

    async fn get_scan_metrics(table: &DeltaTable, state: &SessionState, e: Expr) -> Result<ExecutionMetricsCollector> {
        let mut metrics = ExecutionMetricsCollector::default();
        let scan = table.scan(state, None, &[e], None).await?;
        if scan.output_partitioning().partition_count() > 0 {
            let plan = CoalescePartitionsExec::new(scan);
            let task_ctx = Arc::new(TaskContext::from(state));
            let _result = collect(plan.execute(0, task_ctx)?).await?;
            visit_execution_plan(&plan, &mut metrics).unwrap();
        }

        return Ok(metrics);
    }

    let ctx = SessionContext::new();
    let state = ctx.state();

    let e = col("c1").eq(lit(1));
    let metrics = get_scan_metrics(&table, &state, e).await?;
    println!("{:?}", metrics.scanned_files);
    assert!(metrics.num_scanned_files() == 0);

    let e = col("c1").eq(lit(4));
    let metrics = get_scan_metrics(&table, &state, e).await?;
    println!("{:?}", metrics.scanned_files);
    assert!(metrics.num_scanned_files() == 1);

    let e = col("c3").eq(lit(4));
    let metrics = get_scan_metrics(&table, &state, e).await?;
    println!("{:?}", metrics.scanned_files);
    assert!(metrics.num_scanned_files() == 1);

    let e = col("c3").eq(lit(0));
    let metrics = get_scan_metrics(&table, &state, e).await?;
    println!("{:?}", metrics.scanned_files);
    assert!(metrics.num_scanned_files() == 0);


    Ok(())
} 
@Blajda Blajda added the bug Something isn't working label Feb 24, 2023
@Blajda
Copy link
Collaborator Author

Blajda commented Feb 24, 2023

In this particular case it can be fixed by updating the PruningStatics implementation to check if a column is partition column and then handle it. I think this would be a shallow fix and maybe a better involves updating get_stats directly on the DeltaTable

wjones127 pushed a commit that referenced this issue Mar 10, 2023
# Description
Exposes partition columns in Datafusion's `PruningStatistics` which will
reduce the number of files scanned when the table is queried.

This also resolves another partition issues where involving `null`
partitions. Previously `ScalarValue::Null` was used which would cause an
error when the actual datatype was obtained from the physical parquet
files.

# Related Issue(s)
- closes #1175
chitralverma pushed a commit to chitralverma/delta-rs that referenced this issue Mar 17, 2023
# Description
Exposes partition columns in Datafusion's `PruningStatistics` which will
reduce the number of files scanned when the table is queried.

This also resolves another partition issues where involving `null`
partitions. Previously `ScalarValue::Null` was used which would cause an
error when the actual datatype was obtained from the physical parquet
files.

# Related Issue(s)
- closes delta-io#1175
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant