
Incorrect results due to repartitioning a sorted ParquetExec #8451

Closed
alamb opened this issue Dec 7, 2023 · 5 comments · Fixed by #8517

alamb (Contributor) commented Dec 7, 2023

Describe the bug

We have a case where the EnforceDistribution rule repartitioned a ParquetExec to parallelize the read (which is good), but that parallelization destroyed the sort order (it mixes parts of different files together in the same partition). The rest of the plan relies on the output being sorted, so now that the output is no longer sorted we see incorrect results.

To Reproduce

The input plan looks like this:

OutputRequirementExec
  ProjectionExec: expr=[tag@1 as tag]
    FilterExec: CAST(field@0 AS Utf8) !=
      ProjectionExec: expr=[field@1 as field, tag@3 as tag]
        DeduplicateExec: [tag@3 ASC,time@2 ASC]
          FilterExec: tag@3 > foo AND time@2 > 2
            ParquetExec: file_groups={2 groups: [[1.parquet], [2.parquet]]}, projection=[__chunk_order, field, time, tag], output_ordering=[tag@3 ASC, time@2 ASC, __chunk_order@0 ASC], ...

The output of EnforceDistribution looks like this:

2023-12-06T18:40:19.827226Z TRACE datafusion::physical_planner: Optimized physical plan by EnforceDistribution:
OutputRequirementExec
  ProjectionExec: expr=[tag@1 as tag]
    FilterExec: CAST(field@0 AS Utf8) !=
      RepartitionExec: partitioning=RoundRobinBatch(6), input_partitions=1
        ProjectionExec: expr=[field@1 as field, tag@3 as tag]
          DeduplicateExec: [tag@3 ASC,time@2 ASC]
            SortPreservingMergeExec: [tag@3 ASC,time@2 ASC,__chunk_order@0 ASC] <----- This needs the input to be sorted
              FilterExec: tag@3 > foo AND time@2 > 2
                ParquetExec: file_groups={6 groups: [[1.parquet:0..1, 2.parquet:0..16666666], [2.parquet:16666666..33333333], [2.parquet:33333333..50000000], [2.parquet:50000000..66666667], [2.parquet:66666667..83333334], ...]}, ... <---- this file is no longer sorted (as it was repartitioned)

Specifically, the DataFusion planner parallelized the read of the parquet files into multiple partitions and, in doing so, destroyed the sort order.

(an annotation like 16666666..33333333 means "read that byte range of the file")

The ParquetExec actually reflects this correctly (it no longer reports an output_ordering because its output is no longer sorted). However, the plan now has a SortPreservingMergeExec above it, which assumes its input is sorted, and that assumption is now wrong.

Input

ParquetExec: file_groups={2 groups: [[1.parquet], [2.parquet]]}, projection=[__chunk_order, field, time, tag], output_ordering=[tag@3 ASC, time@2 ASC, __chunk_order@0 ASC],....

Output:

ParquetExec: file_groups={6 groups: [[1.parquet:0..1, 2.parquet:0..16666666], [2.parquet:16666666..33333333], [2.parquet:33333333..50000000], [2.parquet:50000000..66666667], [2.parquet:66666667..83333334], ...]}, ...

So things that are wrong:

  1. The output of the scan is no longer sorted, but it is merged with a SortPreservingMergeExec (so the plan never re-sorts the data).
  2. The sorted input files should not have been repartitioned into multiple partitions in the first place, as that destroys the sort order. There is a config setting that is supposed to control this, datafusion.optimizer.prefer_existing_sort, and IOx sets it to true (a sketch of setting it follows this list).
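
For reference, here is a minimal sketch of setting this option programmatically. It assumes a recent DataFusion release where SessionConfig::set_bool and SessionContext::new_with_config are available; the string key is the documented option name.

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Ask the planner to prefer plans that keep an existing sort order
    // rather than repartitioning in a way that would require a re-sort.
    let config = SessionConfig::new()
        .set_bool("datafusion.optimizer.prefer_existing_sort", true);
    let ctx = SessionContext::new_with_config(config);
    let _ = ctx; // register tables and run queries as usual
}
```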

I am working on a reproducer in DataFusion

Expected behavior

The correct answer should be produced.

I think this means that either:

  1. the ParquetExec should not be repartitioned if doing so would destroy the sort order, or
  2. the repartitioning code should be aware of the existing sort order and split the files in a way that preserves it (see the sketch after this list).
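
As an illustration of option 2, here is a minimal sketch (not DataFusion's actual repartitioning code) of an order-preserving split: a single sorted file is divided into contiguous byte ranges and each range becomes its own group, so no group ever concatenates out-of-order data. split_file_preserving_order is a hypothetical name.

```rust
// Hypothetical helper: split one sorted file of `file_len` bytes into `n`
// contiguous, in-order byte ranges (one range per output partition).
fn split_file_preserving_order(file_len: u64, n: u64) -> Vec<(u64, u64)> {
    let chunk = (file_len + n - 1) / n; // ceiling division
    (0..n)
        .map(|i| (i * chunk, ((i + 1) * chunk).min(file_len)))
        .filter(|(start, end)| start < end)
        .collect()
}

fn main() {
    // A 100 MB file split six ways, similar to how 2.parquet is split above.
    for (start, end) in split_file_preserving_order(100_000_000, 6) {
        println!("2.parquet:{start}..{end}");
    }
}
```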

Additional context

We found that setting the config setting datafusion.optimizer.repartition_file_scans to false (which IOx now does) was a workaround.
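
A minimal sketch of applying the workaround, under the same assumptions as the configuration sketch above (the string key is the documented option name):

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Disable splitting of file scans so a sorted file is never broken up
    // and mixed with other files inside a single partition.
    let config = SessionConfig::new()
        .set_bool("datafusion.optimizer.repartition_file_scans", false);
    let ctx = SessionContext::new_with_config(config);
    let _ = ctx;
}
```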

alamb added the bug label Dec 7, 2023
alamb self-assigned this Dec 7, 2023
alamb (Contributor, Author) commented Dec 7, 2023

This could be more subtle: splitting by itself doesn't destroy the sort order. What destroys the sort order is a file group that has more than one entry whose byte ranges are not contiguous within the same source file, because the entries in a group are effectively appended to one another.

In this case, one group contains portions of two different files, which is what causes the wrong results.
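
To make the invariant concrete, here is a minimal sketch using hypothetical types (FileRange and group_preserves_order are illustrative names, not DataFusion's internal representation): a file group preserves the per-file sort order only if consecutive entries come from the same file and their byte ranges are contiguous.

```rust
/// One entry in a file group: a byte range of a single source file.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    path: String,
    start: u64,
    end: u64,
}

/// Entries in a group are concatenated one after another, so the group keeps
/// the original sort order only if every adjacent pair of entries reads the
/// same file and the ranges line up end-to-start.
fn group_preserves_order(group: &[FileRange]) -> bool {
    group
        .windows(2)
        .all(|pair| pair[0].path == pair[1].path && pair[0].end == pair[1].start)
}

fn main() {
    // Mirrors the bad plan above: 1.parquet and part of 2.parquet in one group.
    let bad_group = vec![
        FileRange { path: "1.parquet".into(), start: 0, end: 1 },
        FileRange { path: "2.parquet".into(), start: 0, end: 16_666_666 },
    ];
    // A single contiguous slice of one file is fine.
    let ok_group = vec![
        FileRange { path: "2.parquet".into(), start: 16_666_666, end: 33_333_333 },
    ];
    assert!(!group_preserves_order(&bad_group));
    assert!(group_preserves_order(&ok_group));
}
```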

alamb (Contributor, Author) commented Dec 8, 2023

I have made a reproducer on a branch and I know what is wrong -- I now need to work on the fix: https://github.com/alamb/arrow-datafusion/tree/alamb/bad_redistribution

alamb (Contributor, Author) commented Dec 11, 2023

Here is a PR that adds tests: #8505

alamb (Contributor, Author) commented Dec 11, 2023

Update: I have a bunch of tests and I understand the issue. I expect to have a PR up with a fix tomorrow.

alamb (Contributor, Author) commented Dec 12, 2023
