
Make it possible to only scan part of a parquet file in a partition #1990

Merged (16 commits) on Apr 14, 2022

Conversation

@yjshen (Member) commented Mar 11, 2022

Which issue does this PR close?

Part of #944

Rationale for this change

Open up the possibility of scanning only part of a parquet file in a task/partition.

What changes are included in this PR?

  • Add FileRange to PartitionedFile.
  • The file range is passed down to the parquet crate, which filters row groups according to their midpoint positions in the parquet file (see the sketch after this list).
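
For illustration, here is a minimal, runnable Rust sketch of the midpoint rule; the names `FileRange`, `RowGroupInfo`, and `row_group_in_range` are illustrative, not the exact types in this PR or the parquet crate. A partition keeps only the row groups whose midpoint lands inside its assigned byte range, so each row group is read by exactly one partition even when a range boundary cuts through a row group.

```rust
/// Byte range of a file assigned to one partition/task (illustrative).
#[derive(Debug, Clone, Copy)]
struct FileRange {
    start: i64,
    end: i64,
}

/// Minimal stand-in for the row-group metadata needed here: the row group's
/// starting byte offset and its compressed size within the file.
struct RowGroupInfo {
    file_offset: i64,
    compressed_size: i64,
}

/// Keep a row group if its midpoint lies in [range.start, range.end).
fn row_group_in_range(rg: &RowGroupInfo, range: &FileRange) -> bool {
    let mid = rg.file_offset + rg.compressed_size / 2;
    mid >= range.start && mid < range.end
}

fn main() {
    // Two partitions split a 1000-byte file in half; each row group is picked
    // up by exactly one partition, because only one half contains its midpoint.
    let row_groups = [
        RowGroupInfo { file_offset: 0, compressed_size: 400 },
        RowGroupInfo { file_offset: 400, compressed_size: 600 },
    ];
    let first_half = FileRange { start: 0, end: 500 };
    let second_half = FileRange { start: 500, end: 1000 };

    for (i, rg) in row_groups.iter().enumerate() {
        println!(
            "row group {i}: in first half = {}, in second half = {}",
            row_group_in_range(rg, &first_half),
            row_group_in_range(rg, &second_half)
        );
    }
}
```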

Are there any user-facing changes?

No.

github-actions bot added the ballista and datafusion labels Mar 11, 2022
yjshen mentioned this pull request Mar 11, 2022
yjshen added the api change label Mar 11, 2022
@alamb (Contributor) commented Mar 12, 2022

cc @tustvold

@tustvold (Contributor) commented Mar 12, 2022

I'm not sure I follow how this will work: parquet files have a block structure internally that is not amenable to seeking. In particular, with RLE data it is common for a column chunk to consist of a single page. Could you maybe expand a bit on this?

On a more holistic level, is there some prior art on parallelising parquet reads? I've only ever encountered file-level, and rarely row-group-level, parallelism...

@yjshen (Member, Author) commented Mar 12, 2022

Hi @tustvold, the filter is based on the row-group midpoint position. It was introduced recently in the parquet crate with apache/arrow-rs@2bca71e. The midpoint filtering is modeled after ParquetSplit and MetadataConverter.

Row-group-level parallelism for Parquet is used in MapReduce and Spark. In Spark, splitFiles is used to generate task partitions based on partition-size settings, and it may assign parts of a bigger parquet file to different partitions.

Currently, this PR is still a WIP, since only the physical plan changes are implemented. We translate the Spark physical plan to a DataFusion physical plan to run natively in DataFusion: https://github.com/blaze-init/spark-blaze-extension/blob/master/src/main/scala/org/apache/spark/sql/blaze/plan/NativeParquetScanExec.scala#L57-L63
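
For context, a rough sketch of the splitFiles idea described above (in Rust rather than Scala, and not Spark's or this PR's actual code; the name `split_file` and the omission of Spark's open-cost padding are simplifications): chop each file into byte ranges of at most a target size so that one large parquet file can be assigned to several partitions, each of which later keeps only the row groups whose midpoints land in its range.

```rust
/// Split a file of `file_size` bytes into contiguous byte ranges of at most
/// `max_split_bytes` each, returned as (start, end) pairs.
fn split_file(file_size: u64, max_split_bytes: u64) -> Vec<(u64, u64)> {
    let mut splits = Vec::new();
    let mut offset = 0u64;
    while offset < file_size {
        let end = (offset + max_split_bytes).min(file_size);
        splits.push((offset, end));
        offset = end;
    }
    splits
}

fn main() {
    // A 1 GiB file with a 128 MiB target split size yields 8 byte ranges.
    let splits = split_file(1 << 30, 128 << 20);
    assert_eq!(splits.len(), 8);
    println!("{splits:?}");
}
```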

@tustvold (Contributor) commented Mar 12, 2022

Oh, I think I misunderstood: this is using byte ranges to filter which row groups to scan, not to filter the rows within the row groups? That makes sense, and sounds like a useful addition 👍

@yjshen (Member, Author) commented Mar 12, 2022

Yes, it filters row groups, based on the RowGroup metadata as well.

@liukun4515 (Contributor) commented

@yjshen I am very interested in discussing and participating in the parallelism of physical execution.
Oracle has a great parallelism feature called parallel execution.
In Oracle, we can use a different DOP (degree of parallelism) via hints or other configuration.

I have some questions about this task:

  1. How do we determine the DOP for the query?
  2. How do we determine the size of each task?

@alamb (Contributor) commented Mar 20, 2022

@liukun4515 there are already some configuration settings related to parallelism:

target_partitions (similar to a degree of parallelism)
https://github.com/apache/arrow-datafusion/blob/4994eda81c2280fa78aea1ae0d92ce918947eebd/datafusion/src/execution/context.rs#L822-L823

batch_size (which controls the size of the record batches that are processed)
https://github.com/apache/arrow-datafusion/blob/4994eda81c2280fa78aea1ae0d92ce918947eebd/datafusion/src/execution/context.rs#L909-L914

I am not sure how well these two parameters are respected by all DataFusion operators, but I think the configuration settings are reasonable.
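
To show how these two settings are typically applied, here is a minimal sketch assuming the SessionConfig/SessionContext builder API; method names have moved around across DataFusion versions (older releases used ExecutionConfig/ExecutionContext), so treat this as illustrative rather than exact.

```rust
// Minimal sketch, assuming the SessionConfig/SessionContext builder API;
// exact names differ across DataFusion versions. The values are examples only.
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let config = SessionConfig::new()
        .with_target_partitions(8) // roughly a degree of parallelism
        .with_batch_size(8192);    // rows per record batch
    let ctx = SessionContext::with_config(config);

    let df = ctx.sql("SELECT 1 AS one").await?;
    df.show().await?;
    Ok(())
}
```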

yjshen removed the api change label Apr 7, 2022
yjshen marked this pull request as ready for review April 7, 2022 08:41
yjshen changed the title from "WIP: Finer-grained parallelism for Parquet Scan" to "Make it possible to only scan part of a parquet file in a task" Apr 7, 2022
yjshen changed the title from "Make it possible to only scan part of a parquet file in a task" to "Make it possible to only scan part of a parquet file in a partition" Apr 7, 2022
@yjshen (Member, Author) commented Apr 7, 2022

Since @tustvold is working on the new task scheduler in DataFusion, I have kept this PR to physical plan changes only, leaving the planner and API unchanged.

The current changes are still helpful for downstream projects like Ballista or our Blaze, where query planning is done separately, regardless of the approach we take later in DataFusion core. In Blaze, we use the Spark way of deciding InputSplits based on total dataset size, assigning parts of a big parquet file to multiple tasks.

@alamb @tustvold @liukun4515, please let me know what you think about this.

@tustvold (Contributor) commented Apr 7, 2022

Makes sense to me. Regardless of what happens with scheduling, having a mechanism to cheaply subdivide the input streams directly, as opposed to streaming the output through a repartitioning operator, seems like a useful feature to have.

My expectation is that scheduling will help with over-provisioned parallelism in the plan, but we will still need mechanisms to express that parallelism in the first place 👍

If I have time, I'll take a look at this later today if nobody gets there first.

@alamb (Contributor) left a comment:

Looks like a good change to me. All that is missing, I think, is tests.

My summary of this setting is that it would allow a user to get more parallelism in a plan by explicitly creating more partitions.

I believe @tustvold is working on an alternate approach in #2079 and elsewhere that would decouple a plan's parallelism from its declared number of partitions, which might make this setting less valuable.

(Review comment on datafusion/core/src/datasource/listing/mod.rs, resolved)
@yjshen (Member, Author) commented Apr 12, 2022

@alamb @tustvold, could you please take another look at this? Thanks!

@tustvold (Contributor) left a comment:

This looks good to me; sorry for the delay in re-reviewing.

yjshen merged commit e7b08ed into apache:master Apr 14, 2022
@alamb (Contributor) commented Apr 14, 2022

🎉

yjshen deleted the parquet_range_scan branch April 22, 2022 08:30