feat: make `table.read_partitions` distributed #7805

BohuTANG · 2022-09-22T08:28:23Z

Summary

`table.read_partitions` can perform many IO operations, such as min-max index filtering and bloom filter index filtering.
If a table has many partitions, `read_partitions` becomes very slow.

To make it distributed, we can:

1. Have `read_partitions` return segments instead of partitions when there are more than 1000 segments
2. Distribute the partitions across the cluster
3. In `read2`, check whether each file is a segment file or a partition file
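
The three steps above can be sketched roughly as follows. This is only an illustration of the idea, not Databend's actual implementation: the `Part` enum, `prune_segment`, `distribute`, `resolve_on_node`, and the `LAZY_SEGMENT_THRESHOLD` constant are all hypothetical names, and the pruning step is a stand-in that fabricates two blocks per segment instead of doing real index IO.

```rust
// Hypothetical types for illustration only; these are not Databend's actual API.
#[derive(Debug, Clone, PartialEq)]
enum Part {
    /// A segment whose blocks have not been pruned yet (cheap to produce).
    Segment(String),
    /// A concrete partition (block) that is ready to be read.
    Partition(String),
}

/// Assumed threshold, taken from the "more than 1000 segments" rule in the proposal.
const LAZY_SEGMENT_THRESHOLD: usize = 1000;

/// Stand-in for the expensive pruning step (min-max / bloom filter index IO).
/// Here each segment simply expands to two made-up partitions.
fn prune_segment(segment: &str) -> Vec<Part> {
    (0..2)
        .map(|i| Part::Partition(format!("{segment}/block_{i}")))
        .collect()
}

/// Step 1: above the threshold, return segments as-is and defer pruning;
/// otherwise prune eagerly and return partitions.
fn read_partitions(segments: &[String], threshold: usize) -> Vec<Part> {
    if segments.len() > threshold {
        segments.iter().cloned().map(Part::Segment).collect()
    } else {
        segments.iter().flat_map(|s| prune_segment(s)).collect()
    }
}

/// Step 2: spread the parts over the cluster nodes in round-robin order.
fn distribute(parts: Vec<Part>, nodes: usize) -> Vec<Vec<Part>> {
    assert!(nodes > 0, "cluster must have at least one node");
    let mut buckets = vec![Vec::new(); nodes];
    for (i, part) in parts.into_iter().enumerate() {
        buckets[i % nodes].push(part);
    }
    buckets
}

/// Step 3: on each node, check whether a part is a segment or a partition.
/// Segments are pruned locally; partitions pass through unchanged.
fn resolve_on_node(parts: Vec<Part>) -> Vec<Part> {
    parts
        .into_iter()
        .flat_map(|part| match part {
            Part::Segment(s) => prune_segment(&s),
            partition @ Part::Partition(_) => vec![partition],
        })
        .collect()
}
```

With this shape, the coordinator only lists segments when the table is large, and the index IO happens in parallel inside `resolve_on_node` on each node. A caller would chain the steps as `distribute(read_partitions(&segments, LAZY_SEGMENT_THRESHOLD), node_count)` and then run `resolve_on_node` on every bucket.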


BohuTANG · 2022-09-22T08:28:32Z

cc @dantengsky

Xuanwo · 2022-09-23T09:23:34Z

I expect to decouple ReadDataSourcePlan from the Table API in #7816.

Please let me know if there is anything I can help with. @zhang2014

BohuTANG · 2022-10-08T05:18:49Z

Implemented in #7867. cc @zhang2014

BohuTANG added the C-improvement Category: improvement label Sep 22, 2022

BohuTANG assigned zhang2014 Sep 23, 2022

BohuTANG mentioned this issue Sep 23, 2022

Tracking: Large dataset insert and read #7823

Closed

50 tasks

BohuTANG closed this as completed Oct 8, 2022
