Table Scan Enhancement Plan #944

Closed
7 tasks done
yjshen opened this issue Aug 25, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@yjshen
Member

yjshen commented Aug 25, 2021

To make DataFusion scan tables more flexibly and efficiently, we can take several further steps. (I have linked the issues I am aware of.)

Capability:

Performance:

@yjshen yjshen added the enhancement New feature or request label Aug 25, 2021
@yjshen
Member Author

yjshen commented Aug 25, 2021

Also, from my own perspective, I have some extensions to make:

  • Datafusion-hdfs connector
  • Datafusion-hive metastore connector

@yjshen
Member Author

yjshen commented Aug 26, 2021

@rdettai I listed some of the ideas we discussed in #811, along with others I was aware of. Please help extend the list if you have time :). Thanks!

@rdettai
Contributor

rdettai commented Aug 27, 2021

Thanks a lot for this list!

  • How do you plan to implement the Hive metastore connector? As an ObjectStore implementation?
  • As I commented in "ObjectStore API to read from remote storage systems" (#950), I would also add "Filter pushdown to the listing of files". It is complementary to "Enable reading partitioned table". You will also need it to implement Hive metastore connector functionality such as listPartitionsByFilter.
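To illustrate what "filter pushdown to the listing of files" could mean for a Hive-style partitioned layout, here is a minimal sketch. It is not DataFusion code: `PartitionFilter`, `parse_partition_values`, and `prune_listing` are hypothetical names, the `key=value` directory layout is an assumption, and only equality filters are handled (real pushdown would operate on query expressions).

```rust
use std::collections::HashMap;

/// Hypothetical equality filter on a partition column, e.g. `year = 2021`.
/// This is a stand-in for a real pushed-down filter expression.
#[derive(Debug, Clone)]
pub struct PartitionFilter {
    pub column: String,
    pub value: String,
}

/// Extract Hive-style `key=value` segments from a path such as
/// `t/year=2021/month=08/part-0.parquet` (assumed layout).
pub fn parse_partition_values(path: &str) -> HashMap<&str, &str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Keep only the paths whose partition values satisfy every filter.
/// A path that lacks a filtered column is kept, because it cannot be
/// ruled out from the path alone.
pub fn prune_listing<'a>(paths: &[&'a str], filters: &[PartitionFilter]) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|path| {
            let values = parse_partition_values(path);
            filters.iter().all(|f| match values.get(f.column.as_str()) {
                Some(v) => *v == f.value,
                None => true,
            })
        })
        .collect()
}
```

With such pruning applied at listing time, files under non-matching partitions are never opened, which is the same idea a metastore call like listPartitionsByFilter serves on the catalog side.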

@yjshen
Member Author

yjshen commented Aug 27, 2021

> how do you plan to implement the hive metastore connector? as an ObjectStore implementation?

As a CatalogProvider implementation, I suppose.

@houqp
Member

houqp commented Aug 28, 2021

Here is my understanding based on @yjshen's PRs so far:

  • CatalogProvider is a mapping from schema name to SchemaProvider (schema here refers to a collection of tables, not a table schema).
  • SchemaProvider is a mapping from table name to TableProvider.
  • TableProvider provides table partitions, field schemas, and a scan method that performs the table scan, using ObjectStore to drive the IO.
    • TableProvider::scan takes pushed-down filter expressions as an argument, then issues the corresponding ObjectStore list and read calls to perform the minimal IO needed to fetch the data from the object store.

I am guessing a Hive metastore connector will need to touch both SchemaProvider and TableProvider. I do think ObjectStore is too low-level for implementing a Hive metastore connector.
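The lookup chain described above can be sketched roughly as follows. The trait names match the ones in this thread, but the signatures are deliberately simplified stand-ins and not the actual DataFusion API; `MemoryCatalog`, `MemorySchema`, `FileTable`, and `resolve` are hypothetical, and filters are modeled as plain path substrings such as `year=2021`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in: a table exposes its fields and a scan.
pub trait TableProvider {
    /// Field names (stand-in for a real table schema).
    fn fields(&self) -> Vec<String>;
    /// Return the files a scan would read, given pushed-down partition
    /// filters modeled as substrings the file path must contain.
    fn scan(&self, partition_filters: &[&str]) -> Vec<String>;
}

/// Simplified stand-in: maps a table name to a TableProvider.
pub trait SchemaProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

/// Simplified stand-in: maps a schema name to a SchemaProvider.
pub trait CatalogProvider {
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>>;
}

/// Hypothetical table backed by a fixed list of file paths.
pub struct FileTable {
    pub fields: Vec<String>,
    pub files: Vec<String>,
}

impl TableProvider for FileTable {
    fn fields(&self) -> Vec<String> {
        self.fields.clone()
    }

    fn scan(&self, partition_filters: &[&str]) -> Vec<String> {
        self.files
            .iter()
            .filter(|file| partition_filters.iter().all(|f| file.contains(*f)))
            .cloned()
            .collect()
    }
}

/// Hypothetical in-memory schema (a collection of tables).
pub struct MemorySchema {
    pub tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl SchemaProvider for MemorySchema {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).cloned()
    }
}

/// Hypothetical in-memory catalog (a collection of schemas).
pub struct MemoryCatalog {
    pub schemas: HashMap<String, Arc<dyn SchemaProvider>>,
}

impl CatalogProvider for MemoryCatalog {
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>> {
        self.schemas.get(name).cloned()
    }
}

/// Resolve `schema.table` by walking catalog, then schema, then table.
pub fn resolve(
    catalog: &dyn CatalogProvider,
    schema: &str,
    table: &str,
) -> Option<Arc<dyn TableProvider>> {
    catalog.schema(schema)?.table(table)
}
```

Under this simplified model, a metastore-backed connector would replace the two in-memory maps with lookups against the metastore, while the file IO stays behind the table's scan.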

@yjshen
Member Author

yjshen commented Aug 28, 2021

Thanks @houqp for the detailed explanation. After checking the delta-rs code as well as what we currently do in DataFusion, I have several thoughts:

  • It's more natural for TableProvider to deal with filters, since it is the abstraction over a table and therefore also the suitable entity to own table partitions (inferred or user-provided). We could enhance TableProvider with per-partition laziness for partition file listing and file metadata extraction.
  • max_partition shouldn't be tied to a table. It is more reasonable to deduce it or set it manually during the planning phase and pass it to ParquetExec, where it actually takes effect.
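A rough sketch of the two points above, under stated assumptions: `LazyPartition` and `group_files` are hypothetical names, the lister closure stands in for an object store list call, and round-robin grouping is just one possible way a planner could honor a partition limit. None of this is the actual DataFusion or ParquetExec implementation.

```rust
/// Hypothetical table partition whose file list is fetched only on first use.
pub struct LazyPartition {
    pub path: String,
    files: Option<Vec<String>>,
}

impl LazyPartition {
    pub fn new(path: &str) -> Self {
        Self { path: path.to_string(), files: None }
    }

    /// True once the partition's files have been listed.
    pub fn is_listed(&self) -> bool {
        self.files.is_some()
    }

    /// List files through `list` on the first call; later calls reuse the
    /// cached result and do not invoke `list` again.
    pub fn files(&mut self, list: &dyn Fn(&str) -> Vec<String>) -> &[String] {
        let path = &self.path;
        self.files.get_or_insert_with(|| list(path)).as_slice()
    }
}

/// Planning-time step: distribute files round-robin into at most
/// `max_partitions` groups. The limit is an argument of the planner here,
/// not a property stored on the table.
pub fn group_files(files: &[String], max_partitions: usize) -> Vec<Vec<String>> {
    if files.is_empty() {
        return Vec::new();
    }
    let n = max_partitions.max(1).min(files.len());
    let mut groups = vec![Vec::new(); n];
    for (i, file) in files.iter().enumerate() {
        groups[i % n].push(file.clone());
    }
    groups
}
```

Partitions pruned by filters before `files` is called never trigger a listing, and the same table can be planned with different partition limits.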

@alamb
Contributor

alamb commented Aug 28, 2021

Thanks @yjshen, I plan to review this carefully tomorrow.

@Dandandan
Contributor

@yjshen this seems mostly complete now?

@alamb
Contributor

alamb commented Oct 17, 2022

I agree this is basically complete, so I am closing it. Please feel free to reopen it or file a new issue for any remaining work.

@alamb alamb closed this as completed Oct 17, 2022
5 participants