feat: query data from S3 location or stage #7211

Closed
Tracked by #7592
BohuTANG opened this issue Aug 20, 2022 · 6 comments
Labels
A-query Area: databend query C-feature Category: feature

Comments

@BohuTANG
Member

Summary

Make Databend work as a query engine: query data directly from an S3 location or a stage.

Reference:
https://docs.snowflake.com/en/user-guide/querying-stage.html

@BohuTANG added the C-feature (Category: feature) label Aug 20, 2022
@Xuanwo
Member

Xuanwo commented Aug 22, 2022

The potential I can see in this feature:

  • Query data from Dropbox/Google Drive: we can empower personal and enterprise users without complex infrastructure.
  • Load data in the background: users query as normal while the data is copied to Databend Cloud at the same time. Once the load is ready, users can query more efficiently.

@doki23
Contributor

doki23 commented Aug 22, 2022

/assignme

@doki23
Contributor

doki23 commented Aug 22, 2022

Hmm, is /assignme invalid?

@BohuTANG
Member Author

  • Load data in the background: users query as normal while the data is copied to Databend Cloud at the same time. Once the load is ready, users can query more efficiently.

There is no COPY here; we can transform the Parquet files into fuse engine files directly. For example:

Users can create a table:

CREATE TABLE xx ... LOCATION='s3://<user-bucket-path>' CONNECTION=...

If the location contains Parquet files that were not created by the fuse engine, we can query them in the normal way:

  1. List all the Parquet files.
  2. Query them without any optimization (since they do not have fuse indexes).
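
As a rough illustration of the two steps above (not Databend's actual implementation), the sketch below models an unindexed location as an in-memory mapping from hypothetical file names to rows. The names `Location`, `list_parquet_files`, and `full_scan` are made up for this example. The point is that without fuse indexes, every file has to be listed and read for any predicate.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an S3 location: file name -> rows (plain ints here).
Location = Dict[str, List[int]]


def list_parquet_files(location: Location) -> List[str]:
    """Step 1: list every Parquet file under the location."""
    return sorted(name for name in location if name.endswith(".parquet"))


def full_scan(
    location: Location, predicate: Callable[[int], bool]
) -> Tuple[List[int], int]:
    """Step 2: with no indexes, read every file and filter row by row.

    Returns the matching rows and the number of files that were read.
    """
    files = list_parquet_files(location)
    rows = [v for name in files for v in location[name] if predicate(v)]
    return rows, len(files)
```

However selective the predicate is, the number of files read always equals the number of Parquet files in the location.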

If the user does some optimization like:

optimize table xx; -- this statement syntax is a demo

We can:

  1. Create min/max and all other fuse indexes for the Parquet files without loading them.
  2. Convert all Parquet files to fuse engine files, and store some metadata in metasrv.
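
To show why a min/max index helps, here is a minimal sketch under stated assumptions: it is not the fuse engine's real index format, and `MinMax`, `build_min_max_index`, and `prune_for_range` are hypothetical names. Parquet footers typically carry per-column statistics, which is one way such an index could be built without loading the data; for simplicity the sketch computes the statistics from in-memory values instead.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MinMax:
    min: int
    max: int


def build_min_max_index(files: Dict[str, List[int]]) -> Dict[str, MinMax]:
    """Collect per-file min/max once, e.g. during an optimize step.

    Empty files get no entry, so they are never selected for scanning.
    """
    return {
        name: MinMax(min(vals), max(vals)) for name, vals in files.items() if vals
    }


def prune_for_range(index: Dict[str, MinMax], low: int, high: int) -> List[str]:
    """Keep only files whose [min, max] overlaps the queried range [low, high]."""
    return sorted(
        name for name, s in index.items() if s.max >= low and s.min <= high
    )
```

A query for values between 10 and 22 would then skip any file whose maximum is below 10 or whose minimum is above 22, instead of reading every file.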

I think @dantengsky has some ideas on this.

@BohuTANG
Member Author

Hi @doki23,

This feature involves a lot of refactoring of databend-query internal modules (such as the planner's bind_sql and schema inference), so it's hard to do right now.
So let's re-assign this issue to @youngsofun. He will start the task and complete the first phase: querying Parquet files from a stage/location.

cc @Xuanwo @sundy-li @dantengsky

@BohuTANG assigned youngsofun and unassigned doki23 Nov 21, 2022
@doki23
Contributor

doki23 commented Nov 21, 2022

Got it.
