Persistent file based catalog #122
Comments
Regarding the point on storing data - I wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.
That is an excellent idea -- it would be really sweet to have an excuse to work on that API (I bet it is not used anywhere near as much as it could be)
I have subsequently learned about the However, one difference is that the config file basically applies to all sessions, whereas this catalog would be very explicitly selected by a user when running Thus it seems like the config file would be a great place to put credentials, for example, that you wanted to apply to all catalogs / sessions
There are two different setups that I think are relevant to this feature:
For example, this is mine:
I think the setup for this could be improved (it's still based on the old setup from a couple of years ago). I think moving the file to One thing to note, I have been having trouble getting the custom catalog / schemas to work (I created an issue for it). Maybe it is just a user error, but I haven't had time to look too deeply into it.
I agree that there is an important distinction between "configuration" (with e.g. credentials) that potentially applies to all sessions and the DDL/catalog setup part. I think some systems permit creating credentials that are stored as part of the catalog ( I'll have to mess with it to see what is happening
I have a similar solution that I use internally at my company, and we are open-sourcing it in a tool named [Kapôt](https://github.com/andreclaudino/datafusion-kapot) (a fork of Ballista). The idea is to allow YAML definitions for tables, optionally including partitions, fields, and types. What do you think about this approach? It would be great to make a pull request. (Actually, I am considering changing the DataFusion distribution of [Kapôt](https://github.com/andreclaudino/datafusion-kapot) to datafusion-dft, as it already supports some things I plan to include, like WASM UDFs.)
@claudino-kognita would you be able to open an issue to discuss the YAML / config idea in more detail? We use TOML internally where I work for defining tables and it works pretty well, so I'm not strictly opposed to extending the config here to include table definitions, but I would have to think about it more. We might be able to have a feature for choosing between TOML and YAML as well, so it could be up to the user to decide the format of their config.
For reference on the WASM UDFs: that is quite new functionality that is under active development and still needs proper testing. Right now it only works with WASM primitives (i32, i64, f32, f64) per row. I think I am close to getting Arrow IPC to work and am hoping to get something done by the end of this week. After I get that working, I plan on looping the WASM UDF work back into the FlightSQL server to try integrating some notion of auth, where users only have access to the WASM functions they uploaded.
These are my own personal aspirations / goals for a "file based catalog"
Usecase
Usecase 1: pre-configured EXTERNAL TABLES
I would like to be able to set up some table definitions in dft and then reuse them from session to session
For example
CREATE EXTERNAL TABLE ... STORED AS DELTA TABLE WITH CREDENTIALS ....
And then have this configuration available to any dft session
I believe this usecase is already partly handled by the config file feature. However, there are some other things I would like:
Usecase 2: ephemeral data
Today when you run queries like this in dft
If you start another session of dft, foo is gone:
The issue is that the default catalog in datafusion is an ephemeral in-memory one, so there is no place to store data such as shown above.
Desired Behavior
What I would like is for dft to operate similarly to sqlite or duckdb.
By default, an ephemeral in-memory catalog is used and nothing is saved after the session quits
However, a database file can be "opened", and if so then all changes made to the catalog are stored in that file. If the file is reopened on a subsequent invocation of the program, all the DDL / catalog information is still present
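To illustrate the sqlite behavior referenced above, here is a minimal sketch using Python's standard `sqlite3` module. It shows sqlite itself, not dft or DataFusion; the helper names (`table_names`, `create_foo`) and the file name `catalog.db` are made up for the example.

```python
import os
import sqlite3
import tempfile


def table_names(conn):
    """Return the names of all tables in this connection's catalog."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


def create_foo(conn):
    """Run some DDL and DML, standing in for what a user would type in a session."""
    conn.execute("CREATE TABLE foo (x INTEGER)")
    conn.execute("INSERT INTO foo VALUES (1), (2)")
    conn.commit()


# Default mode: an ephemeral in-memory catalog. Nothing outlives the connection.
mem = sqlite3.connect(":memory:")
create_foo(mem)
mem.close()

mem_again = sqlite3.connect(":memory:")  # a brand new, empty catalog
memory_tables_after_reopen = table_names(mem_again)
mem_again.close()

# "Opened" database file: catalog changes are stored in the file.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "catalog.db")  # hypothetical file name

first = sqlite3.connect(db_path)
create_foo(first)
first.close()

second = sqlite3.connect(db_path)  # simulates a later invocation of the program
file_tables_after_reopen = table_names(second)
second.close()

print("in-memory, after reopen:", memory_tables_after_reopen)
print("file-backed, after reopen:", file_tables_after_reopen)
```

The in-memory connection loses `foo` as soon as it is closed, while the file-backed one still lists it after reopening. That is the split between the default and "opened" modes described here.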
Something like