Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make all SchemaProvider trait APIs async #10339

Closed
metesynnada opened this issue May 2, 2024 · 5 comments · Fixed by #13800
Closed

Make all SchemaProvider trait APIs async #10339

metesynnada opened this issue May 2, 2024 · 5 comments · Fixed by #13800
Labels
enhancement New feature or request

Comments

@metesynnada
Copy link
Contributor

metesynnada commented May 2, 2024

Is your feature request related to a problem or challenge?

Currently, only table API is async, but register_table, deregister_table, table_names, and table_exist can be also async. Also, most of the usage is under async methods, thus there shouldn't be a big issue implementing it.

However, this can be a huge change for the downstream usage, this is why I need some insides before me or someone else implements it.

cc @Dandandan @alamb

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@metesynnada metesynnada added the enhancement New feature or request label May 2, 2024
@matthewmturner
Copy link
Contributor

This one is actually quite interesting. I hadnt gotten around to it yet but I was going to propose having both sync and async apis for SchemaProvider (I think theres valid use cases for both). I just recently ran into an issue where I wanted the ctx.table_provider and therefore table method on SchemaProvider to be sync (as part of logical plan serialization which is sync). How would you feel about having both sync and async APIs?

@liurenjie1024
Copy link
Contributor

+1 for this proposal. We ran into similar problems when implementing datafusion/iceberg integrations: apache/iceberg-rust#324

It's more nature to allow catalog/schema operations to be async.

@alamb
Copy link
Contributor

alamb commented May 2, 2024

Yes, I agree allowing catalog APIs to be async makes sense to me

One rationale against async, as I recall from @tustvold , was to encourage people to implement efficient catalog implementations (e.g. for many cases calling a network call for table_exists might be slow)

However, my personal opinion is that such encouragement can be done via documentation and if people want to implement RPC network calls during planning then the APIs shouldn't stop them

I think the biggest challenge is, as @metesynnada hints at above, the viral nature of async -- if we make such APIs async then everywhere they are called must also be be async -- I haven't looked at how far down the stack that is but it could be substantail.

An alternate approach might be to implement, via some hackery and tokio channels, an struct that implements the SchemaProvider without changes (sync) but can call async methods (though that would block the runtime thread 🤔 )

@tustvold
Copy link
Contributor

tustvold commented May 2, 2024

I seem to remember omitting the methods other than table due to the sheer size of the resulting change, I have no objection to making them async

@westonpace
Copy link
Member

westonpace commented Nov 27, 2024

I've made a stab at this.

However, my personal opinion is that such encouragement can be done via documentation and if people want to implement RPC network calls during planning then the APIs shouldn't stop them

The easiest way we've found to do this kind of thing is with a metadata cache. However, this cache gets invalidated and has cold start, etc. The problem with "warming the cache prior to the query" is that it is very difficult to determine which entries will be required by an SQL string. Loading the entire catalog into memory for a single query is prohibitively expensive for us.

I think the biggest challenge is, as @metesynnada hints at above, the viral nature of async -- if we make such APIs async then everywhere they are called must also be be async -- I haven't looked at how far down the stack that is but it could be substantail.

Yes 😰

Although, most of the "down the stack" changes were not for the query side of the catalog but for the management side of the catalog (e.g. registering and deregistering tables) and this isn't something that is really relevant for caching anyways.

An alternate approach might be to implement, via some hackery and tokio channels, an struct that implements the SchemaProvider without changes (sync) but can call async methods (though that would block the runtime thread 🤔 )

I could find no reasonable way to implement such hackery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
6 participants