-
Notifications
You must be signed in to change notification settings - Fork 876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discussion] Object Store Composition #7171
Comments
Just some questions since I'm still fairly green when it comes to really understanding the intricacies of Rust's type system...
I'm not sure what you mean when you say that wrappers must avoid de-specializing. Does that have something to do with compiler optimizations?
So this would also involve using something like the |
They might provide their own implementations of things like delete_stream, the wrappers must therefore call through to this instead of relying on the default implementation
Perhaps, but given things like AsyncFileReader are per-file, it may be that they can just be constructed with the relevant context |
I basically agree with this statement of challenge, though I am not sure how hard it actually is in practice (having done it myself and seen various different versions of it)
I think the parquet reader also needs to be able to get the total file sizes as well (at least unless the negative ranges to fetch end bytes is supported). Adding support to
I would say "perverse" is somewhat subjective. It is certainly complex but that also needs to be measured against
This feels like a good idea to me
I think OpenDAL https://github.com/apache/opendal is trying to provide generic all-purpose IO subsystem abstraction So my personal recommendation is
|
It occurs to me this might be a great time to provide easier integration for using parquet-rs reader with OpenDAL directly (as in have a |
I will let @Xuanwo weigh in here, but I think OpenDAL is in a very similar place to ObjectStore w.r.t this, and has very similar issues. Both are abstractions for data "access", not a general IO subsystem abstraction. Edit: Ultimately something has to glue together downstream abstractions, e.g. in the context of #7135 providing a way to connect DF's SessionContext through to some IO subsystem. Either DF needs to overload some existing interface e.g. ObjectStore/OpenDAL inevitably leading to challenges like #7155 or it needs to define its own mechanism. In the case of parquet and AysncFileReaderFactory, this interface already exists we just need to point people at it.
This isn't a requirement - see here
I remain pretty lukewarm on this, given parquet-opendal already does this. |
I agree we are going to need some sort of API to pass context through the various API layers 👍 |
Thank you @tustvold for inviting me to join this discussion. I believe we should build If this becomes a reality, DataFusion can design the abstraction based on its own requirements without having to push everything upstream to We can start by aliasing the |
Problem
Initially the ObjectStore API was relatively simple, consisting of a few methods to interact with object stores. As such many systems took this abstraction and used it as a generic IO abstraction, this is good and what the crate was designed for.
As people wanted additional functionality, such as instrumentation, caching or concurrency limiting, this was implemented by creating ObjectStore implementations that wrap the existing ones. Again this worked well.
However, over time the ObjectStore API has grown, and now has 8 required methods and a further 10 methods with default implementations. This creates a number of challenges for this wrapper based approach for composition.
API Surface
As a wrapper must avoid "despecializing" methods, it must implement all 18 methods. Not only is this burdensome, but creates upgrade hazards as new methods are added, potentially in non-breaking versions.
Additional Context
As the logic within these wrappers has grown more complex, there comes the need to pass additional information through to this logic. This motivates requests like #7155
Interface Creep
In many places the ObjectStore interface gets used as the abstraction for components that don't actually require the full breadth of ObjectStore functionality. There is no need, for example, for a parquet reader to depend on more than the ability to fetch ranges of bytes.
This leads to perverse "ObjectStore" implementations, that actually only implement say get functionality. Similarly in contexts like apache/datafusion#14286 it creates complexities around how to shim the full ObjectStore interface, despite the actual operators in question only using a very small subset of this functionality.
Request Correlation
As the ObjectStore logic has gotten more sophisticated, incorporating automatic retries, request batching, etc... the relationship between an ObjectStore method call and requests has gotten rather fuzzy. This makes implementing instrumentation, concurrency limiting, tokio task dispatch, etc... at this API boundary increasingly inaccurate/problematic.
Thoughts
I personally think we should encourage a move away from this wrapper based form of composition and instead do the following:
I can't help feeling right now ObjectStore is stuck between trying to expose the functionality of ObjectStore's in a portable and ergonomic fashion, whilst also trying to provide some sort of generic all-purpose IO subsystem abstraction, which I'm not sure aren't incompatible goals....
Tagging @alamb @crepererum @Xuanwo @waynr @kylebarron
The text was updated successfully, but these errors were encountered: