Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Object Store Composition #7171

Open
tustvold opened this issue Feb 21, 2025 · 8 comments
Open

[Discussion] Object Store Composition #7171

tustvold opened this issue Feb 21, 2025 · 8 comments
Labels
object-store Object Store Interface question Further information is requested

Comments

@tustvold
Copy link
Contributor

Problem

Initially the ObjectStore API was relatively simple, consisting of a few methods to interact with object stores. As such many systems took this abstraction and used it as a generic IO abstraction, this is good and what the crate was designed for.

As people wanted additional functionality, such as instrumentation, caching or concurrency limiting, this was implemented by creating ObjectStore implementations that wrap the existing ones. Again this worked well.

However, over time the ObjectStore API has grown, and now has 8 required methods and a further 10 methods with default implementations. This creates a number of challenges for this wrapper based approach for composition.

API Surface

As a wrapper must avoid "despecializing" methods, it must implement all 18 methods. Not only is this burdensome, but creates upgrade hazards as new methods are added, potentially in non-breaking versions.

Additional Context

As the logic within these wrappers has grown more complex, there comes the need to pass additional information through to this logic. This motivates requests like #7155

Interface Creep

In many places the ObjectStore interface gets used as the abstraction for components that don't actually require the full breadth of ObjectStore functionality. There is no need, for example, for a parquet reader to depend on more than the ability to fetch ranges of bytes.

This leads to perverse "ObjectStore" implementations, that actually only implement say get functionality. Similarly in contexts like apache/datafusion#14286 it creates complexities around how to shim the full ObjectStore interface, despite the actual operators in question only using a very small subset of this functionality.

Request Correlation

As the ObjectStore logic has gotten more sophisticated, incorporating automatic retries, request batching, etc... the relationship between an ObjectStore method call and requests has gotten rather fuzzy. This makes implementing instrumentation, concurrency limiting, tokio task dispatch, etc... at this API boundary increasingly inaccurate/problematic.

Thoughts

I personally think we should encourage a move away from this wrapper based form of composition and instead do the following:

  • Encourage use of specialized traits like parquet's AsyncFileReader that reflect what a given component actually needs, and can evolve independently of ObjectStore
  • Add additional functionality for injecting logic into the HTTP request path (Decouple ObjectStore from Reqwest / Generic HTTP Client Support #6056) allowing
    • More accurate instrumentation
    • More accurate concurrency limiting
    • Potential sophistication w.r.t tokio runtime dispatch

I can't help feeling right now ObjectStore is stuck between trying to expose the functionality of ObjectStore's in a portable and ergonomic fashion, whilst also trying to provide some sort of generic all-purpose IO subsystem abstraction, which I'm not sure aren't incompatible goals....

Tagging @alamb @crepererum @Xuanwo @waynr @kylebarron

@tustvold tustvold added object-store Object Store Interface question Further information is requested labels Feb 21, 2025
@waynr
Copy link

waynr commented Feb 21, 2025

Just some questions since I'm still fairly green when it comes to really understanding the intricacies of Rust's type system...

As a wrapper must avoid "despecializing" methods, it must implement all 18 methods

I'm not sure what you mean when you say that wrappers must avoid de-specializing. Does that have something to do with compiler optimizations?

Encourage use of specialized traits like parquet's AsyncFileReader that reflect what a given component actually needs, and can evolve independently of ObjectStore

So this would also involve using something like the ParquetFileReaderFactory, right? And that's the level at which, in the case of a caching implementation that I described in #7135 and #7155, I would need to have session state/config information available to pass to a custom implementation to get properly-parented spans? It looks like this and related interfaces don't currently support accepting opaque contextual data but maybe you're suggesting they are more open to that kind of change than ObjectStore?

@tustvold
Copy link
Contributor Author

de-specializing

They might provide their own implementations of things like delete_stream, the wrappers must therefore call through to this instead of relying on the default implementation

It looks like this and related interfaces don't currently support accepting opaque contextual data but maybe you're suggesting they are more open to that kind of change than ObjectStore?

Perhaps, but given things like AsyncFileReader are per-file, it may be that they can just be constructed with the relevant context

@alamb
Copy link
Contributor

alamb commented Feb 23, 2025

However, over time the ObjectStore API has grown, and now has 8 required methods and a further 10 methods with default implementations. This creates a number of challenges for this wrapper based approach for composition.

I basically agree with this statement of challenge, though I am not sure how hard it actually is in practice (having done it myself and seen various different versions of it)

Interface Creep

In many places the ObjectStore interface gets used as the abstraction for components that don't actually require the full breadth of ObjectStore functionality. There is no need, for example, for a parquet reader to depend on more than the ability to fetch ranges of bytes.

I think the parquet reader also needs to be able to get the total file sizes as well (at least unless the negative ranges to fetch end bytes is supported). Adding support to

This leads to perverse "ObjectStore" implementations, that actually only implement say get functionality. Similarly in contexts like apache/datafusion#14286 it creates complexities around how to shim the full ObjectStore interface, despite the actual operators in question only using a very small subset of this functionality.

I personally think we should encourage a move away from this wrapper based form of composition and instead do the following:

  • Encourage use of specialized traits like parquet's AsyncFileReader that reflect what a given component actually needs, and can evolve independently of ObjectStore

I would say "perverse" is somewhat subjective. It is certainly complex but that also needs to be measured against

  1. The complexity of other alternatives
  2. The complexity of the problem being solved
  3. The complexity of using the API vs implementing the API. For example, the AsyncFileReader in parquet specifically adds non trivial complexity to using the reader

This feels like a good idea to me

I can't help feeling right now ObjectStore is stuck between trying to expose the functionality of ObjectStore's in a portable and ergonomic fashion, whilst also trying to provide some sort of generic all-purpose IO subsystem abstraction, which I'm not sure aren't incompatible goals....

I think OpenDAL https://github.com/apache/opendal is trying to provide generic all-purpose IO subsystem abstraction

So my personal recommendation is

  1. Keep the object store API the same / don't expand it
  2. Support Decouple ObjectStore from Reqwest / Generic HTTP Client Support #6056
  3. Add additional documentation / examples for more advanced functionality
  4. Point people at OpenDAL if they need more advanced features

@alamb
Copy link
Contributor

alamb commented Feb 23, 2025

It occurs to me this might be a great time to provide easier integration for using parquet-rs reader with OpenDAL directly (as in have a open-dal feature for parquet-rs. I think @Xuanwo has proposed this in the past.

@tustvold
Copy link
Contributor Author

tustvold commented Feb 23, 2025

I think OpenDAL https://github.com/apache/opendal is trying to provide generic all-purpose IO subsystem abstraction

I will let @Xuanwo weigh in here, but I think OpenDAL is in a very similar place to ObjectStore w.r.t this, and has very similar issues. Both are abstractions for data "access", not a general IO subsystem abstraction.

Edit: Ultimately something has to glue together downstream abstractions, e.g. in the context of #7135 providing a way to connect DF's SessionContext through to some IO subsystem. Either DF needs to overload some existing interface e.g. ObjectStore/OpenDAL inevitably leading to challenges like #7155 or it needs to define its own mechanism. In the case of parquet and AysncFileReaderFactory, this interface already exists we just need to point people at it.

I think the parquet reader also needs to be able to get the total file sizes as well (at least unless the negative ranges to fetch end bytes is supported). Adding support to

This isn't a requirement - see here

provide easier integration for using parquet-rs reader with OpenDAL directly

I remain pretty lukewarm on this, given parquet-opendal already does this.

@tustvold
Copy link
Contributor Author

Support #6056

Actually having started playing around with this, we end up with the same problem here that we end up needing to pass context through to these layers. So perhaps this is really just moving the problem, and we're going to end up with something like #7155 regardless 🤔

@alamb
Copy link
Contributor

alamb commented Feb 23, 2025

Support #6056

Actually having started playing around with this, we end up with the same problem here that we end up needing to pass context through to these layers. So perhaps this is really just moving the problem, and we're going to end up with something like #7155 regardless 🤔

I agree we are going to need some sort of API to pass context through the various API layers 👍

@Xuanwo
Copy link
Member

Xuanwo commented Feb 23, 2025

Edit: Ultimately something has to glue together downstream abstractions, e.g. in the context of #7135 providing a way to connect DF's SessionContext through to some IO subsystem. Either DF needs to overload some existing interface e.g. ObjectStore/OpenDAL inevitably leading to challenges like #7155 or it needs to define its own mechanism. In the case of parquet and AysncFileReaderFactory, this interface already exists we just need to point people at it.

Thank you @tustvold for inviting me to join this discussion.

I believe we should build datafusion-storage primarily focused on DataFusion's own needs while maintaining datafusion-storage-object-store and datafusion-storage-opendal separately. The benefit is that users can implement innovative features like datafusion-storage-cudf or datafusion-storage-io_uring without being constrained by the current I/O abstraction of object-store or OpenDAL.

If this becomes a reality, DataFusion can design the abstraction based on its own requirements without having to push everything upstream to object_store. This would allow them to maintain useful features such as context management and add additional requirements to the trait while letting datafusion-storage-object-store and datafusion-storage-opendal handle the extra work.

We can start by aliasing the ObjectStore trait inside datafusion-storage first. I'm happy to initiate a proposal if that sounds like a good idea to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants