-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable users to add support for LIstingTable / object_store table formats of different types #8345
Comments
Thank you for bringing this up @tychoish -- I agree the current state of FileType/FileFormat is not ideal as it means, as you have pointed out, that it is not possible to use ListingTable with user defined file formats.
I would very much like to see this and I think it would result in the code being significantly more modular as well. That being said as you have discovered it is a pretty major undertaking, but I would be interested in helping make it happen / reviewing PRs and designs (though I don't think I have time at the moment to contribute code) perhaps @devinjdangelo or @tustvold have some ideas to add as well |
I guess my first question is "are Second very high level question: is there anything to be done in Secondly, (and this is somewhat orthogonal) but for the project that brought me to this point; I'm leaning towards just to implementing Would love to hear additional |
I don't see any fundamental reason why there should be multiple file format related abstractions. I think it is mostly a product of development on different parts of the codebase related to files progressing at the same time. I know I was guilty at one point of creating a third file related enum, which was consolidated into FileType later. I took a brief look at where we are matching on the FileType enum, and I don't see any reason why those sections could not be refactored as method calls on a trait object. It would probably make the most sense to add that functionality to FileFormat and consolidate all file related abstractions under one trait.
I'm not sure, but I think it makes a lot of sense to handle file type specific related operations using a single trait object, similar to how we currently do with ObjectStore.
I am not very familiar with the new StreamingTable, but it appears to be a struct rather than a trait. StreamingTable itself implements TableProvider, so I'm not quite sure how it helps here. @tustvold may know better than I on this though. |
https://docs.rs/datafusion/latest/datafusion/datasource/streaming/struct.StreamingTable.html Is simply a way to reduce the amount of boilerplate when implementing a TableProvider. It makes a lot of sense if taking this route. As for the distinction between the enumeration and trait, serialization might be part of the reason behind this design. It is certainly part of the rationale behind using ListingTableUrl to provide a serializable version of ObjectStore |
BTW I really love the idea of a trait based implementation of file formats as it would permit (eventually) pulling the format support (and its dependecies) into separate crates and would make the codebase more manageable.
I agree this sounds like a good idea. If we want to head down this route, I would suggest making a Proof of Concept (POC) PR that adds the trait and shows how it might work for one of the formats (you could temporarily add a new The POC would help figure out what areas might need additional work / that we haven't figured out yet.
@tustvold are you referring to
|
I was referring to plan serialization, it certainly isn't the only way to achieve this, but it might be part of the motivation for the current design |
Interesting, and I'm glad this is becoming more relevant! For my part, I managed to use the StreamingTable implementation to add (to GlareDB) support for reading/writing BSON without needing any of these changes,1 and I'm quite pleased with the results. It isn't a 100% replacement, through the gaps sort of feel like (to me!) cases where the FileFormat/FileType infrastructure is a bit overbuilt, though this might just be application specific concerns
Footnotes |
Is your feature request related to a problem or challenge?
I've been exploring adding new file formats and types to our system (@GlareDB), and have run into the problem where the
FileFormat
trait depends on theFileType
enum, which means it's impossible to implement entirely new filetypes except from within the package.This is sort of gnarly to unwind as I've dug in, and I don't have a particular solution that I favor.
Describe the solution you'd like
I think collapsing
FileType
intoFileFormat
is maybe my favorite solution conceptually, but I think (more narrowly) removing the references toFileType
in theFileFormat
(as in my branch) would be good.Describe alternatives you've considered
There are a few ways to resolve this:
FileFormat
(which kind of makes sense? I'm not sure that these are actually meaningfully separate abstractions.)Additional context
No response
The text was updated successfully, but these errors were encountered: