-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Standard Library IO Abstractions in Parquet #1163
Comments
I think the difference I have observed in the past was that the parquet abstractions also allow There may well be a better way to handle this |
FWIW I think it would be great to avoid all the custom abstractions 👍 |
Aah yes, the However, looking at the code I'm not sure it can actually be used concurrently. It doesn't appear to have any synchronisation, and yet is using the same underlying file descriptor. I think it might be possible to make the code race if used in such a way... Some testing is needed 🤔 Edit: In fact it is impossible to use the reader in this way as the |
I had a brief play around with this and found the following. The write side is serial, and so it should be possible to use standard library abstractions. The current trait topology will likely require using The read side is more complicated, the problem can be seen in It occurs to me that one of the things an async API able to support object stores will need is a sparse file abstraction, ultimately this is what the read path wants. I'll therefore park this for now, and see what shakes out of that. |
Also related #937 |
Hi @tustvold ! I am the weird mind who introduced the ChunkReader 😄. The main benefit compared to a regular reader is that you can specify the size of the chunk you plan to read, which enables you to set the right range when you call GET on the object store. Not sure how we could achieve this with plain Another initial goal was indeed to try to achieve parallelism between columns, but I never succeeded because the entire structure of the parquet reader was against it, and I didn't have enough Rust experience to fight it 😉. |
Hi @rdettai, I realize I never responded directly to your comment. I have been documenting my experiments on the parquet API in #1473, which I think overlaps a fair deal with the points you raise about the object store interface and introducing more parallelism into the parquet reader. I would be very interested in hearing any thoughts you might have |
Coming back to this, we now have OffsetIndex integration which lets us read pages without requiring buffered reading. With #2478 this integrates very nicely with in-memory data. If it is acceptable to pre-fetch the entire column chunk in the absence of an OffsetIndex, which FWIW is what we do with async IO, then we can likely greatly simplify the implementation, removing FileSource, etc... |
The results in tustvold/access-log-bench#2 would definitely suggest this idea may have merit |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The parquet crate has a number of IO abstractions that at least on the surface appear very similar to those in the Rust standard library.
The major distinction appears to concern mutability, with various components using a mixture of
TryClone
andRefCell
internally to placate the borrow checker. This makes the code fairly hard to reason about, as cloned file descriptors share the same seek position. Additionally it prevents creating readers from&mut File
or similar.At a more holistic level, I'm not really sure of the use-case for non-exclusive IO, but I could be missing something here?
A non-exhaustive list of potential candidates for replacement:
Postition
-std::io::seek(SeekFrom::Current(0))
Length
-std::io::seek(SeekFrom::End(0))
ChunkReader
-std::io::Read
SliceableCursor
-std::io::Cursor
ParquetReader
-std::io::Seek + std::io::Read
ParquetWriter
-std::io::Seek + std::io::Write
FileSink
-BufWriter<File>
FileSource
-BufReader<File>
Describe the solution you'd like
I would ideally like to be able to create a parquet reader with anything implementing
std::io::Seek
andstd::io::Read
.Similarly I would like to be able to create parquet writer with anything implementing
std::io::Seek
andstd::io::Write
.These are standard traits within the Rust ecosystem and supporting them will simplify inter-operation with other crates, and reduce cognitive load on users and contributors.
The blanket implementations on these traits will for free allow using mutable references, instead of owned values.
Describe alternatives you've considered
Preserve the current behaviour
The text was updated successfully, but these errors were encountered: