-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BYTE_STREAM_SPLIT encoder/decoder (fixes #208) #221
Add BYTE_STREAM_SPLIT encoder/decoder (fixes #208) #221
Conversation
return None | ||
} | ||
|
||
let mut buffer = vec![0_u8; self.element_size]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this an allocation on the heap? (the "This struct does not allocate on the heap." on the discussion) Would it make sense allocate this once, and re-use it so that we do not allocate on every call of next
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not familiar with how rust works, but I hoped because the lifetime of this slice is within the function scope hence it doesn't need to escape to the healp and would be allocated on the stack. To be honest I wouldn't know how to check this.
Ideally I would have liked to allocate a fixed size slice on Decoder itself to reuse, but seems that it's not possible do so something like:
struct Foo<T> {
buf: [u8; std::mem::size_of::<T>()]
}
I guess I'll do what you are suggesting and allocate a slice in try_new.
Hi @jorgecarleitao, can you please take another look at this? Or could you take a look @sundy-li? We have a lot of parquet files using byte stream split encoding as this significantly improves compression for floating point data, and it would be great to be able to read them with polars. |
Hi @ritchie46, do you have maintainer permissions here now and is this something you can take a look at? Or is the current plan to move to using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay. It wasn't under my radar.
Reviewed. We can make it a bit faster by removing a redundant copy.
|
||
/// Decodes according to [Byte Stream Split](https://github.com/apache/parquet-format/blob/master/Encodings.md#byte-stream-split-byte_stream_split--9). | ||
/// # Implementation | ||
/// This struct does not allocate on the heap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does? It holds a buffer
of Vec<u8>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, second review pass. :)
Maybe we can use a const T
, see comment.
element_type: PhantomData<T> | ||
} | ||
|
||
impl<'a, T: NativeType> Decoder<'a, T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maybe make this const
on mem::size_of::<T>
? Then we can use an array: [u8; Size]
instead of a heap allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm new to rust, but if I understand you correctly, I'd need to use:
#![feature(generic_const_exprs)]
and add a where clause, akin to:
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=d8088113ef4552d7f96b987ec4d31c95
Are we ok to depend on non-stable features, or is there a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If using an unstable feature isn't okay, an alternative approach could be adding an extra type parameter, but this does make the consumer API a little more verbose. Eg: adamreeve@a513845
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a new const
type parameter for that. I can take a look later if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes please, that would be great.
Apologies, this is my first encounter with the rust type system, took a while to even get the decoder to compile and stay generic.
I am sure there could be less phantom data, random .as_ref()s on a separate lines or even .try_into() etc, but someone with more experience in rust type system needs to guide my hand.
Fixes #208