Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add BYTE_STREAM_SPLIT encoder/decoder (fixes #208) #221

Conversation

AudriusButkevicius
Copy link

Apologies, this is my first encounter with the rust type system, took a while to even get the decoder to compile and stay generic.

I am sure there could be less phantom data, random .as_ref()s on a separate lines or even .try_into() etc, but someone with more experience in rust type system needs to guide my hand.

Fixes #208

return None
}

let mut buffer = vec![0_u8; self.element_size];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this an allocation on the heap? (the "This struct does not allocate on the heap." on the discussion) Would it make sense allocate this once, and re-use it so that we do not allocate on every call of next?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with how rust works, but I hoped because the lifetime of this slice is within the function scope hence it doesn't need to escape to the healp and would be allocated on the stack. To be honest I wouldn't know how to check this.

Ideally I would have liked to allocate a fixed size slice on Decoder itself to reuse, but seems that it's not possible do so something like:

struct Foo<T> {
   buf: [u8; std::mem::size_of::<T>()]
}

I guess I'll do what you are suggesting and allocate a slice in try_new.

@adamreeve
Copy link

Hi @jorgecarleitao, can you please take another look at this? Or could you take a look @sundy-li?

We have a lot of parquet files using byte stream split encoding as this significantly improves compression for floating point data, and it would be great to be able to read them with polars.

@adamreeve
Copy link

Hi @ritchie46, do you have maintainer permissions here now and is this something you can take a look at? Or is the current plan to move to using the parquet crate in Polars instead (pola-rs/polars#6735)?

Copy link
Collaborator

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. It wasn't under my radar.

Reviewed. We can make it a bit faster by removing a redundant copy.


/// Decodes according to [Byte Stream Split](https://github.com/apache/parquet-format/blob/master/Encodings.md#byte-stream-split-byte_stream_split--9).
/// # Implementation
/// This struct does not allocate on the heap.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does? It holds a buffer of Vec<u8>.

src/encoding/byte_stream_split/decoder.rs Show resolved Hide resolved
src/encoding/byte_stream_split/decoder.rs Show resolved Hide resolved
src/encoding/byte_stream_split/decoder.rs Show resolved Hide resolved
Copy link
Collaborator

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, second review pass. :)

Maybe we can use a const T, see comment.

element_type: PhantomData<T>
}

impl<'a, T: NativeType> Decoder<'a, T> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe make this const on mem::size_of::<T>? Then we can use an array: [u8; Size] instead of a heap allocation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm new to rust, but if I understand you correctly, I'd need to use:

#![feature(generic_const_exprs)]

and add a where clause, akin to:
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=d8088113ef4552d7f96b987ec4d31c95

Are we ok to depend on non-stable features, or is there a better way to do this?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using an unstable feature isn't okay, an alternative approach could be adding an extra type parameter, but this does make the consumer API a little more verbose. Eg: adamreeve@a513845

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a new const type parameter for that. I can take a look later if you want.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please, that would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for byte_stream_split encoding
4 participants