-
Notifications
You must be signed in to change notification settings - Fork 750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hadoop LZ4 Support for LZ4 Codec #3013
Conversation
* Added `CodecOptions` struct to hold `Codec` configuration. * Added `backward_compatible_lz4` option in `CodecOptions`. * Added `CodecOptions` to `ReadOptions` to be able to configure `SerializedFileReader`. * Added `SerializedRowGroupReaderOptionsBuilder` with `CodecOptions` to be able to configure `SerializedRowGroupReader`, with extensible interface. * Added `SerializedPageReaderOptionsBuilder` with `CodecOptions` to be able to configure `SerializedPageReader`, with extensible interface. * Added `new_with_config` to `SerializedPageReader` API to be able to configure `SerializedFileReader` without breaking `new` API. * `CodecOptions` implements `CopyTrait` as it is composed by `Copy` types. If in the future it contains a non `Copy` type, maybe is better to create `CodecOptionsPtr = Arc<CodecOptions>`. * `CodecOptions` is only added in the read path, in the write path the default values are taken, as the options currently only affect the read path and have no effect on write path. If required to add to write path maybe it will be nice to add into `WriteProperties`.
* Added compression and decompression for LZ4_HADOOP.
* Added a test for two parquet files with the same content, both with LZ4 CompressionCodec, but one using the LZ4_HADOOP (no-fallback) algorithm and the other LZ4_RAW algorithm (fallback to last level). * Refactor `LZ4HadoopCodec::compress` function to make it easier to understand.
Another thing I notice with using pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
let metadata = footer::parse_metadata(&chunk_reader)?;
let mut predicates = options.predicates; Of course these options are not used anymore at this point, so instead of cloning the vector of predicates we can just take it. pub fn new_with_options(chunk_reader: R, options: ReadOptions) -> Result<Self> {
let metadata = footer::parse_metadata(&chunk_reader)?;
let mut predicates = std::men::take(&mut options.predicates); |
If we want to hide the This problem also implies implementing setters multiple times, but we can use the same @tustvold what do you think? |
Sorry for the delay, I'm currently on holiday. Will try to find time to review today |
Oh, do not worry, enjoy your holidays! We review this when you are back 😄 |
This commits hides `CodecOptions` from the public API. The changes are the following: - Added a new structs to public API `ReaderProperties`, `ReaderPropertiesBuilder` and `ReaderPropertiesPtr` to store inmutable reader config, as it is the case of `CodecOptions`. - Removed `SerializedRowGroupReaderOptions`, `SerializedRowGroupReaderOptionsBuilder`, `SerializedPageReaderOptionsBuilder` and `SerializedPageReaderOptions`. They are not required anymore as `SerializedRowGroupReader` and `SerializedRowGroupReaderOptions` use `ReaderPropertiesPtr` for configuration. - `SerializedRowGroupReader::new_with_options` renamed to `SerializedRowGroupReader::new_with_properties`. - `SerializedPageReader::new_with_options` renamed to `SerializedPageReader::new_with_properties`. - Test added for `ReaderPropertiesBuilder`.
With this last change we remove Commit description:This commits hides
|
I think that now it is everything is ready to merge. Take a look when you have time and we discuss if any change has to be done. About the failing test above I think it is not caused by my commit, as I have not modify IPC. In fact, in my local Mac the test is passing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will give this another once over prior to merge, but looking good. Thank you for your work on this 👍
@@ -2422,6 +2422,76 @@ mod tests { | |||
assert_eq!(a.values(), &[42.000000, 7.700000, 42.125000, 7.700000]); | |||
} | |||
|
|||
// This test is to ensure backward compatibility, it test 2 files containing the LZ4 CompressionCodec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the tests
} | ||
|
||
/// Try to decompress the buffer as if it was compressed with the Hadoop Lz4Codec. | ||
/// Adapted from pola-rs [compression.rs:try_decompress_hadoop](https://pola-rs.github.io/polars/src/parquet2/compression.rs.html#225) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 nice to see attribution
Thanks again 😄 |
Benchmark runs are scheduled for baseline = 488eff0 and contender = 4f525fe. 4f525fe is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
Marking as api-change even though backwards compatible as does change the output, and so should be highlighted in changelog |
Which issue does this PR close?
Closes #2988 .
Rationale for this change
What changes are included in this PR?
It adds support for
LZ4_HADOOP
compression forCompressionCodec::LZ4
, with fallback toLZ4_FRAME
andLZ4_RAW
algorithm for backward compatibility with olderarrow-rs
versions and olderparquet-cpp
versions, respectively.It also adds tests for
LZ4_HADOOP
files, and tests for the fallback toLZ4_RAW
(a test including 2 files with same content, sameCompressionCodec::LZ4
, but different compression algorithmLZ4_HADOOP
andLZ4_RAW
).From now on,
CompressionCodec::LZ4
will compress files usingLZ4_HADOOP
algorithm, making LZ4 files generated by this library compatible with other languages arrow implementations.The user interface is changed to allow the user to configure whether to fallback or not, by allowing the user to disable backward compatibility with
no_backward_compatible_lz4
(WIP in this PR).Are there any user-facing changes?
The user interface is changed to allow the user to configure whether to fallback or not, by allowing the user to disable backward compatibility with
no_backward_compatible_lz4
(WIP in this PR).New structs have been created:
SerializedRowGroupReaderOptions
andSerializedPageReaderOptions
to allow the users to set theCodecOptions
forSerializedRowGroupReader
andSerializedPageReader
. The builders for these structs are also added to public interface.SerializedRowGroupReader::new_with_options
andSerializedPageReader::new_with_options
have been created to allow to configure these components without breaking public API.NOTE: The API changes are WIP, however, it is not possible to follow the same approach that with
WriterProperties
and wrapReadOptions
in anArc
as its fieldpredicates: Vec<ReadGroupPredicate>
is not thread safe. Other option would be to wrap it in just anRc
sotype ReadOptionsPtr = Rc<ReadOptions>
.