-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter #35886
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems parquet part change all compression level to options. Do we need to keep api compatible?
1e5c232
to
fdfb1e8
Compare
cpp/src/parquet/column_writer.h
Outdated
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter { | |||
|
|||
static std::unique_ptr<PageWriter> Open( | |||
std::shared_ptr<ArrowOutputStream> sink, Compression::type codec, | |||
int compression_level, ColumnChunkMetaDataBuilder* metadata, | |||
int16_t row_group_ordinal = -1, int16_t column_chunk_ordinal = -1, | |||
const std::shared_ptr<CodecOptions>& codec_options, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be put to the end for capability
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, unless the PageWriter
needs to keep ownership of the codec options, this should simply be const CodecOptions& = {}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the comments above, should be consolidate Compression::type codec
and const std::shared_ptr<CodecOptions>& codec_options
? codec_options should know codec.
cpp/src/parquet/properties.h
Outdated
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties { | |||
|
|||
int compression_level() const { return compression_level_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since Codec
contains compression_level_
, would this be duplicated or useless?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have a compression_level
and a CodecOptions
. Previously, user can set compression_level
, and the compression level would be changed.
After this patch, if user change the compression level, the codec will not be changed.
Can we make CodecOptions
mutate it's compression_level
, and make set_compression_level
set the level in CodecOptions
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just get it from codec_options_
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this. compression_level_
is duplicated.
cpp/src/arrow/util/compression.h
Outdated
@@ -124,7 +168,9 @@ class ARROW_EXPORT Codec { | |||
|
|||
/// \brief Create a codec for the given compression algorithm | |||
static Result<std::unique_ptr<Codec>> Create( | |||
Compression::type codec, int compression_level = kUseDefaultCompressionLevel); | |||
Compression::type codec, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change. It would be better to add a new overload. Then implement this one based on the new overload and label it as deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that could solve the compatibility issue of different APIs. Just to make it clear, you suggest marking the newly added API with CodecOptions as deprecated right? Or the previous one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean deprecating the existing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, that makes sense :)
cpp/src/parquet/column_writer.cc
Outdated
if (buffered_row_group) { | ||
return std::unique_ptr<PageWriter>(new BufferedPageWriter( | ||
std::move(sink), codec, compression_level, metadata, row_group_ordinal, | ||
std::move(sink), codec, codec_options, metadata, row_group_ordinal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we are able to move codec
into codec_options
as well. I don't have a preference here.
#include <cstdint> | ||
#include <cstring> | ||
#include <iostream> | ||
#include <memory> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those changes shouldn't be necessary? (also, why iostream
?)
@@ -221,6 +226,13 @@ class BrotliCodec : public Codec { | |||
return ptr; | |||
} | |||
|
|||
Status Init() override { | |||
if (window_bits_ < BROTLI_MIN_WINDOW_BITS || window_bits_ > BROTLI_MAX_WINDOW_BITS) { | |||
return Status::Invalid("window_bits should be within 10 ~ 24"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the error message respect the macro values:
return Status::Invalid("window_bits should be within 10 ~ 24"); | |
return Status::Invalid("window_bits should be between ", BROTLI_MIN_WINDOW_BITS, | |
" and ", BROTLI_MAX_WINDOW_BITS); |
@@ -17,15 +17,14 @@ | |||
|
|||
#include "arrow/util/compression_internal.h" | |||
|
|||
#include <zconf.h> | |||
#include <zlib.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, but can you undo the include ordering change? It's a bit gratuitous. Also, it makes things more readable to separate stdlib imports from third-party library imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that might be due to some local format auto-change, will revert those.
int CompressionWindowBitsForFormat(GZipFormat::type format, int input_window_bits) { | ||
int window_bits = input_window_bits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit convoluted
int CompressionWindowBitsForFormat(GZipFormat::type format, int input_window_bits) { | |
int window_bits = input_window_bits; | |
int CompressionWindowBitsForFormat(GZipFormat::type format, int window_bits) { |
cpp/src/parquet/types.h
Outdated
@@ -487,6 +487,12 @@ bool IsCodecSupported(Compression::type codec); | |||
PARQUET_EXPORT | |||
std::unique_ptr<Codec> GetCodec(Compression::type codec); | |||
|
|||
PARQUET_EXPORT | |||
std::unique_ptr<Codec> GetCodec(Compression::type codec, | |||
const std::shared_ptr<CodecOptions>& codec_options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const CodecOptions&
cpp/src/parquet/types.h
Outdated
const std::shared_ptr<CodecOptions>& codec_options); | ||
|
||
/// \deprecated and left for backwards compatibility. | ||
ARROW_DEPRECATED("Use CodecOptions to create Codec") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'm not sure it's useful to deprecate this one.
r/src/compression.cpp
Outdated
@@ -23,7 +23,8 @@ | |||
// [[arrow::export]] | |||
std::shared_ptr<arrow::util::Codec> util___Codec__Create(arrow::Compression::type codec, | |||
R_xlen_t compression_level) { | |||
return ValueOrStop(arrow::util::Codec::Create(codec, compression_level)); | |||
return ValueOrStop(arrow::util::Codec::Create( | |||
codec, std::make_shared<CodecOptions>(CodecOptions(compression_level)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can probably revert this.
@yyang52 There are Python test failures on CI. Also, is it possible to pass a base |
Yes, I think it supports that (as checked in TEST(TestCodecMisc, SpecifyCompressionLevel) to pass base CodecOptions to different Codecs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm general ok with the parquet side changes now, there are some comments.
cpp/src/arrow/util/compression.h
Outdated
// ---------------------------------------------------------------------- | ||
// brotli codec options implementation | ||
|
||
constexpr int kBrotliDefaultWindowBits = 22; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it use BROTLI_DEFAULT_WINDOW
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but then we need to include the brotli
header here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right. I'm not familiar with compression algorithms, so maybe I'm asking a stupid question. but I'm afraid that brorli might change it's default bits.
Would it be ok for a optional window_bits
, if it's set, use it, otherwise using default value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I think that's just how it is implemented now in GZip/BrotliCodecOptions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I notice that this patch hard code the encoder's internal into compression.h
, so I'm afraid if they're changed in the 3rd libraries, we'll get confused here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it seems better to use their BROTLI_DEFAULT_WINDOW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same with my previous comment. We can use BROTLI_DEFAULT_WINDOW
in the .cc
file
@@ -40,8 +40,8 @@ std::shared_ptr<Int64Writer> BuildWriter(int64_t output_size, | |||
ColumnDescriptor* schema, | |||
const WriterProperties* properties, | |||
Compression::type codec) { | |||
std::unique_ptr<PageWriter> pager = | |||
PageWriter::Open(dst, codec, Codec::UseDefaultCompressionLevel(), metadata); | |||
std::unique_ptr<PageWriter> pager = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fmt?
cpp/src/parquet/column_writer.cc
Outdated
if (buffered_row_group) { | ||
return std::unique_ptr<PageWriter>(new BufferedPageWriter( | ||
std::move(sink), codec, compression_level, metadata, row_group_ordinal, | ||
std::move(sink), codec, *codec_options.get(), metadata, row_group_ordinal, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just *codec_options
is ok?
cpp/src/parquet/properties.h
Outdated
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties { | |||
|
|||
int compression_level() const { return compression_level_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we have a compression_level
and a CodecOptions
. Previously, user can set compression_level
, and the compression level would be changed.
After this patch, if user change the compression level, the codec will not be changed.
Can we make CodecOptions
mutate it's compression_level
, and make set_compression_level
set the level in CodecOptions
directly?
May you help to trigger the remaining CI checks again? Thanks! @pitrou |
cpp/src/parquet/column_writer.h
Outdated
OffsetIndexBuilder* offset_index_builder = NULLPTR, | ||
const CodecOptions& codec_options = CodecOptions{}); | ||
|
||
ARROW_DEPRECATED("Use the one with CodecOptions instead") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ARROW_DEPRECATED("Use the one with CodecOptions instead") | |
ARROW_DEPRECATED("Deprecated in 13.0.0. Use CodecOptions-taking overload instead.") |
cpp/src/arrow/util/compression.h
Outdated
} | ||
virtual ~CodecOptions() = default; | ||
|
||
int compression_level_ = kUseDefaultCompressionLevel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot this, but style nit (this is a public member, it's weird to add a trailing underscore):
int compression_level_ = kUseDefaultCompressionLevel; | |
int compression_level = kUseDefaultCompressionLevel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yyang52 There are a couple of comments left to address, can you take a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parquet part looks ok to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Nice work!
@github-actions crossbow submit -g cpp |
Revision: 73b4018 Submitted crossbow builds: ursacomputing/crossbow @ actions-e0a8b0150b |
Thanks a lot for your contribution @yyang52 ! I hope you didn't mind the delays. |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a9035cb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
if (!codec_options) { | ||
pager = PageWriter::Open(sink_, properties_->compression(path), col_meta, | ||
row_group_ordinal_, static_cast<int16_t>(column_ordinal), | ||
properties_->memory_pool(), false, meta_encryptor, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part lost buffered_row_group_
, and set it to false. Sorry that I didn't found it at first time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for this missing and thanks for the fix!
…ression parameter (apache#35886) ### Rationale for this change Based on apache#35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer. Authored-by: Yang Yang [yang10.yang@ intel.com](mailto:yang10.yang@ intel.com) Co-authored-by: Rambacher, Mark [mark.rambacher@ intel.com](mailto:mark.rambacher@ intel.com) ### What changes are included in this PR? Add CodecOptions and replace `compression_level` when creating the Codec. The design is basically based on previous discussions. ### Are these changes tested? Yes ### Are there any user-facing changes? Yes, when user creates the `WriterProperties` * Closes: apache#35287 Lead-authored-by: yyang52 <yang10.yang@intel.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
… Parquet column (#38025) ### Rationale for this change After the changes in #35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: #38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
… for a Parquet column (apache#38025) ### Rationale for this change After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: apache#38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
… for a Parquet column (apache#38025) ### Rationale for this change After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: apache#38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
… for a Parquet column (apache#38025) ### Rationale for this change After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set ### What changes are included in this PR? Adds a null check on the codec options of the column properties before trying to access the compression level. ### Are these changes tested? Yes, I added a unit test. ### Are there any user-facing changes? This fixes a regression added after 13.0.0 so isn't a user-facing fix * Closes: apache#38039 Authored-by: Adam Reeve <adreeve@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Based on #35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.
Authored-by: Yang Yang yang10.yang@intel.com
Co-authored-by: Rambacher, Mark mark.rambacher@intel.com
What changes are included in this PR?
Add CodecOptions and replace
compression_level
when creating the Codec. The design is basically based on previous discussions.Are these changes tested?
Yes
Are there any user-facing changes?
Yes, when user creates the
WriterProperties