Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35287: [C++][Parquet] Add CodecOptions to customize the compression parameter #35886

Merged
merged 16 commits into from
Jul 11, 2023

Conversation

yyang52
Copy link
Contributor

@yyang52 yyang52 commented Jun 2, 2023

Rationale for this change

Based on #35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.

Authored-by: Yang Yang yang10.yang@intel.com
Co-authored-by: Rambacher, Mark mark.rambacher@intel.com

What changes are included in this PR?

Add CodecOptions and replace compression_level when creating the Codec. The design is basically based on previous discussions.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, when user creates the WriterProperties

@github-actions
Copy link

github-actions bot commented Jun 2, 2023

⚠️ GitHub issue #35287 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems parquet part change all compression level to options. Do we need to keep api compatible?

cpp/src/parquet/properties.h Show resolved Hide resolved
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 2, 2023
@yyang52 yyang52 force-pushed the codec_options branch 2 times, most recently from 1e5c232 to fdfb1e8 Compare June 5, 2023 02:55
@@ -87,8 +88,9 @@ class PARQUET_EXPORT PageWriter {

static std::unique_ptr<PageWriter> Open(
std::shared_ptr<ArrowOutputStream> sink, Compression::type codec,
int compression_level, ColumnChunkMetaDataBuilder* metadata,
int16_t row_group_ordinal = -1, int16_t column_chunk_ordinal = -1,
const std::shared_ptr<CodecOptions>& codec_options,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be put to the end for capability

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, unless the PageWriter needs to keep ownership of the codec options, this should simply be const CodecOptions& = {}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the comments above, should be consolidate Compression::type codec and const std::shared_ptr<CodecOptions>& codec_options? codec_options should know codec.

@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {

int compression_level() const { return compression_level_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Codec contains compression_level_, would this be duplicated or useless?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we have a compression_level and a CodecOptions. Previously, user can set compression_level, and the compression level would be changed.

After this patch, if user change the compression level, the codec will not be changed.

Can we make CodecOptions mutate it's compression_level, and make set_compression_level set the level in CodecOptions directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just get it from codec_options_?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. compression_level_ is duplicated.

@mapleFU
Copy link
Member

mapleFU commented Jun 5, 2023

cc @pitrou @wgtmac

@@ -124,7 +168,9 @@ class ARROW_EXPORT Codec {

/// \brief Create a codec for the given compression algorithm
static Result<std::unique_ptr<Codec>> Create(
Compression::type codec, int compression_level = kUseDefaultCompressionLevel);
Compression::type codec,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. It would be better to add a new overload. Then implement this one based on the new overload and label it as deprecated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that could solve the compatibility issue of different APIs. Just to make it clear, you suggest marking the newly added API with CodecOptions as deprecated right? Or the previous one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean deprecating the existing one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that makes sense :)

cpp/src/parquet/types.h Show resolved Hide resolved
if (buffered_row_group) {
return std::unique_ptr<PageWriter>(new BufferedPageWriter(
std::move(sink), codec, compression_level, metadata, row_group_ordinal,
std::move(sink), codec, codec_options, metadata, row_group_ordinal,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we are able to move codec into codec_options as well. I don't have a preference here.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <memory>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those changes shouldn't be necessary? (also, why iostream?)

@@ -221,6 +226,13 @@ class BrotliCodec : public Codec {
return ptr;
}

Status Init() override {
if (window_bits_ < BROTLI_MIN_WINDOW_BITS || window_bits_ > BROTLI_MAX_WINDOW_BITS) {
return Status::Invalid("window_bits should be within 10 ~ 24");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the error message respect the macro values:

Suggested change
return Status::Invalid("window_bits should be within 10 ~ 24");
return Status::Invalid("window_bits should be between ", BROTLI_MIN_WINDOW_BITS,
" and ", BROTLI_MAX_WINDOW_BITS);

@@ -17,15 +17,14 @@

#include "arrow/util/compression_internal.h"

#include <zconf.h>
#include <zlib.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, but can you undo the include ordering change? It's a bit gratuitous. Also, it makes things more readable to separate stdlib imports from third-party library imports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that might be due to some local format auto-change, will revert those.

cpp/src/arrow/util/compression_zlib.cc Show resolved Hide resolved
Comment on lines 60 to 61
int CompressionWindowBitsForFormat(GZipFormat::type format, int input_window_bits) {
int window_bits = input_window_bits;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit convoluted

Suggested change
int CompressionWindowBitsForFormat(GZipFormat::type format, int input_window_bits) {
int window_bits = input_window_bits;
int CompressionWindowBitsForFormat(GZipFormat::type format, int window_bits) {

cpp/src/parquet/column_writer.h Show resolved Hide resolved
@@ -487,6 +487,12 @@ bool IsCodecSupported(Compression::type codec);
PARQUET_EXPORT
std::unique_ptr<Codec> GetCodec(Compression::type codec);

PARQUET_EXPORT
std::unique_ptr<Codec> GetCodec(Compression::type codec,
const std::shared_ptr<CodecOptions>& codec_options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const CodecOptions&

const std::shared_ptr<CodecOptions>& codec_options);

/// \deprecated and left for backwards compatibility.
ARROW_DEPRECATED("Use CodecOptions to create Codec")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm not sure it's useful to deprecate this one.

cpp/src/parquet/types.cc Show resolved Hide resolved
@@ -23,7 +23,8 @@
// [[arrow::export]]
std::shared_ptr<arrow::util::Codec> util___Codec__Create(arrow::Compression::type codec,
R_xlen_t compression_level) {
return ValueOrStop(arrow::util::Codec::Create(codec, compression_level));
return ValueOrStop(arrow::util::Codec::Create(
codec, std::make_shared<CodecOptions>(CodecOptions(compression_level))));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably revert this.

@pitrou
Copy link
Member

pitrou commented Jun 6, 2023

@yyang52 There are Python test failures on CI.

Also, is it possible to pass a base CodecOptions when a derived class exists? (e.g. for gzip). Ideally it should be supported.

@yyang52
Copy link
Contributor Author

yyang52 commented Jun 8, 2023

@yyang52 There are Python test failures on CI.

Also, is it possible to pass a base CodecOptions when a derived class exists? (e.g. for gzip). Ideally it should be supported.

Yes, I think it supports that (as checked in TEST(TestCodecMisc, SpecifyCompressionLevel) to pass base CodecOptions to different Codecs)

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm general ok with the parquet side changes now, there are some comments.

// ----------------------------------------------------------------------
// brotli codec options implementation

constexpr int kBrotliDefaultWindowBits = 22;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it use BROTLI_DEFAULT_WINDOW?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but then we need to include the brotli header here as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right. I'm not familiar with compression algorithms, so maybe I'm asking a stupid question. but I'm afraid that brorli might change it's default bits.

Would it be ok for a optional window_bits, if it's set, use it, otherwise using default value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think that's just how it is implemented now in GZip/BrotliCodecOptions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I notice that this patch hard code the encoder's internal into compression.h, so I'm afraid if they're changed in the 3rd libraries, we'll get confused here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it seems better to use their BROTLI_DEFAULT_WINDOW

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with my previous comment. We can use BROTLI_DEFAULT_WINDOW in the .cc file

@@ -40,8 +40,8 @@ std::shared_ptr<Int64Writer> BuildWriter(int64_t output_size,
ColumnDescriptor* schema,
const WriterProperties* properties,
Compression::type codec) {
std::unique_ptr<PageWriter> pager =
PageWriter::Open(dst, codec, Codec::UseDefaultCompressionLevel(), metadata);
std::unique_ptr<PageWriter> pager =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fmt?

if (buffered_row_group) {
return std::unique_ptr<PageWriter>(new BufferedPageWriter(
std::move(sink), codec, compression_level, metadata, row_group_ordinal,
std::move(sink), codec, *codec_options.get(), metadata, row_group_ordinal,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just *codec_options is ok?

cpp/src/arrow/util/compression.h Show resolved Hide resolved
@@ -192,6 +197,8 @@ class PARQUET_EXPORT ColumnProperties {

int compression_level() const { return compression_level_; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we have a compression_level and a CodecOptions. Previously, user can set compression_level, and the compression level would be changed.

After this patch, if user change the compression level, the codec will not be changed.

Can we make CodecOptions mutate it's compression_level, and make set_compression_level set the level in CodecOptions directly?

@yyang52
Copy link
Contributor Author

yyang52 commented Jul 6, 2023

May you help to review the code change as well as trigger the remaining CI checks?

May you help to trigger the remaining CI checks again? Thanks! @pitrou

OffsetIndexBuilder* offset_index_builder = NULLPTR,
const CodecOptions& codec_options = CodecOptions{});

ARROW_DEPRECATED("Use the one with CodecOptions instead")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ARROW_DEPRECATED("Use the one with CodecOptions instead")
ARROW_DEPRECATED("Deprecated in 13.0.0. Use CodecOptions-taking overload instead.")

}
virtual ~CodecOptions() = default;

int compression_level_ = kUseDefaultCompressionLevel;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot this, but style nit (this is a public member, it's weird to add a trailing underscore):

Suggested change
int compression_level_ = kUseDefaultCompressionLevel;
int compression_level = kUseDefaultCompressionLevel;

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yyang52 There are a couple of comments left to address, can you take a look?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet part looks ok to me

@yyang52
Copy link
Contributor Author

yyang52 commented Jul 7, 2023

@yyang52 There are a couple of comments left to address, can you take a look?

Doing some fix, may you help to take a look and trigger remaining CI checks? Thanks! @pitrou

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice work!

@pitrou
Copy link
Member

pitrou commented Jul 11, 2023

@github-actions crossbow submit -g cpp

@github-actions
Copy link

Revision: 73b4018

Submitted crossbow builds: ursacomputing/crossbow @ actions-e0a8b0150b

Task Status
test-alpine-linux-cpp Github Actions
test-build-cpp-fuzz Github Actions
test-conda-cpp Github Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp Github Actions
test-debian-11-cpp-amd64 Github Actions
test-debian-11-cpp-i386 Github Actions
test-fedora-35-cpp Github Actions
test-ubuntu-20.04-cpp Github Actions
test-ubuntu-20.04-cpp-bundled Github Actions
test-ubuntu-20.04-cpp-minimal-with-formats Github Actions
test-ubuntu-20.04-cpp-thread-sanitizer Github Actions
test-ubuntu-22.04-cpp Github Actions
test-ubuntu-22.04-cpp-20 Github Actions

@pitrou pitrou merged commit a9035cb into apache:main Jul 11, 2023
29 of 34 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jul 11, 2023
@pitrou
Copy link
Member

pitrou commented Jul 11, 2023

Thanks a lot for your contribution @yyang52 ! I hope you didn't mind the delays.

@yyang52
Copy link
Contributor Author

yyang52 commented Jul 12, 2023

Thanks for your kind support! It's a pleasure to make contribution to Arrow community! @pitrou @mapleFU @wgtmac

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit a9035cb.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

if (!codec_options) {
pager = PageWriter::Open(sink_, properties_->compression(path), col_meta,
row_group_ordinal_, static_cast<int16_t>(column_ordinal),
properties_->memory_pool(), false, meta_encryptor,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part lost buffered_row_group_, and set it to false. Sorry that I didn't found it at first time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for this missing and thanks for the fix!

@github-actions github-actions bot added the awaiting committer review Awaiting committer review label Jul 23, 2023
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this pull request Aug 20, 2023
…ression parameter (apache#35886)

### Rationale for this change

Based on apache#35287, we'd like to add a CodecOptions to make more compression parameters (such as window_bits) customizable when creating the Codec for parquet writer.

Authored-by: Yang Yang [yang10.yang@ intel.com](mailto:yang10.yang@ intel.com)
Co-authored-by: Rambacher, Mark [mark.rambacher@ intel.com](mailto:mark.rambacher@ intel.com)

### What changes are included in this PR?

Add CodecOptions and replace `compression_level` when creating the Codec. The design is basically based on previous discussions. 
### Are these changes tested?

Yes
### Are there any user-facing changes?

Yes, when user creates the `WriterProperties`

* Closes: apache#35287

Lead-authored-by: yyang52 <yang10.yang@intel.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
pitrou pushed a commit that referenced this pull request Oct 5, 2023
… Parquet column (#38025)

### Rationale for this change

After the changes in #35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set

### What changes are included in this PR?

Adds a null check on the codec options of the column  properties before trying to access the compression level.

### Are these changes tested?

Yes, I added a unit test.

### Are there any user-facing changes?

This fixes a regression added after 13.0.0 so isn't a user-facing fix
* Closes: #38039

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
… for a Parquet column (apache#38025)

### Rationale for this change

After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set

### What changes are included in this PR?

Adds a null check on the codec options of the column  properties before trying to access the compression level.

### Are these changes tested?

Yes, I added a unit test.

### Are there any user-facing changes?

This fixes a regression added after 13.0.0 so isn't a user-facing fix
* Closes: apache#38039

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
… for a Parquet column (apache#38025)

### Rationale for this change

After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set

### What changes are included in this PR?

Adds a null check on the codec options of the column  properties before trying to access the compression level.

### Are these changes tested?

Yes, I added a unit test.

### Are there any user-facing changes?

This fixes a regression added after 13.0.0 so isn't a user-facing fix
* Closes: apache#38039

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
… for a Parquet column (apache#38025)

### Rationale for this change

After the changes in apache#35886, getting the compression level for a Parquet column segfaults if the compression level or other options weren't previously set

### What changes are included in this PR?

Adds a null check on the codec options of the column  properties before trying to access the compression level.

### Are these changes tested?

Yes, I added a unit test.

### Are there any user-facing changes?

This fixes a regression added after 13.0.0 so isn't a user-facing fix
* Closes: apache#38039

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Expose the API to fine tune the window_bits parameter of ZLIB compression
4 participants