
GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData #41580

Merged · 22 commits into apache:main · Aug 15, 2024

Conversation

clee704
Contributor

@clee704 clee704 commented May 8, 2024

Rationale for this change

The Parquet standard allows reading/writing key-value metadata from/to ColumnChunkMetaData, but there is no way to do that with Parquet C++.

What changes are included in this PR?

Support reading/writing key-value metadata from/to ColumnChunkMetaData with Parquet C++ reader/writer. Support reading key-value metadata from ColumnChunkMetaData with pyarrow.parquet.

Are these changes tested?

Yes, unit tests are added

Are there any user-facing changes?

Yes.

  • Users can read or write key-value metadata for column chunks with Parquet C++ (a usage sketch follows this list).
  • Users can read key-value metadata for column chunks with PyArrow.
  • The parquet-reader tool prints key-value metadata in column chunks when the --print-key-value-metadata option is used.
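
For illustration, here is a minimal sketch (not part of the PR) of how the new API might be used with the low-level Parquet C++ writer and reader. It assumes AddKeyValueMetadata() on the column writer and key_value_metadata() on ColumnChunkMetaData, as shown in the review snippets further down; the file name and the lack of error handling are illustrative only.

#include <cstdint>
#include <iostream>
#include <memory>

#include <arrow/io/file.h>
#include <arrow/util/key_value_metadata.h>
#include <parquet/api/reader.h>
#include <parquet/api/writer.h>

int main() {
  using parquet::schema::GroupNode;
  using parquet::schema::PrimitiveNode;

  // One required INT32 column.
  parquet::schema::NodeVector fields;
  fields.push_back(PrimitiveNode::Make("x", parquet::Repetition::REQUIRED,
                                       parquet::Type::INT32));
  auto schema = std::static_pointer_cast<GroupNode>(
      GroupNode::Make("schema", parquet::Repetition::REQUIRED, fields));

  // Write a single row group and attach key-value metadata to its column chunk.
  auto outfile =
      arrow::io::FileOutputStream::Open("kv_example.parquet").ValueOrDie();
  auto file_writer = parquet::ParquetFileWriter::Open(outfile, schema);
  auto* rg_writer = file_writer->AppendRowGroup();
  auto* col_writer = static_cast<parquet::Int32Writer*>(rg_writer->NextColumn());
  int32_t value = 42;
  col_writer->WriteBatch(1, nullptr, nullptr, &value);
  // New in this PR: per-column-chunk key-value metadata.
  col_writer->AddKeyValueMetadata(
      arrow::KeyValueMetadata::Make({"hello"}, {"world"}));
  file_writer->Close();

  // Read the metadata back from the ColumnChunkMetaData.
  auto reader = parquet::ParquetFileReader::OpenFile("kv_example.parquet");
  auto file_md = reader->metadata();
  auto rg_md = file_md->RowGroup(0);
  auto chunk_md = rg_md->ColumnChunk(0);
  auto kv = chunk_md->key_value_metadata();  // also new in this PR
  if (kv != nullptr && kv->size() > 0) {
    std::cout << kv->key(0) << " -> " << kv->value(0) << std::endl;  // hello -> world
  }
  return 0;
}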


github-actions bot commented May 8, 2024

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}


@clee704 clee704 changed the title Support reading/writing key-value metadata from/to ColumnMetaData [GH-41579] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@clee704 clee704 changed the title [GH-41579] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024
@clee704 clee704 changed the title GH-41579: [C++] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnMetaData May 8, 2024

github-actions bot commented May 8, 2024

⚠️ GitHub issue #41579 has been automatically assigned in GitHub to PR creator.

@clee704 clee704 changed the title GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnMetaData GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnChunkMetaData May 8, 2024
@clee704 clee704 marked this pull request as ready for review May 8, 2024 05:00
@clee704 clee704 requested a review from wgtmac as a code owner May 8, 2024 05:00
@mapleFU mapleFU changed the title GH-41579: [C++][Python] Support reading/writing key-value metadata from/to ColumnChunkMetaData GH-41579: [C++][Python][Parquet] Support reading/writing key-value metadata from/to ColumnChunkMetaData May 8, 2024
Member

@mapleFU mapleFU left a comment


The general idea LGTM

cc @pitrou @emkornfield @wgtmac

cpp/src/parquet/metadata.cc (outdated)
@github-actions github-actions bot added the 'awaiting committer review' label and removed the 'awaiting review' label May 12, 2024
@mapleFU mapleFU requested a review from pitrou May 12, 2024 05:46
Member

wgtmac commented May 12, 2024

I saw a related issue: https://issues.apache.org/jira/browse/PARQUET-1648. It seems that parquet-mr does not use it yet.

Member

mapleFU commented May 12, 2024

It seems that parquet-mr does not use it yet.

Can parquet-mr read that?

Member

mapleFU commented May 12, 2024

🤔 At least this patch is OK, and it seems other implementations have the Thrift field, so it doesn't break the standard.

Member

wgtmac commented May 12, 2024

I agree. It should not block us from implementing this.

cpp/src/parquet/metadata.cc (outdated)
cpp/src/parquet/column_writer.h (outdated)
cpp/src/parquet/column_writer.h (outdated)
cpp/src/parquet/column_writer.h (outdated)
cpp/src/parquet/column_writer.h (outdated)
cpp/src/parquet/metadata.h (outdated)
cpp/src/parquet/printer.cc
python/pyarrow/_parquet.pyx
python/pyarrow/tests/parquet/test_metadata.py (outdated)
python/pyarrow/tests/parquet/test_metadata.py (outdated)
Member

pitrou commented May 13, 2024

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

Contributor Author

clee704 commented May 29, 2024

@clee704 Could you please submit a test file to https://github.com/apache/parquet-testing instead of adding it in this PR?

Created apache/parquet-testing#49

Member

mapleFU commented Aug 3, 2024

@clee704 Sorry for the delay; I've updated this patch myself. This patch LGTM, and I'm willing to merge it before August 6th. Would you mind re-checking it?

Member

mapleFU commented Aug 4, 2024

Will check the Cython build failure:

 Error compiling Cython file:
  ------------------------------------------------------------
  ...
          return self.metadata.GetColumnIndexLocation().has_value()

      @property
      def metadata(self):
          """Additional metadata as key value pairs (dict[bytes, bytes])."""
          wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
                    ^
  ------------------------------------------------------------

  pyarrow/_parquet.pyx:514:18: undeclared name not builtin: pyarrow_wrap_metadata

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
          return self.metadata.GetColumnIndexLocation().has_value()

      @property
      def metadata(self):
          """Additional metadata as key value pairs (dict[bytes, bytes])."""
          wrapped = pyarrow_wrap_metadata(self.metadata.key_value_metadata())
                                                                          ^
  ------------------------------------------------------------

  pyarrow/_parquet.pyx:514:72: Cannot convert 'shared_ptr[const CKeyValueMetadata]' to Python object
  [17/136] Compiling Cython CXX source for _dataset...
  performance hint: pyarrow/_dataset.pyx:1947:35: Exception check after calling 'GetResultValue[shared_ptr[CRandomAccessFile]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRandomAccessFile]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3343:41: Exception check after calling 'GetResultValue[shared_ptr[CRecordBatch]]' will always require the GIL to be acquired. Declare 'GetResultValue[shared_ptr[CRecordBatch]]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3387:34: Exception check after calling 'GetResultValue[CTaggedRecordBatch]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatch]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  performance hint: pyarrow/_dataset.pyx:3784:42: Exception check after calling 'GetResultValue[CTaggedRecordBatchIterator]' will always require the GIL to be acquired. Declare 'GetResultValue[CTaggedRecordBatchIterator]' as 'noexcept' if you control the definition and you're sure you don't want the function to raise exceptions.
  [18/136] Building CXX object CMakeFiles/arrow_python.dir/pyarrow/src/arrow/python/datetime.cc.o

@mapleFU mapleFU requested a review from pitrou August 6, 2024 02:32
Member

mapleFU commented Aug 7, 2024

@wgtmac @pitrou Do you have any more comments here? Let's move forward now.

Member

mapleFU commented Aug 13, 2024

Will merge this week if there are no negative comments.

Member

@wgtmac wgtmac left a comment


+1 on the C++ side

Member

mapleFU commented Aug 15, 2024

@pitrou Would you mind taking a look at this? I'm planning to merge this PR this week if there are no negative comments.


TEST_F(TestInt32Writer, WriteKeyValueMetadata) {
  auto writer = this->BuildWriter();
  writer->AddKeyValueMetadata(KeyValueMetadata::Make({"hello"}, {"world"}));
Member

Can you add a test where AddKeyValueMetadata is called twice?
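
For context, a hypothetical sketch of what such a test could look like. It assumes later calls merge with earlier key-value pairs, and the key_value_metadata_from_writer() accessor is made up for illustration; the real fixture may expose the resulting ColumnChunkMetaData differently.

TEST_F(TestInt32Writer, WriteKeyValueMetadataTwice) {
  auto writer = this->BuildWriter();
  writer->AddKeyValueMetadata(KeyValueMetadata::Make({"hello"}, {"world"}));
  writer->AddKeyValueMetadata(KeyValueMetadata::Make({"good"}, {"bye"}));
  writer->Close();
  // Hypothetical accessor: however the fixture exposes the written column chunk metadata.
  auto kv = this->key_value_metadata_from_writer();
  ASSERT_NE(kv, nullptr);
  // Assuming merge semantics: both pairs survive.
  ASSERT_EQ(kv->size(), 2);
}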

@@ -135,6 +135,36 @@ std::shared_ptr<Statistics> MakeColumnStats(const format::ColumnMetaData& meta_d
  throw ParquetException("Can't decode page statistics for selected column type");
}

template <typename Metadata>
std::shared_ptr<KeyValueMetadata> CopyKeyValueMetadata(const Metadata& source) {
Member

Perhaps call this FromThriftKeyValueMetadata?

@pitrou pitrou self-requested a review August 15, 2024 14:34
Member

pitrou commented Aug 15, 2024

Sorry, I approved, but the two comments should probably be addressed before merging this.

@mapleFU mapleFU merged commit 2767dc5 into apache:main Aug 15, 2024
33 of 34 checks passed
@mapleFU mapleFU removed the 'awaiting committer review' label Aug 15, 2024
Member

mapleFU commented Aug 15, 2024

Thanks all, merged!


After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 2767dc5.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

Comment on lines +31 to +37
@pytest.fixture(scope='module')
def parquet_test_datadir():
    result = os.environ.get('PARQUET_TEST_DATA')
    if not result:
        raise RuntimeError('Please point the PARQUET_TEST_DATA environment '
                           'variable to the test data directory')
    return pathlib.Path(result)
Member

This is the first time we've introduced a pyarrow test that requires this to be set up. Do we want to require that strictly, or should we skip the test if the env variable is not set?

In any case, this caused a failure in one of the nightly crossbow builds which doesn't have this env variable set (python-emscripten, https://github.com/ursacomputing/crossbow/actions/runs/10463078918/job/28974494645#step:7:15654)

Member

Both are OK for me, but I prefer reading it 🤔

@github-actions github-actions bot added the 'awaiting changes' label Aug 20, 2024
jorisvandenbossche added a commit that referenced this pull request Aug 22, 2024
…43786)

### Rationale for this change

Starting with #41580, the pyarrow tests now also rely on a file in the parquet-testing submodule, and the path to that directory is controlled by `PARQUET_TEST_DATA`, which appears to be set incorrectly in the wheel test scripts, causing all wheel builds to fail at the moment.
 
* GitHub Issue: #43785

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
kou pushed a commit that referenced this pull request Sep 18, 2024
…on emscripten (#43906)

### Rationale for this change

The following PR:
- #41580

made it mandatory for a test to have the `PARQUET_TEST_DATA` environment variable defined.

This is currently not available from `python_test_emscripten.sh`, as we would need to mount the filesystem for both Node and ChromeDriver.

### What changes are included in this PR?

Skip the test that requires `PARQUET_TEST_DATA` on Emscripten.

### Are these changes tested?

Via archery

### Are there any user-facing changes?

No
* GitHub Issue: #43905
* GitHub Issue: #43868

Lead-authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>