Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-35775: [Go][Parquet] Allow key value file metadata to be written after writing row groups #37786

Merged
merged 2 commits into from
Sep 21, 2023

Conversation

tschaub
Copy link
Contributor

@tschaub tschaub commented Sep 19, 2023

Rationale for this change

The key value file metadata may include information generated while writing row groups. Currently, it is not possible to set the key value file metadata after creating a writer. With the changes in this branch, key value pairs may be added any time before closing the writer.

What changes are included in this PR?

This branch adds a writer.AppendKeyValueMetadata(key, value) method to the parquet file.Writer and to the pqarrow.FileWriter.

Are these changes tested?

Tests are added for the new functionality.

Are there any user-facing changes?

The KeyValueMetadata field on the parquet file.Writer has been renamed to initialKeyValueMetadata. This is a breaking change. Although the field was exported, setting it did not result in new key value metadata being written. Instead, it represented the initial key value metadata if the writer was passed the WithWriteMetadata write option.

The WithWriteMetadata option can still be used to provide the initial key value metadata values. In addition, the AppendKeyValueMetadata method can be called to add key value pairs after creating a writer.

The FileMetadata field on the parquet file.Writer has been removed. Previously, setting this field value had no effect.

This PR includes breaking changes to public APIs.

The KeyValueMetadata field is no longer exported from the parquet file.Writer struct. Use the WithWriteMetadata writer option to set key value metadata when creating a writer or use the AppendKeyValueMetadata method to add key value metadata after creating a writer.

The FileMetadata field on the parquet file.Writer has been removed.

@github-actions
Copy link

⚠️ GitHub issue #35775 has been automatically assigned in GitHub to PR creator.

@zeroshade zeroshade added Breaking Change Includes a breaking change to the API Component: Parquet labels Sep 19, 2023
@zeroshade zeroshade changed the title GH-35775: [Go] Allow key value file metadata to be written after writing row groups GH-35775: [Go][Parquet] Allow key value file metadata to be written after writing row groups Sep 19, 2023
Comment on lines 35 to 41
props *parquet.WriterProperties
rowGroups int
nrows int
metadata metadata.FileMetaDataBuilder
fileEncryptor encryption.FileEncryptor
rowGroupWriter *rowGroupWriter
initialKeyValueMetadata metadata.KeyValueMetadata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we just initialize the FileMetaDataBuilder with the initialKeyValueMetadata, we don't actually need to store it after we initialize things.

We could probably modify the WriteOption and how the config works by introducing a config struct like type config struct { props *parquet.WriterProperties; initialKVMetadata metadata.KeyValueMetadata } and have WriteOption modify the config struct to initialize the writer. This would let us remove the initialKeyValueMetadata member entirely from the Writer.

It's not necessary (things work just fine currently and I think this is good) but it's an idea.

Copy link
Contributor Author

@tschaub tschaub Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zeroshade - I'm happy to make the change to the WriteOption interface if you think that makes sense. I wasn't sure how much of a breaking change would be acceptable.

I could add tschaub@9e5f7c1 to this branch if that sounds good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, let's add that to this branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit added. It looks like the failing Dev workflow is unrelated.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 19, 2023
Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zeroshade zeroshade merged commit 6eb08dd into apache:main Sep 21, 2023
22 of 25 checks passed
@zeroshade zeroshade removed the awaiting change review Awaiting change review label Sep 21, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Sep 21, 2023
@tschaub tschaub deleted the file-metadata branch September 21, 2023 02:17
@tschaub
Copy link
Contributor Author

tschaub commented Sep 21, 2023

Thanks, @zeroshade

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 6eb08dd.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

etseidl pushed a commit to etseidl/arrow that referenced this pull request Sep 28, 2023
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <tim@planet.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <tim@planet.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <tim@planet.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…tten after writing row groups (apache#37786)

### Rationale for this change

The key value file metadata may include information generated while writing row groups.  Currently, it is not possible to set the key value file metadata after creating a writer.  With the changes in this branch, key value pairs may be added any time before closing the writer.

### What changes are included in this PR?

This branch adds a `writer.AppendKeyValueMetadata(key, value)` method to the parquet `file.Writer` and to the `pqarrow.FileWriter`.

### Are these changes tested?

Tests are added for the new functionality.

### Are there any user-facing changes?

The `KeyValueMetadata` field on the parquet `file.Writer` has been renamed to `initialKeyValueMetadata`.  This is a breaking change.  Although the field was exported, setting it did not result in new key value metadata being written.  Instead, it represented the initial key value metadata if the writer was passed the `WithWriteMetadata` write option.

The `WithWriteMetadata` option can still be used to provide the initial key value metadata values.  In addition, the `AppendKeyValueMetadata` method can be called to add key value pairs after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.  Previously, setting this field value had no effect.

**This PR includes breaking changes to public APIs.** 

The `KeyValueMetadata` field is no longer exported from the parquet `file.Writer` struct.  Use the `WithWriteMetadata` writer option to set key value metadata when creating a writer or use the `AppendKeyValueMetadata` method to add key value metadata after creating a writer.

The `FileMetadata` field on the parquet `file.Writer` has been removed.
* Closes: apache#35775

Authored-by: Tim Schaub <tim@planet.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting merge Awaiting merge Breaking Change Includes a breaking change to the API Component: Go
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Go][Parquet] pqarrow.FileWriter allow adding parquet metadata after writing rowgroups
2 participants