Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value #561

Merged
merged 2 commits into from
Aug 25, 2020

Conversation

qinghui-xu
Copy link
Contributor

No description provided.

@qinghui-xu qinghui-xu changed the title Parquet-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value Nov 30, 2018
@julienledem
Copy link
Member

julienledem commented Dec 5, 2018

@BenoitHanotte: if you could give a code review and approve that'd be great

Copy link

@BenoitHanotte BenoitHanotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I have been away for the past two weeks.
I had the opportunity to review this commit in our fork already, thanks @qinghui-xu for adding the schema evolution tests! :)
The commit looks good to me as we have previously addressed most of my comments 👍

I have only a few minor comments on the PR: I would prefer not to commit the protoc maven plugin changes with this commit (they are also in #506), and there are a few changes to the licence headers due to IntelliJ that we could maybe filter out of the PR.

Thanks for your time on this @qinghui-xu !

<artifactId>maven-antrun-plugin</artifactId>
<groupId>com.github.os72</groupId>
<artifactId>protoc-jar-maven-plugin</artifactId>
<version>3.6.0.1</version>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this version match the protobuf version or is the plugin version unrelated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protoc (protobuf version) is defined by the plugin configuration protocVersion, so any recent version of plugin should be good.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, I just wanted to be sure

@@ -159,28 +159,24 @@
</plugin>

<plugin>
<artifactId>maven-antrun-plugin</artifactId>
<groupId>com.github.os72</groupId>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change in configuration seems to be a duplicate of #506
If we can, I would suggest keeping it distinct from this PR as it addresses different needs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I'm trying to have #506 merged before this PR.

Path file = writeMessages(dataV2);
List<TestProto3SchemaV1.MessageSchema> messagesV1 = readMessages(file, TestProto3SchemaV1.MessageSchema.class);
assertEquals(messagesV1.size(), 1);
assertEquals(messagesV1.get(0).getOptionalLabelNumberPairValue(), 2);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to check that the label is returned as UNKNOWN_ENUM_VALUE_2 ? As this mechanism is used to ensure that we can then correctly read the unknown enum and get its number.

Copy link
Contributor Author

@qinghui-xu qinghui-xu Dec 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we need to test the label UNKNOWN_ENUM_VALUE_*, it is generated only when calling protobuf reflection API, and we are interested mainly in being able to get the number back.
It should be a test for protobuf reflection API that tests by passing an unknown number to an enum field, we can get a label UNKNOWN_ENUM_VALUE_* by using protobuf reflection API. Less relevant to test it here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, got it, thanks!

@BenoitHanotte
Copy link

BenoitHanotte commented Dec 18, 2018

@qinghui-xu Could you please remove the maven plugin part from this PR? Once it is removed I don't see anything left to address before we can merge 👍
Thanks

Copy link

@BenoitHanotte BenoitHanotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but we need to merge the maven change #506 first 👍

@qinghui-xu
Copy link
Contributor Author

Rebased on the lastest upstream master (with the maven-protoc-plugin) and force-pushed to the PR.

Copy link

@BenoitHanotte BenoitHanotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, no new change with the rebase since my last review

LOG.info("No enum is written for " + enumNameNumberMapping.getKey());
}
int idx = 0;
for (Map.Entry<String, Integer> nameNumberPair : enumNameNumberMapping.getValue().entrySet()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since Parquet already depends on Guava, we can use Guava's Joiner class to stringify the map to spare this for loop, something like this:

Joiner.on(METADATA_ENUM_ITEM_SEPARATOR)
        .withKeyValueSeparator(METADATA_ENUM_KEY_VALUE_SEPARATOR)
        .join(enumNameNumberMapping.getValue())

I couldn't find anything to simplify the reading path. Just an idea, maybe it is better not to depend on 3rd party libs too much. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to not use guava if possible

Copy link

@costimuraru costimuraru Feb 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to make the method easier to read, if we're not going with guava (which I do like, though I don't see it being used in parquet-proto), we could extract the joining part into a dedicated method. Something like:

  @Override
  public FinalizedWriteContext finalizeWrite() {
    Map<String, String> protoMetadata = new HashMap<>();
    for (Map.Entry<String, Map<String, Integer>> enumNameNumberMapping : protoEnumBookKeeper.entrySet()) {
      
      if (enumNameNumberMapping.getValue().isEmpty()) {
        // No enum is ever written to any column of this file, put an empty string as the value in the metadata
        LOG.info("No enum is written for " + enumNameNumberMapping.getKey());
      }
      
      String nameNumberPairs = joinNameNumberPairs(enumNameNumberMapping.getValue());
      protoMetadata.put(METADATA_ENUM_PREFIX + enumNameNumberMapping.getKey(), nameNumberPairs);
    }
    return new FinalizedWriteContext(protoMetadata);
  }

  public String joinNameNumberPairs(Map<String, Integer> nameNumberMap) {
    StringBuilder nameNumberPairs = new StringBuilder();
    int idx = 0;
    for (Map.Entry<String, Integer> nameNumberPair : nameNumberMap.entrySet()) {
      nameNumberPairs.append(nameNumberPair.getKey())
        .append(METADATA_ENUM_KEY_VALUE_SEPARATOR)
        .append(nameNumberPair.getValue());
      idx ++;
      if (idx < nameNumberMap.size()) {
        nameNumberPairs.append(METADATA_ENUM_ITEM_SEPARATOR);
      }
    }
    return nameNumberPairs;
  }

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. @costimuraru
I refactor out all the logic of building enum metadata in a method. I think this would be more clear.

Copy link
Contributor

@nandorKollar nandorKollar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

@BenoitHanotte BenoitHanotte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@browles
Copy link

browles commented Aug 6, 2020

Hey guys, what is the status on this PR? We are currently experiencing this issue.

@gszadovszky
Copy link
Contributor

Seems like the guys missed to push it after approving.
@qinghui-xu, are you willing to pick this up again and resolve the conflicts?

@qinghui-xu
Copy link
Contributor Author

Hey, guys.
Will be glad to work on it. Would try to push another patch at the beginning of next week.

Protobuf can set enum field using number, while a number does not
match any enum value defined in the schema, it is still accepted
and a label "UNKNOWN_ENUM_<enumName>_<number>" is generated when
we use protobuf reflection API (proto descriptors) to access it.
And in parquet-protobuf, we rely on protobuf reflection API to
convert forward/backward between the two world.

There are two cases of unknown enum while using parquet-protobuf:
  1. Protobuf already contains unknown enum when we write it to
  parquet (eg1. sometmes people set enum fields using numbers; eg2
  writer deserialize data from wire and the sender can have a newer
  version of proto schema with new enum values). The behavior of
  parquet-protobuf writer as before this patch is to write a label
  "UNKNOWN_ENUM_<number>" as string in the enum column of parquet.
  And when we read it back as protobuf, we found this unknown label
  which does not match any enum def (even with the same schema as
  the sender in eg2)
  2. Protobuf contains valid value when write to parquet, but the
  reader uses an outdated proto schema which misses some enum
  values. So the not-in-old-schema enum values are "unknown" to the
  reader.

Previous behavior of parquet-proto reader is to reject in both
cases with some runtime exception.

To be able to handle the problems:
We keep enum (name -> number) mapping in the parquet metadata, so
that in read time, reader can discover the number and use protobuf
reflection API to set enum number.
Keep in mind though, for the case reading enum with outdated schema
(case 2), the enum read back will have the right number, but the
label is set to "UNKNOW_ENUM_<number>". So this feature is helpful
only if the user is using number to manipulate enum data.
And for old data containing "true" unknown value (thus case 1)
created before this patch (thus name -> number mapping is not
available), we now try to parse the string regarding to the
"UNKNOWN_ENUM_<number>" pattern.
If we read old data created before this patch (thus name -> number
is not available), with an outdated schema, and we find some enum
value not defined in the schema nor following "UNKNOWN_ENUM_*"
pattern, we could either fail the job by raising an exception or
treat the value as unknown enum with number -1, by setting a flag
in the configuration.

The name -> number mapping is a new metadata under the
"parquet.proto.enum" namespace. The metadata for protobuf enum
(label:number) mapping should follow some specific pattern, throw
BadConfigurationException in read time if it is not.

Tests for enum schema evolution (read/write with different protobuf
schema) are added.
@Fokko
Copy link
Contributor

Fokko commented Aug 24, 2020

@qinghui-xu The CI is experiencing some connectivity issues, could you rebase against master to retrigger the build?

@qinghui-xu
Copy link
Contributor Author

@Fokko It's already on top of the master branch. Could you retrigger the CI by hand somehow?

@gszadovszky
Copy link
Contributor

Trying to re-trigger Travis by close-reopen

@gszadovszky gszadovszky reopened this Aug 24, 2020
@gszadovszky gszadovszky merged commit 3d45fd5 into apache:master Aug 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants