-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-1455: [parquet-protobuf] Handle protobuf enum schema evolution and unknown enum value #561
Conversation
@BenoitHanotte: if you could give a code review and approve that'd be great |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I have been away for the past two weeks.
I had the opportunity to review this commit in our fork already, thanks @qinghui-xu for adding the schema evolution tests! :)
The commit looks good to me as we have previously addressed most of my comments 👍
I have only a few minor comments on the PR: I would prefer not to commit the protoc maven plugin changes with this commit (they are also in #506), and there are a few changes to the licence headers due to IntelliJ that we could maybe filter out of the PR.
Thanks for your time on this @qinghui-xu !
parquet-protobuf/pom.xml
Outdated
<artifactId>maven-antrun-plugin</artifactId> | ||
<groupId>com.github.os72</groupId> | ||
<artifactId>protoc-jar-maven-plugin</artifactId> | ||
<version>3.6.0.1</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this version match the protobuf version or is the plugin version unrelated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protoc (protobuf version) is defined by the plugin configuration protocVersion
, so any recent version of plugin should be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, I just wanted to be sure
parquet-protobuf/pom.xml
Outdated
@@ -159,28 +159,24 @@ | |||
</plugin> | |||
|
|||
<plugin> | |||
<artifactId>maven-antrun-plugin</artifactId> | |||
<groupId>com.github.os72</groupId> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change in configuration seems to be a duplicate of #506
If we can, I would suggest keeping it distinct from this PR as it addresses different needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I'm trying to have #506 merged before this PR.
Path file = writeMessages(dataV2); | ||
List<TestProto3SchemaV1.MessageSchema> messagesV1 = readMessages(file, TestProto3SchemaV1.MessageSchema.class); | ||
assertEquals(messagesV1.size(), 1); | ||
assertEquals(messagesV1.get(0).getOptionalLabelNumberPairValue(), 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to check that the label is returned as UNKNOWN_ENUM_VALUE_2
? As this mechanism is used to ensure that we can then correctly read the unknown enum and get its number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we need to test the label UNKNOWN_ENUM_VALUE_*
, it is generated only when calling protobuf reflection API, and we are interested mainly in being able to get the number back.
It should be a test for protobuf reflection API that tests by passing an unknown number to an enum field, we can get a label UNKNOWN_ENUM_VALUE_*
by using protobuf reflection API. Less relevant to test it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, got it, thanks!
@qinghui-xu Could you please remove the maven plugin part from this PR? Once it is removed I don't see anything left to address before we can merge 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but we need to merge the maven change #506 first 👍
d83198a
to
d72b678
Compare
Rebased on the lastest upstream master (with the maven-protoc-plugin) and force-pushed to the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, no new change with the rebase since my last review
LOG.info("No enum is written for " + enumNameNumberMapping.getKey()); | ||
} | ||
int idx = 0; | ||
for (Map.Entry<String, Integer> nameNumberPair : enumNameNumberMapping.getValue().entrySet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: since Parquet already depends on Guava, we can use Guava's Joiner class to stringify the map to spare this for loop, something like this:
Joiner.on(METADATA_ENUM_ITEM_SEPARATOR)
.withKeyValueSeparator(METADATA_ENUM_KEY_VALUE_SEPARATOR)
.join(enumNameNumberMapping.getValue())
I couldn't find anything to simplify the reading path. Just an idea, maybe it is better not to depend on 3rd party libs too much. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to not use guava if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to make the method easier to read, if we're not going with guava (which I do like, though I don't see it being used in parquet-proto
), we could extract the joining part into a dedicated method. Something like:
@Override
public FinalizedWriteContext finalizeWrite() {
Map<String, String> protoMetadata = new HashMap<>();
for (Map.Entry<String, Map<String, Integer>> enumNameNumberMapping : protoEnumBookKeeper.entrySet()) {
if (enumNameNumberMapping.getValue().isEmpty()) {
// No enum is ever written to any column of this file, put an empty string as the value in the metadata
LOG.info("No enum is written for " + enumNameNumberMapping.getKey());
}
String nameNumberPairs = joinNameNumberPairs(enumNameNumberMapping.getValue());
protoMetadata.put(METADATA_ENUM_PREFIX + enumNameNumberMapping.getKey(), nameNumberPairs);
}
return new FinalizedWriteContext(protoMetadata);
}
public String joinNameNumberPairs(Map<String, Integer> nameNumberMap) {
StringBuilder nameNumberPairs = new StringBuilder();
int idx = 0;
for (Map.Entry<String, Integer> nameNumberPair : nameNumberMap.entrySet()) {
nameNumberPairs.append(nameNumberPair.getKey())
.append(METADATA_ENUM_KEY_VALUE_SEPARATOR)
.append(nameNumberPair.getValue());
idx ++;
if (idx < nameNumberMap.size()) {
nameNumberPairs.append(METADATA_ENUM_ITEM_SEPARATOR);
}
}
return nameNumberPairs;
}
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply. @costimuraru
I refactor out all the logic of building enum metadata in a method. I think this would be more clear.
parquet-protobuf/src/test/java/org/apache/parquet/proto/TestUtils.java
Outdated
Show resolved
Hide resolved
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoRecordConverter.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Hey guys, what is the status on this PR? We are currently experiencing this issue. |
Seems like the guys missed to push it after approving. |
Hey, guys. |
3b9834c
to
fcab2a1
Compare
Protobuf can set enum field using number, while a number does not match any enum value defined in the schema, it is still accepted and a label "UNKNOWN_ENUM_<enumName>_<number>" is generated when we use protobuf reflection API (proto descriptors) to access it. And in parquet-protobuf, we rely on protobuf reflection API to convert forward/backward between the two world. There are two cases of unknown enum while using parquet-protobuf: 1. Protobuf already contains unknown enum when we write it to parquet (eg1. sometmes people set enum fields using numbers; eg2 writer deserialize data from wire and the sender can have a newer version of proto schema with new enum values). The behavior of parquet-protobuf writer as before this patch is to write a label "UNKNOWN_ENUM_<number>" as string in the enum column of parquet. And when we read it back as protobuf, we found this unknown label which does not match any enum def (even with the same schema as the sender in eg2) 2. Protobuf contains valid value when write to parquet, but the reader uses an outdated proto schema which misses some enum values. So the not-in-old-schema enum values are "unknown" to the reader. Previous behavior of parquet-proto reader is to reject in both cases with some runtime exception. To be able to handle the problems: We keep enum (name -> number) mapping in the parquet metadata, so that in read time, reader can discover the number and use protobuf reflection API to set enum number. Keep in mind though, for the case reading enum with outdated schema (case 2), the enum read back will have the right number, but the label is set to "UNKNOW_ENUM_<number>". So this feature is helpful only if the user is using number to manipulate enum data. And for old data containing "true" unknown value (thus case 1) created before this patch (thus name -> number mapping is not available), we now try to parse the string regarding to the "UNKNOWN_ENUM_<number>" pattern. If we read old data created before this patch (thus name -> number is not available), with an outdated schema, and we find some enum value not defined in the schema nor following "UNKNOWN_ENUM_*" pattern, we could either fail the job by raising an exception or treat the value as unknown enum with number -1, by setting a flag in the configuration. The name -> number mapping is a new metadata under the "parquet.proto.enum" namespace. The metadata for protobuf enum (label:number) mapping should follow some specific pattern, throw BadConfigurationException in read time if it is not. Tests for enum schema evolution (read/write with different protobuf schema) are added.
fcab2a1
to
a5af072
Compare
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
Outdated
Show resolved
Hide resolved
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
Outdated
Show resolved
Hide resolved
parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoMessageConverter.java
Outdated
Show resolved
Hide resolved
@qinghui-xu The CI is experiencing some connectivity issues, could you rebase against master to retrigger the build? |
@Fokko It's already on top of the master branch. Could you retrigger the CI by hand somehow? |
Trying to re-trigger Travis by close-reopen |
No description provided.