[SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes

This is the follow-up PR to #37972 and #38212

### What changes were proposed in this pull request?
1. Move spark-protobuf error classes to the spark error-classes framework (core/src/main/resources/error/error-classes.json).
2. Support protobuf imports.
3. Validate protobuf timestamp and duration types.

### Why are the changes needed?
N/A

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Existing tests should cover the validation of this PR.

CC: rangadi mposdev21 gengliangwang

Closes #38344 from SandishKumarHN/SPARK-40777-ProtoErrorCls.

Authored-by: SandishKumarHN <sanysandish@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
1 parent d1dfa43, commit 5741d38
Showing 20 changed files with 625 additions and 191 deletions.
5741d38
In many places (microservices in particular), engineers love to use the `oneof` data type and circular references in their schema models for the sake of flexibility, whereas handling them in a data warehouse or data lake is non-trivial. We can't simply reject these use cases by erroring out on a circular reference just to prevent infinite looping during schema parsing. I therefore propose the following configuration parameters, which let users choose how circular references are handled.
    protobufDescriptorConfig: {
      descriptorFilePath: /dbfs/FileStore/users/xinyu_liu/protobuf/event-trading-prime.desc
      messageName: MaterializedEvent
      circularReferenceTolerance: 0
      circularReferenceType: field_name
    }
Here, `circularReferenceType` has two enum values:

- `FIELD_NAME`: when navigating a Protobuf schema, a repeated **fully-qualified field name** is considered a circular reference.
- `FIELD_TYPE`: when navigating a Protobuf schema, a repeated **field type** is considered a circular reference.

`circularReferenceTolerance` is an Int and may take a value of -1, 0, 1, 2, ...:

- `circularReferenceTolerance=-1`: a RuntimeException is raised when a circular reference is detected.
- `circularReferenceTolerance=0`: the field is dropped as soon as it is entered again.
- `circularReferenceTolerance=1`: the same Protobuf message name/type may be entered twice, and is dropped the third time it is encountered.

Hope this design is simple yet flexible enough to help engineers cope with circular references in schemas.
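To make the proposed semantics concrete, here is a minimal Python sketch of how the two settings could interact during schema traversal. It is only an illustration under stated assumptions, not the spark-protobuf implementation: the `Field` class, the dict-based schema model, and the `expand` function are hypothetical, the "fully-qualified field name" is taken to be `<containing message>.<field name>`, and the root message itself is not counted as an entry.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    name: str       # field name inside its parent message
    type_name: str  # scalar type such as "string", or a message type name


def expand(schema: Dict[str, List[Field]], message: str,
           ref_type: str = "FIELD_TYPE", tolerance: int = 0,
           _path: Tuple[str, ...] = (), _prefix: str = "") -> List[str]:
    """Flatten `message` into dotted leaf-field paths, bounding recursion."""
    leaves: List[str] = []
    for f in schema[message]:
        dotted = _prefix + f.name
        if f.type_name not in schema:  # not a message type: scalar leaf
            leaves.append(dotted)
            continue
        # Assumption: FIELD_TYPE keys on the message type being entered,
        # FIELD_NAME keys on "<containing message>.<field name>".
        key = f.type_name if ref_type == "FIELD_TYPE" else f"{message}.{f.name}"
        seen = _path.count(key)  # earlier entries of this key on the current path
        if seen > 0:
            if tolerance < 0:
                raise RuntimeError(f"circular reference at {dotted} ({key})")
            if seen > tolerance:
                continue  # tolerance exhausted: drop the field
        leaves.extend(expand(schema, f.type_name, ref_type, tolerance,
                             _path + (key,), dotted + "."))
    return leaves


# A self-referencing message: Person { string name; Person friend; }
schema = {"Person": [Field("name", "string"), Field("friend", "Person")]}
print(expand(schema, "Person", tolerance=0))  # ['name', 'friend.name']
print(expand(schema, "Person", tolerance=1))  # ['name', 'friend.name', 'friend.friend.name']
```

With `tolerance=-1`, the same call raises a `RuntimeError` the first time `friend` is re-entered. Counting occurrences on the current path only (rather than globally) means sibling fields of the same type are not mistaken for cycles.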
The above is a follow-up to the delightful discussion with Sandish.
Thank you
Xinyu Liu