From SandishKumarHN (sanysandish@gmail.com) and Mohan Parthasarathy (mposdev21@gmail.com)
Introduction
Protocol Buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. They are widely used in Kafka-based data pipelines. Spark has native support for Avro but not for Protobuf. This PR adds two new functions, from_proto and to_proto, to read and write Protobuf data within a DataFrame.
The implementation is closely modeled after the Avro implementation so that the changes are easy to understand and review.
The following is an example of typical usage.
The new functions are very similar to their Avro counterparts.
What is supported
What is not supported
Test cases covered
Tests have been written at several levels:
ProtoFunctionSuite
A set of roundtrip tests that go through to_proto(from_proto) or from_proto(to_proto) and compare the result with the original input. Some of the tests are repeated with to_proto called without a descriptor file, in which case the Protobuf descriptor is built from the Catalyst types.
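The roundtrip property these tests rely on can be sketched in a few lines. This is only an illustration of the pattern, not the suite's code: `to_bytes` and `from_bytes` are hypothetical stand-ins (canonical JSON plays the role of the Protobuf binary format), and none of the names below come from this PR or from Spark.

```python
import json

# Hypothetical stand-in codec: canonical JSON bytes play the role of
# Protobuf binary. These are NOT the PR's to_proto/from_proto functions.
def to_bytes(record: dict) -> bytes:
    # sort_keys and fixed separators make the encoding deterministic,
    # so re-encoding a decoded payload yields identical bytes.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")

def from_bytes(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))

def roundtrip_record(record: dict) -> dict:
    # Analogue of from_proto(to_proto(x)): structured -> binary -> structured.
    return from_bytes(to_bytes(record))

def roundtrip_bytes(payload: bytes) -> bytes:
    # Analogue of to_proto(from_proto(b)): binary -> structured -> binary.
    return to_bytes(from_bytes(payload))

sample = {"id": 1, "name": "spark", "tags": ["a", "b"], "nested": {"ok": True}}
payload = to_bytes(sample)

# Both directions must return the original input unchanged.
assert roundtrip_record(sample) == sample
assert roundtrip_bytes(payload) == payload
```

Checking both directions matters because a codec can preserve the structured value while still producing different bytes on re-encoding, or the reverse.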
ProtoSerdeSuite
ProtoCatalystDataConversionSuite