Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40657] Add support for Java classes in Protobuf functions #38286

Closed
wants to merge 19 commits into from

Conversation

rangadi
Copy link
Contributor

@rangadi rangadi commented Oct 17, 2022

What changes were proposed in this pull request?

Adds support for compiled Java classes to Protobuf functions. This is tested with Protobuf v3 classes. V2 vs V3 issues will be handled in a separate PR. The main changes in this PR:

  • Changes to top level API:
    • Adds new version that takes just the class name.
    • Changes the order of arguments for existing API with descriptor files (messageName and descFilePath are swapped).
  • Protobuf utils methods to create descriptor from Java class name.
  • Many unit tests are update to check both versions : (1) with descriptor file and (2) with Java class name.
  • Maven build updates to generate Java classes to use in tests.
  • Miscellaneous changes:
    • Adds proto to package name in proto files used in tests.
    • A few TODO comments about improvements

Why are the changes needed?

Java compiled classes is a common method for users to provide Protobuf definitions.

Does this PR introduce any user-facing change?

No.
This updates interface, but for a new feature in active development.

How was this patch tested?

  • Unit tests

@rangadi
Copy link
Contributor Author

rangadi commented Oct 17, 2022

Copy link
Contributor Author

@rangadi rangadi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few inline comments for reviewers.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@rangadi
Copy link
Contributor Author

rangadi commented Oct 18, 2022

@SandishKumarHN, @mposdev21 PTAL. All the checks pass and I addressed all the comments.
Once you LGTM, I can ping committers.

connector/protobuf/pom.xml Show resolved Hide resolved
@github-actions github-actions bot added the INFRA label Oct 18, 2022
Copy link
Contributor

@SandishKumarHN SandishKumarHN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sbt tests are passing now.

@rangadi
Copy link
Contributor Author

rangadi commented Oct 19, 2022

@SandishKumarHN pyspark tests are failing could you take a look? I will look at other errors (though they seem unrelated)
https://github.com/rangadi/spark/actions/runs/3283371018/jobs/5408025513

image

@SandishKumarHN
Copy link
Contributor

@SandishKumarHN pyspark tests are failing could you take a look? I will look at other errors (though they seem unrelated) https://github.com/rangadi/spark/actions/runs/3283371018/jobs/5408025513

image

@rangadi since this feature is swapping the from_protobuf and to_protobuf params, descFilePath, messageName -> messageName, descFilePath. the same change has to be made in python/pyspark/sql/protobuf/functions.py

@rangadi
Copy link
Contributor Author

rangadi commented Oct 19, 2022

@SandishKumarHN thanks. That makes sense. I will fix that.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rangadi
Copy link
Contributor Author

rangadi commented Oct 20, 2022

@gengliangwang, @SandishKumarHN some tests related timezone and timestamp type are failing in Run / Build modules: catalyst, hive-thriftserver

They seem unrelated to this PR. Any idea what might be causing this?

image

@gengliangwang
Copy link
Member

@rangadi I check the latest run and it fails the python check
https://github.com/rangadi/spark/runs/9009122825
The catalyst tests passed.

@amaliujia
Copy link
Contributor

@gengliangwang, @SandishKumarHN some tests related timezone and timestamp type are failing in Run / Build modules: catalyst, hive-thriftserver

They seem unrelated to this PR. Any idea what might be causing this?

image

I saw this once yestearday. Tried retrigger the test and it passed.

@SandishKumarHN
Copy link
Contributor

@gengliangwang, @SandishKumarHN some tests related timezone and timestamp type are failing in Run / Build modules: catalyst, hive-thriftserver

They seem unrelated to this PR. Any idea what might be causing this?

image

@rangadi test failures are not related to PR. why don't we rerun the failed test? and I see the below test failing consistently. tried merging with the master?

Start checking the generated codes in pyspark-connect.
RUN: /__w/spark/spark/connector/connect/dev/generate_protos.sh /tmp/tmpiuin_p7y
Different files: ['base_pb2.pyi', 'commands_pb2.pyi', 'expressions_pb2.pyi', 'relations_pb2.pyi', 'types_pb2.pyi']
Generated files for pyspark-connect are out of sync! Please run ./connector/connect/dev/generate_protos.sh
Error: Process completed with exit code 255.

@rangadi
Copy link
Contributor Author

rangadi commented Oct 21, 2022

tried merging with the master?

I think it merges with master before running the tests.

@HeartSaVioR
Copy link
Contributor

I'm merging this since it only fails with Spark client python generation and @gengliangwang approved it.

@HeartSaVioR
Copy link
Contributor

Thanks @rangadi for your contribution! I merged into master.

@@ -24,6 +24,7 @@ import org.apache.spark.sql.{RandomDataGenerator, Row}
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, NoopFilters, OrderedFilters, StructFilters}
import org.apache.spark.sql.catalyst.expressions.{ExpressionEvalHelper, GenericInternalRow, Literal}
import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, GenericArrayData, MapData}
import org.apache.spark.sql.protobuf.protos.CatalystTypes.BytesMsg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am struggling in generating this file in IDEA:
Screenshot 2022-10-28 at 17 02 04

Is any way to generate all required dependencies?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MaxGekk did you import it as maven or sbt project? Both should work though.
With maven, 'generating sources' should fix that (in maven 'Tool Window'). I am also seeing similar issue with it, IntelliJ does not seem to run the generate-test-sources phase.

I tried just now and running ../../build/mvn test in protobuf directory fixed this issue.

image

DeZepTup pushed a commit to DeZepTup/spark-custom that referenced this pull request Oct 31, 2022
### What changes were proposed in this pull request?

Adds support for compiled Java classes to Protobuf functions. This is tested with Protobuf v3 classes. V2 vs V3 issues will be handled in a separate PR. The main changes in this PR:

 - Changes to top level API:
    - Adds new version that takes just the class name.
    - Changes the order of arguments for existing API with descriptor files (`messageName` and `descFilePath` are swapped).
 - Protobuf utils methods to create descriptor from Java class name.
 - Many unit tests are update to check both versions : (1) with descriptor file and (2) with Java class name.
 - Maven build updates to generate Java classes to use in tests.
 - Miscellaneous changes:
    - Adds `proto` to package name in `proto` files used in tests.
    - A few TODO comments about improvements

### Why are the changes needed?
Java compiled classes is a common method for users to provide Protobuf definitions.

### Does this PR introduce _any_ user-facing change?
No.
This updates interface, but for a new feature in active development.

### How was this patch tested?
 - Unit tests

Closes apache#38286 from rangadi/protobuf-java.

Authored-by: Raghu Angadi <raghu.angadi@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
### What changes were proposed in this pull request?

Adds support for compiled Java classes to Protobuf functions. This is tested with Protobuf v3 classes. V2 vs V3 issues will be handled in a separate PR. The main changes in this PR:

 - Changes to top level API:
    - Adds new version that takes just the class name.
    - Changes the order of arguments for existing API with descriptor files (`messageName` and `descFilePath` are swapped).
 - Protobuf utils methods to create descriptor from Java class name.
 - Many unit tests are update to check both versions : (1) with descriptor file and (2) with Java class name.
 - Maven build updates to generate Java classes to use in tests.
 - Miscellaneous changes:
    - Adds `proto` to package name in `proto` files used in tests.
    - A few TODO comments about improvements

### Why are the changes needed?
Java compiled classes is a common method for users to provide Protobuf definitions.

### Does this PR introduce _any_ user-facing change?
No.
This updates interface, but for a new feature in active development.

### How was this patch tested?
 - Unit tests

Closes apache#38286 from rangadi/protobuf-java.

Authored-by: Raghu Angadi <raghu.angadi@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants