
[query] Upgrade spark to 3.3.0 and dataproc to 2.1 #12701

Merged (6 commits) on Feb 17, 2023

Conversation

daniel-goldstein (Contributor) commented on Feb 15, 2023

CHANGELOG: Query on Spark now officially supports Spark 3.3.0 and Dataproc 2.1.x

Tested on dataproc via make -C hail test-dataproc-37. Updating the dependencies introduced a few new linting checks, which I fixed here. Updating pyspark necessitated a couple of changes: a different py4j jar, and pyspark removed SparkSession._wrapped (though maybe we didn't need that anyway? Not sure). Most importantly, the newer Spark version brings with it a newer Jackson version, which is sufficient for the azure-storage-blob dependency, meaning we no longer need to build against two different Spark versions for spark and batch.
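The SparkSession._wrapped removal mentioned above can usually be absorbed with a small compatibility shim rather than branching on the pyspark version. This is a hedged sketch, not code from the PR: the helper name is invented, and it assumes callers only need the sql()/table() entry points that SparkSession itself exposes.

```python
def sql_entry_point(session):
    """Return the legacy SQLContext on pyspark < 3.3, else the session itself.

    pyspark 3.3.0 removed the private SparkSession._wrapped attribute.
    SparkSession exposes the same sql()/table() entry points directly,
    so falling back to the session is often sufficient for code that
    only used _wrapped as a query entry point.
    """
    return getattr(session, "_wrapped", session)
```

On pyspark < 3.3 this returns the old SQLContext unchanged; on 3.3+ it hands back the session, so call sites need no version checks.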

@@ -20,25 +20,22 @@ def init_parser(parser):


async def async_main(args):
Contributor

why make these changes? Do we report the username in the underlying exception?

Contributor Author

The linter complained about using too generic an error class; it wants us to subclass Exception. I removed the try/except because I felt it wasn't actually any more helpful than the underlying stack trace, but I didn't think about the username not being in there. I can add that back in if you want.
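For illustration, the kind of change such a lint rule asks for looks roughly like this. The exception class, function name, and message are all hypothetical, not the PR's actual code; the point is that a dedicated subclass replaces a bare raise Exception, and the username can be carried in the message to preserve the context the reviewer asked about.

```python
class UserLookupError(Exception):
    """Raised when a username cannot be resolved (hypothetical example)."""


def lookup_user(username):
    # Including the username in the message preserves the detail the
    # reviewer was worried about losing when the try/except was removed.
    raise UserLookupError(f"could not look up user {username!r}")
```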

@@ -15,6 +15,6 @@ parsimonious<0.9
 plotly>=5.5.0,<5.11
 protobuf==3.20.2
 PyJWT
-pyspark>=3.1.1,<3.2.0
+pyspark==3.3.0
Contributor

Seems reasonable to do pyspark>=3.3,<3.4?

Contributor Author

I guess I was distrustful of pyspark and wanted the exact same version that's in dataproc, but maybe that's unfounded.
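For reference, the two pinning styles under discussion would look like this in the requirements file (the version numbers come from the diff above; a real file would contain only one of the two lines):

```
pyspark==3.3.0     # exact pin: matches the version shipped in the Dataproc image
pyspark>=3.3,<3.4  # compatible range: accepts any 3.3.x patch release
```

Per PEP 440 specifier semantics, the range form permits patch upgrades such as 3.3.1, while the exact pin guarantees lockstep with the cluster's installed version.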

@danking danking merged commit 75f351d into hail-is:main Feb 17, 2023