[query] Upgrade spark to 3.3.0 and dataproc to 2.1 #12701
Force-pushed from 15d1aef to d8c3a23
Force-pushed from 5cb38e9 to ba6e1d8
@@ -20,25 +20,22 @@ def init_parser(parser):

async def async_main(args):
Why make these changes? Do we report the username in the underlying exception?
The linter complained about using too generic an error class; it wants us to subclass `Exception`. I removed the try/except because I felt it wasn't actually any more helpful than the stack trace underneath, but I didn't think about the username not being in there. I can add that back in if you want.
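For illustration, here is a minimal sketch of the pattern such a linter rule (e.g. pylint's `broad-exception-raised`) pushes toward. The names `DataprocError`, `start_cluster`, and `connect` are hypothetical, not Hail's actual code:

```python
class DataprocError(Exception):
    """Hypothetical named error class; raising bare Exception is what the linter flags."""


def start_cluster(username: str) -> None:
    # Stand-in for the real underlying call; raises to exercise the error path.
    raise RuntimeError('cluster backend unavailable')


def connect(username: str) -> None:
    try:
        start_cluster(username)
    except RuntimeError as e:
        # Chaining with `from e` keeps the original stack trace, and putting the
        # username in the message addresses the reviewer's question above.
        raise DataprocError(f'failed to start cluster as {username!r}') from e
```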
hail/python/requirements.txt
@@ -15,6 +15,6 @@ parsimonious<0.9
 plotly>=5.5.0,<5.11
 protobuf==3.20.2
 PyJWT
-pyspark>=3.1.1,<3.2.0
+pyspark==3.3.0
Seems reasonable to do `pyspark>=3.3,<3.4`?
I guess I was distrustful of pyspark and wanted exactly the same version that Dataproc ships, but maybe that is unfounded.
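For reference, the two pinning styles under discussion would look like this in requirements.txt (the exact pin tracks the Spark build in the Dataproc 2.1 image; the range pin would also pick up 3.3.x patch releases):

```
# Exact pin: same Spark version as the Dataproc 2.1 image.
pyspark==3.3.0

# Range pin (the suggestion above): accepts any 3.3.x release.
pyspark>=3.3,<3.4
```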
CHANGELOG: Query on Spark now officially supports Spark 3.3.0 and Dataproc 2.1.x
Tested on Dataproc via `make -C hail test-dataproc-37`. Updating the dependencies introduced a few new linting checks, which I fixed here. Updating pyspark necessitated a couple of changes: a different py4j jar, and `SparkSession._wrapped` was removed (but maybe we didn't need that anyway? not sure). Most importantly, the newer Spark version brings with it a newer Jackson version, which is sufficient for the azure-storage-blob dependency, meaning we no longer need to build against two different Spark versions for Spark and Batch.
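A minimal sketch of the `SparkSession._wrapped` change mentioned above: in PySpark releases before 3.3, `_wrapped` was an internal `SQLContext` commonly passed when wrapping JVM DataFrame handles; in 3.3 it is gone, and `DataFrame` accepts the `SparkSession` directly. The `wrap_jdf` helper here is hypothetical, not Hail's actual code:

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()


def wrap_jdf(jdf):
    # Hypothetical helper: wrap a JVM DataFrame handle as a Python DataFrame.
    # PySpark < 3.3:  return DataFrame(jdf, spark._wrapped)  # internal SQLContext
    # PySpark >= 3.3: _wrapped is removed; pass the SparkSession directly.
    return DataFrame(jdf, spark)
```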