[SPARK-52408][BUILD][SQL] Upgrade Hive to 4.1 #52099
base: master
Conversation
@vrozov
@dongjoon-hyun Do you have other suggestions, since you have prior experience here?
To @sarutak, I have no opinion on the PR because it apparently hasn't passed CI yet. What I can say for now is that the Apache Spark community doesn't allow this kind of regression (in this PR). I'm not sure why the Hive 4.1 upgrade proposal forces us to downgrade dependencies. I hope it's a mistake, and we should make sure to avoid this kind of hidden change.

```diff
- <antlr4.version>4.13.1</antlr4.version>
+ <antlr4.version>4.9.3</antlr4.version>
```
@dongjoon-hyun
Sorry, but I'm not using Apache Hive these days in any case, including even a HiveMetaStore, so it's a little hard for me to help with this PR. Based on my previous experience, I can say that this PR will require significant verification effort even after CI passes. May I ask if @vrozov and @sarutak use Hive 4.1 or this patch internally at the production level with AWS Glue? If so, it would be a great reference for the community.
@dongjoon-hyun
@vrozov Can you share the background, if possible?
@dongjoon-hyun @sarutak Unfortunately it is not a mistake. Hive uses ANTLR 4.9.3, which is not compatible with 4.10.x and above: "Mixing ANTLR 4.9.3 and 4.10 can lead to errors that point to a version mismatch. A very common Java error looks like this: …" One possible solution is to shade antlr4 in Spark, so there would be no conflict between the Spark and Hive versions. Please let me know what you think; if that sounds reasonable, I'll open a separate PR for the shading.
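For concreteness, the shading proposal could look something like the following maven-shade-plugin relocation of the ANTLR runtime packages. This is a hypothetical sketch, not part of this PR: the module it would live in and the `org.sparkproject` relocation prefix are assumptions.

```xml
<!-- Hypothetical sketch: relocate Spark's ANTLR 4.13.1 runtime classes so they
     cannot collide with the ANTLR 4.9.3 runtime pulled in by Hive 4.x. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <includes>
            <include>org.antlr:antlr4-runtime</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>org.antlr.v4.runtime</pattern>
            <shadedPattern>org.sparkproject.antlr.v4.runtime</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With a relocation like this, Spark's generated parser would reference the shaded package while Hive continues to load the unshaded 4.9.3 runtime from its own dependency tree.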
@vrozov
I have not yet looked into all the changes deeply, but I'll leave some comments.
There seem to be some compatibility breaks in this PR. I think we should reach consensus before making critical changes.
```diff
 MVN="build/mvn"
 HADOOP_HIVE_PROFILES=(
-  hadoop-3-hive-2.3
+  hadoop-3-hive-4.0
```
`4.1`?
```diff
 # Generate manifests for each Hadoop profile:
 for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
-  if [[ $HADOOP_HIVE_PROFILE == **hadoop-3-hive-2.3** ]]; then
+  if [[ $HADOOP_HIVE_PROFILE == **hadoop-3-hive-4.0** ]]; then
```
ditto.
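As an aside on the glob syntax in the hunk above: inside `[[ ... ]]`, bash treats the right-hand side of `==` as a pattern, and consecutive `*`s collapse, so `**hadoop-3-hive-2.3**` matches any string containing that profile name. A minimal standalone sketch (not from the PR; the echo bodies are placeholders for the real manifest generation):

```shell
#!/usr/bin/env bash
# Sketch: how the manifest loop dispatches on the profile name via glob match.
HADOOP_HIVE_PROFILES=(hadoop-3-hive-2.3 hadoop-3-hive-4.0)
for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
  if [[ $HADOOP_HIVE_PROFILE == **hadoop-3-hive-2.3** ]]; then
    echo "$HADOOP_HIVE_PROFILE: generate Hive 2.3 dependency manifest"
  elif [[ $HADOOP_HIVE_PROFILE == **hadoop-3-hive-4.0** ]]; then
    echo "$HADOOP_HIVE_PROFILE: generate Hive 4.0 dependency manifest"
  fi
done
```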
```
sql/core/src/test/resources/SPARK-33084.jar
sql/core/src/test/resources/artifact-tests/udf_noA.jar
sql/hive-thriftserver/src/test/resources/TestUDTF.jar
sql/hive/src/test/noclasspath/hive-test-udfs.jar
```
Why are this and the related changes needed? If there is a reason, it should be explained in the PR description.
```diff
 <jaxb.version>2.2.11</jaxb.version>
 <libthrift.version>0.16.0</libthrift.version>
-<antlr4.version>4.13.1</antlr4.version>
+<antlr4.version>4.9.3</antlr4.version>
```
Basically, we don't allow downgrading dependencies, as @dongjoon-hyun mentioned.
If there is a dependency conflict between Spark and Hive, as mentioned here, we should first discuss how to resolve it.
Also, please don't make this kind of change without an explanation.
```
-Djdk.reflect.useDirectMethodHandle=false
-Dio.netty.tryReflectionSetAccessible=true
--enable-native-access=ALL-UNNAMED
-Dmvn.executable=${maven.multiModuleProjectDirectory}/build/mvn
```
Why is this change needed?
```diff
 }
 DeserializerLock.synchronized {
-  deserializer.initialize(hconf, props)
+  deserializer.initialize(hconf, props, partProps)
```
Why do we need to pass `partProps` as the third parameter here, while `null` is passed to the other `initialize` calls?
```
def listFunctions(hive: Hive, db: String, pattern: String): Seq[String]

def dropIndex(hive: Hive, dbName: String, tableName: String, indexName: String): Unit
```
This kind of change seems to break compatibility.
Did you test that old Hive metastores still work with this change?
```
protected val versions = if (testVersions.nonEmpty) {
  testVersions.get.split(",").map(_.trim).filter(_.nonEmpty).toIndexedSeq
} else {
  IndexedSeq("2.0", "2.1", "2.2", "2.3", "3.0", "3.1", "4.0", "4.1")
```
Why should we drop tests with old clients? Upgrading the builtin Hive and the supported metastore versions should be discussed separately.
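For illustration, the override-versus-default selection in the `versions` snippet above (split a comma-separated override on commas, trim, drop empties, otherwise fall back to the defaults) can be sketched standalone in shell. The `SPARK_TEST_HIVE_CLIENT_VERSIONS` variable name is a hypothetical stand-in for however `testVersions` is actually supplied:

```shell
#!/usr/bin/env bash
# Sketch of the fallback logic: an explicit comma-separated list wins,
# otherwise the full default matrix of client versions is used.
default_versions=(2.0 2.1 2.2 2.3 3.0 3.1 4.0 4.1)
if [[ -n "${SPARK_TEST_HIVE_CLIENT_VERSIONS:-}" ]]; then
  IFS=',' read -ra raw <<< "$SPARK_TEST_HIVE_CLIENT_VERSIONS"
  versions=()
  for v in "${raw[@]}"; do
    v="${v//[[:space:]]/}"           # trim whitespace
    [[ -n $v ]] && versions+=("$v")  # drop empty entries
  done
else
  versions=("${default_versions[@]}")
fi
echo "${versions[@]}"
```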
```diff
 hadoopConf.set("hive.metastore.schema.verification", "false")
 // Since Hive 3.0, HIVE-19310 skipped `ensureDbInit` if `hive.in.test=false`.
-if (version == "3.0" || version == "3.1" || version == "4.0") {
+if (version == "3.0" || version == "3.1" || version == "4.1") {
```
Why is `4.0` excluded?
```diff
-if (version == "3.0" || version == "3.1" || version == "4.0") {
+if (version == "3.0" || version == "3.1" || version == "4.1") {
   hadoopConf.set("hive.in.test", "true")
   hadoopConf.set("hive.query.reexecution.enabled", "false")
```
Why is this change needed?
@dongjoon-hyun #52490 WDYT?
### What changes were proposed in this pull request?
Upgrade the Hive dependency to 4.1.0.

### Why are the changes needed?
Apache Hive 1.x, 2.x, and 3.x are EOL and are no longer supported by the Apache Hive community.

### Does this PR introduce any user-facing change?
Yes, it drops support for the EOL Hive versions 1.x, 2.x, and 3.x.
There is no change in Spark behavior, but there are several changes in Hive 4.x compared to Hive 2.3:

### How was this patch tested?
A few tests required modification due to changed Hive behavior; otherwise, all existing Spark tests pass.

### Was this patch authored or co-authored using generative AI tooling?
No