
Conversation

vrozov
Member

@vrozov vrozov commented Aug 22, 2025

What changes were proposed in this pull request?

Upgrade Hive dependency to 4.1.0

Why are the changes needed?

Apache Hive 1.x, 2.x and 3.x are EOL and are no longer supported by the Apache Hive community.

Does this PR introduce any user-facing change?

Yes, it drops support for EOL versions of Hive 1.x, 2.x and 3.x.

There is no change in Spark behavior, but there are several changes in Hive 4.x compared to Hive 2.3:

  • Hive managed tables
  • Indices are not supported
  • ORC invalid Gregorian dates are handled differently compared to Spark and prior Hive versions

How was this patch tested?

A few tests required modification due to changed Hive behavior; otherwise, all existing Spark tests pass.

Was this patch authored or co-authored using generative AI tooling?

No

@sarutak
Member

sarutak commented Aug 27, 2025

@vrozov
Thank you for your contribution! I'll leave high-level comments.

  • There seem to be a lot of non-trivial changes needed to upgrade Hive to 4.1, so it's better to explain what changes are needed, for what purpose and why, and what and how you tested. This helps reviewers.
  • I'm concerned that upgrading Hive to 4.1 and dropping 2.3 may have a significant impact on users. Can we switch between 2.3 and 4.1 using a profile (-P flag) when we build and, for the time being, use 2.3 by default? (See the sketch after this list.)
  • This PR is large and the changes span a wide range of files. To make review easier and avoid conflicts with other PRs, can we break this PR into smaller ones somehow? If we can switch between 4.1 and 2.3 as mentioned above, we don't need to complete the upgrade to 4.1 at once.
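
A minimal sketch of how such a switch could look in the parent pom.xml, assuming hypothetical hive-2.3 and hive-4.1 profiles that only pin the Hive version (the profile ids, property values, and default activation are illustrative assumptions, not taken from this PR):

<!-- Hypothetical Maven profiles for selecting the built-in Hive version at build time. -->
<profiles>
  <profile>
    <!-- Default profile: keeps the current Hive 2.3 line unless -Phive-4.1 is passed. -->
    <id>hive-2.3</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <hive.version>2.3.10</hive.version>
    </properties>
  </profile>
  <profile>
    <!-- Opt-in profile for the Hive 4.1 upgrade proposed in this PR. -->
    <id>hive-4.1</id>
    <properties>
      <hive.version>4.1.0</hive.version>
    </properties>
  </profile>
</profiles>

With something like this in place, building against 4.1 would be a matter of passing -Phive-4.1 to build/mvn, while the default build stays on 2.3.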

@dongjoon-hyun Do you have other suggestions, since you have experience adding 2.3 support?

@dongjoon-hyun
Member

To @sarutak, I have no opinion on the PR because apparently this PR hasn't passed CI yet, and what I can say for now is that the Apache Spark community doesn't allow the kind of regression this PR introduces. I'm not sure why the Hive 4.1 upgrade proposal forces us to downgrade dependencies. I hope it's a mistake, and we should make sure to avoid this kind of hidden change.

- <antlr4.version>4.13.1</antlr4.version>
+ <antlr4.version>4.9.3</antlr4.version>

@sarutak
Member

sarutak commented Aug 27, 2025

@dongjoon-hyun
I didn't intend to ask you to look into each part of the changes, as it didn't pass CI as you mentioned. I wanted to ask for high-level comments if you had any, because upgrading Hive should be done carefully.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 27, 2025

Sorry, but I'm not using Apache Hive these days in any case, not even a HiveMetaStore. So it's a little hard for me to help with this PR.

According to my previous experience, I can say that this PR requires significant verification efforts even after CI passes.

May I ask if @vrozov and @sarutak use Hive 4.1 or this patch internally at the production level with AWS Glue? If so, it would be a great reference for the community.

@sarutak
Member

sarutak commented Aug 28, 2025

@dongjoon-hyun
Thank you for your comment. Yes, I agree that we need to verify very carefully, so I'm wondering if we can minimize the risk by allowing builds with both 2.3 (as the default) and 4.1 transitionally, like we supported building with both Scala 2.12 and 2.13 before.

May I ask if @vrozov and @sarutak use Hive 4.1 or this patch internally at the production level with AWS Glue? If so, it would be a great reference for the community.

@vrozov Can you share the background if possible?

@vrozov
Member Author

vrozov commented Aug 29, 2025

To @sarutak, I have no opinion on the PR because apparently this PR hasn't passed CI yet, and what I can say for now is that the Apache Spark community doesn't allow the kind of regression this PR introduces. I'm not sure why the Hive 4.1 upgrade proposal forces us to downgrade dependencies. I hope it's a mistake, and we should make sure to avoid this kind of hidden change.

- <antlr4.version>4.13.1</antlr4.version>
+ <antlr4.version>4.9.3</antlr4.version>

@dongjoon-hyun @sarutak Unfortunately, it is not a mistake. Hive uses ANTLR 4.9.3, which is not compatible with 4.10.x and above: "Mixing ANTLR 4.9.3 and 4.10 can lead to errors that point to a version mismatch. A very common Java error looks like this: Caused by: java.io.InvalidClassException: org.antlr.v4.runtime.atn.ATN; Could not deserialize ATN with version 3 (expected 4)." This is exactly the error I got without changing antlr4.version.

One possible solution is to shade antlr4 in Spark, so there will be no conflict between the Spark and Hive versions. Please let me know what you think; if that sounds reasonable, I'll open a separate PR for shading.
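
For reference, a minimal sketch of what such shading could look like with the maven-shade-plugin, assuming the ANTLR runtime is relocated under the org.sparkproject prefix Spark already uses for other shaded dependencies (the artifact selection, relocation pattern, and shaded package are illustrative, not part of this PR):

<!-- Sketch: relocate Spark's ANTLR 4 runtime so it cannot clash with the ANTLR 4.9.3 runtime pulled in by Hive. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- Only shade the ANTLR runtime artifact. -->
        <artifactSet>
          <includes>
            <include>org.antlr:antlr4-runtime</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>org.antlr.v4.runtime</pattern>
            <shadedPattern>org.sparkproject.antlr.v4.runtime</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>

With a relocation along these lines, the shaded module's classes would reference the relocated runtime, so Hive's ANTLR 4.9.3 classes would no longer conflict on the classpath.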

Member

@sarutak sarutak left a comment


@vrozov
I have not yet looked into all the changes deeply, but I'll leave some comments.
There seem to be some compatibility breaks in this PR. I think we should reach consensus before we make critical changes.

MVN="build/mvn"
HADOOP_HIVE_PROFILES=(
- hadoop-3-hive-2.3
+ hadoop-3-hive-4.0
Member


4.1?

# Generate manifests for each Hadoop profile:
for HADOOP_HIVE_PROFILE in "${HADOOP_HIVE_PROFILES[@]}"; do
- if [[ $HADOOP_HIVE_PROFILE == **hadoop-3-hive-2.3** ]]; then
+ if [[ $HADOOP_HIVE_PROFILE == **hadoop-3-hive-4.0** ]]; then
Member


ditto.

sql/core/src/test/resources/SPARK-33084.jar
sql/core/src/test/resources/artifact-tests/udf_noA.jar
sql/hive-thriftserver/src/test/resources/TestUDTF.jar
sql/hive/src/test/noclasspath/hive-test-udfs.jar
Member


Why are this and the related changes needed? If there is a reason, it should be explained in the PR description.

<jaxb.version>2.2.11</jaxb.version>
<libthrift.version>0.16.0</libthrift.version>
- <antlr4.version>4.13.1</antlr4.version>
+ <antlr4.version>4.9.3</antlr4.version>
Member


Basically, we don't allow downgrading dependencies, as @dongjoon-hyun mentioned.
If there is a dependency conflict between Spark and Hive, as mentioned here, we should first discuss how to resolve it.
Also, please don't make this kind of change without explanation.

-Djdk.reflect.useDirectMethodHandle=false
-Dio.netty.tryReflectionSetAccessible=true
--enable-native-access=ALL-UNNAMED
-Dmvn.executable=${maven.multiModuleProjectDirectory}/build/mvn
Member


Why is this change needed?

}
DeserializerLock.synchronized {
- deserializer.initialize(hconf, props)
+ deserializer.initialize(hconf, props, partProps)
Member


Why do we need to pass partProps as the third parameter while null is passed to the other initialize calls?


def listFunctions(hive: Hive, db: String, pattern: String): Seq[String]

def dropIndex(hive: Hive, dbName: String, tableName: String, indexName: String): Unit
Member


This kind of change seems to break compatibility.
Did you test that old Hive metastores work with this change?

protected val versions = if (testVersions.nonEmpty) {
testVersions.get.split(",").map(_.trim).filter(_.nonEmpty).toIndexedSeq
} else {
IndexedSeq("2.0", "2.1", "2.2", "2.3", "3.0", "3.1", "4.0", "4.1")
Member


Why should we drop tests with old clients? Upgrading the built-in Hive and the supported metastore versions should be discussed separately.

hadoopConf.set("hive.metastore.schema.verification", "false")
// Since Hive 3.0, HIVE-19310 skipped `ensureDbInit` if `hive.in.test=false`.
- if (version == "3.0" || version == "3.1" || version == "4.0") {
+ if (version == "3.0" || version == "3.1" || version == "4.1") {
Member


Why is 4.0 excluded?

- if (version == "3.0" || version == "3.1" || version == "4.0") {
+ if (version == "3.0" || version == "3.1" || version == "4.1") {
hadoopConf.set("hive.in.test", "true")
hadoopConf.set("hive.query.reexecution.enabled", "false")
Member


Why is this change needed?

@vrozov vrozov changed the title from [SPARK-51348][BUILD][SQL] Upgrade Hive to 4.1 to [SPARK-52408][BUILD][SQL] Upgrade Hive to 4.1 on Sep 23, 2025
@vrozov
Member Author

vrozov commented Oct 1, 2025

@dongjoon-hyun #52490 WDYT?
