build: Drop Spark 3.2 support #581

Merged: 12 commits merged into apache:main on Jun 18, 2024

Conversation

huaxingao (Contributor):

Which issue does this PR close?

Closes #565.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

huaxingao (Contributor, Author):

cc @andygrove @viirya @kazuyukitanimura @parthchandra
Could you please take a look when you have a moment? Thanks!

kazuyukitanimura (Contributor) left a comment:

It looks like installation.md and overview.md still mention 3.2.

We can also remove the spark-3.2 shims.

Additionally, we can remove a few more things, e.g. ShimCometParquetUtils, the GitHub Actions workflows for 3.2, etc.

kazuyukitanimura (Contributor) left a comment:

Looking good, but a few more things:

The GitHub Actions CI for 3.2 should be dropped.
ShimCometBatchScanExec can also be cleaned up, i.e. by moving keyGroupedPartitioning and inputPartitions into CometBatchScanExec (a rough sketch follows below).
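For illustration, a minimal sketch of that kind of cleanup, assuming member shapes like those on Spark's own BatchScanExec; this is not Comet's actual class hierarchy, only the direction of the suggestion:

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.connector.read.InputPartition
import org.apache.spark.sql.execution.datasources.v2.BatchScanExec

// Simplified sketch only: once Spark 3.2 is dropped, these members no longer
// need to live behind ShimCometBatchScanExec and can be declared directly on
// CometBatchScanExec, delegating to the wrapped Spark scan node.
case class CometBatchScanExec(wrapped: BatchScanExec) {
  // Available on BatchScanExec in Spark 3.3+, which the 3.2 shim had to work around.
  def keyGroupedPartitioning: Option[Seq[Expression]] = wrapped.keyGroupedPartitioning
  // Exposed under this name in Spark 3.3+; the 3.2 shim had to provide it differently.
  @transient lazy val inputPartitions: Seq[InputPartition] = wrapped.inputPartitions
}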

kazuyukitanimura (Contributor) left a comment:

LGTM, pending CI.

}

// TODO: remove after dropping Spark 3.2 support and directly call new FileScanRDD
// TODO: remove after dropping Spark 3.4 support and directly call new FileScanRDD
Member:

3.4 or 3.3? I don't see 3.4 explicitly mentioned anywhere else.

huaxingao (Contributor, Author):

It should be 3.4, because FileScanRDD has a different signature in 4.0.
Here is the 4.0 signature:

class FileScanRDD(
    @transient private val sparkSession: SparkSession,
    readFunction: (PartitionedFile) => Iterator[InternalRow],
    @transient val filePartitions: Seq[FilePartition],
    val readSchema: StructType,
    val metadataColumns: Seq[AttributeReference] = Seq.empty,
    metadataExtractors: Map[String, PartitionedFile => Any] = Map.empty,
    options: FileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(Map.empty)))

Here is the 3.4 signature:

class FileScanRDD(
    @transient private val sparkSession: SparkSession,
    readFunction: (PartitionedFile) => Iterator[InternalRow],
    @transient val filePartitions: Seq[FilePartition],
    val readSchema: StructType,
    val metadataColumns: Seq[AttributeReference] = Seq.empty,
    options: FileSourceOptions = new FileSourceOptions(CaseInsensitiveMap(Map.empty)))

Member:

I see. Thanks.

Member:

How about 3.3? Is it also different from Spark 3.4?

huaxingao (Contributor, Author):

Yes, 3.3 is also different from 3.4. Here is the 3.3 signature:

class FileScanRDD(
    @transient private val sparkSession: SparkSession,
    readFunction: (PartitionedFile) => Iterator[InternalRow],
    @transient val filePartitions: Seq[FilePartition],
    val readSchema: StructType,
    val metadataColumns: Seq[AttributeReference] = Seq.empty)

Spark 3.5 has the same signature as Spark 4.0.
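For context, these differing constructors are why a per-version shim is needed at all. Below is a minimal sketch of such a shim for the Spark 3.3 build; the trait and helper names (ShimFileScanRDD, newFileScanRDD) are illustrative assumptions, not necessarily Comet's actual shim API:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{FilePartition, FileScanRDD, PartitionedFile}
import org.apache.spark.sql.types.StructType

// Hypothetical Spark 3.3 shim: the 3.3 constructor has no FileSourceOptions
// parameter, so only the arguments 3.3 accepts are passed (metadataColumns
// keeps its default). A 3.4 shim would additionally pass `options`, and a
// 3.5/4.0 build could also pass `metadataExtractors`. Once only versions with
// the newer constructor remain, callers can invoke `new FileScanRDD(...)`
// directly, which is what the TODO comments above refer to.
trait ShimFileScanRDD {
  def newFileScanRDD(
      sparkSession: SparkSession,
      readFunction: PartitionedFile => Iterator[InternalRow],
      filePartitions: Seq[FilePartition],
      readSchema: StructType): FileScanRDD =
    new FileScanRDD(sparkSession, readFunction, filePartitions, readSchema)
}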

Member:

Yeah, that is why I asked about "remove after dropping Spark 3.4 support ...". Shouldn't it be Spark 3.3/Spark 3.4?

huaxingao (Contributor, Author):

OK, let me rewrite this to make it clearer.

andygrove merged commit d584229 into apache:main on Jun 18, 2024.
41 checks passed.
huaxingao (Contributor, Author):

Thanks, everyone!

huaxingao deleted the drop_3.2 branch on June 18, 2024 at 23:13.
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request on Sep 7, 2024:
* build: Drop Spark 3.2 support

* remove un-used import

* fix BloomFilterMightContain

* revert the changes for TimestampNTZType and PartitionIdPassthrough

* address comments and remove more 3.2 related code

* remove un-used import

* put back newDataSourceRDD

* remove un-used import and put back lazy val partitions

* address comments

* Trigger Build

* remove the missed 3.2 pipeline

* address comments
Successfully merging this pull request may close these issues: Dropping Spark 3.2 support.

5 participants