[SPARK-52339][SQL][3.5] Fix comparison of `InMemoryFileIndex` instances #51256

bersprockets · 2025-06-24T01:16:31Z

What changes were proposed in this pull request?

This is a back-port of #51043.

This PR changes InMemoryFileIndex#equals to compare a non-distinct collection of root paths rather than a distinct set of root paths. Without this change, InMemoryFileIndex#equals considers the following two collections of root paths to be equal, even though they represent a different number of rows:

["/tmp/test", "/tmp/test"]
["/tmp/test", "/tmp/test", "/tmp/test"]

Why are the changes needed?

The bug can cause correctness issues, e.g.

// create test data
val data = Seq((1, 2), (2, 3)).toDF("a", "b")
data.write.mode("overwrite").csv("/tmp/test")

val fileList1 = List.fill(2)("/tmp/test")
val fileList2 = List.fill(3)("/tmp/test")

val df1 = spark.read.schema("a int, b int").csv(fileList1: _*)
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // correctly returns 6

// the following is the same as above, except df1 is persisted
val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist
val df2 = spark.read.schema("a int, b int").csv(fileList2: _*)

df1.count() // correctly returns 4
df2.count() // incorrectly returns 4!!

In the above example, df1 and df2 were created with a different number of paths: df1 has 2, and df2 has 3. But since the distinct set of root paths is the same (e.g., Set("/tmp/test") == Set("/tmp/test")), the two dataframes are considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan.

The same bug also causes inappropriate exchange reuse.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

Was this patch authored or co-authored using generative AI tooling?

No.

### What changes were proposed in this pull request? This PR changes `InMemoryFileIndex#equals` to compare a non-distinct collection of root paths rather than a distinct set of root paths. Without this change, `InMemoryFileIndex#equals` considers the following two collections of root paths to be equal, even though they represent a different number of rows: ``` ["/tmp/test", "/tmp/test"] ["/tmp/test", "/tmp/test", "/tmp/test"] ``` ### Why are the changes needed? The bug can cause correctness issues, e.g. ``` // create test data val data = Seq((1, 2), (2, 3)).toDF("a", "b") data.write.mode("overwrite").csv("/tmp/test") val fileList1 = List.fill(2)("/tmp/test") val fileList2 = List.fill(3)("/tmp/test") val df1 = spark.read.schema("a int, b int").csv(fileList1: _*) val df2 = spark.read.schema("a int, b int").csv(fileList2: _*) df1.count() // correctly returns 4 df2.count() // correctly returns 6 // the following is the same as above, except df1 is persisted val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist val df2 = spark.read.schema("a int, b int").csv(fileList2: _*) df1.count() // correctly returns 4 df2.count() // incorrectly returns 4!! ``` In the above example, df1 and df2 were created with a different number of paths: df1 has 2, and df2 has 3. But since the distinct set of root paths is the same (e.g., `Set("/tmp/test") == Set("/tmp/test"))`, the two dataframes are considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan. The same bug also causes inappropriate exchange reuse. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#51043 from bersprockets/multi_path_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

bersprockets · 2025-06-24T01:17:33Z

This was a clean cherry pick from the master branch, so I didn't customize the title or the description to mention the 3.5 branch.

bersprockets · 2025-06-24T04:06:33Z

The description has a little cruft at the bottom... let me fix...

### What changes were proposed in this pull request? This is a back-port of #51043. This PR changes `InMemoryFileIndex#equals` to compare a non-distinct collection of root paths rather than a distinct set of root paths. Without this change, `InMemoryFileIndex#equals` considers the following two collections of root paths to be equal, even though they represent a different number of rows: ``` ["/tmp/test", "/tmp/test"] ["/tmp/test", "/tmp/test", "/tmp/test"] ``` ### Why are the changes needed? The bug can cause correctness issues, e.g. ``` // create test data val data = Seq((1, 2), (2, 3)).toDF("a", "b") data.write.mode("overwrite").csv("/tmp/test") val fileList1 = List.fill(2)("/tmp/test") val fileList2 = List.fill(3)("/tmp/test") val df1 = spark.read.schema("a int, b int").csv(fileList1: _*) val df2 = spark.read.schema("a int, b int").csv(fileList2: _*) df1.count() // correctly returns 4 df2.count() // correctly returns 6 // the following is the same as above, except df1 is persisted val df1 = spark.read.schema("a int, b int").csv(fileList1: _*).persist val df2 = spark.read.schema("a int, b int").csv(fileList2: _*) df1.count() // correctly returns 4 df2.count() // incorrectly returns 4!! ``` In the above example, df1 and df2 were created with a different number of paths: df1 has 2, and df2 has 3. But since the distinct set of root paths is the same (e.g., `Set("/tmp/test") == Set("/tmp/test"))`, the two dataframes are considered equal. Thus, when df1 is persisted, df2 uses df1's cached plan. The same bug also causes inappropriate exchange reuse. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #51256 from bersprockets/multi_path_issue_br35. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Kent Yao <yao@apache.org>

yaooqinn · 2025-06-24T06:04:45Z

Merged into branch 3.5, thank you @bersprockets

github-actions bot added the SQL label Jun 24, 2025

HyukjinKwon changed the title ~~[SPARK-52339][SQL] Fix comparison of InMemoryFileIndex instances~~ [SPARK-52339][SQL][3.5] Fix comparison of InMemoryFileIndex instances Jun 24, 2025

yaooqinn approved these changes Jun 24, 2025

View reviewed changes

yaooqinn closed this Jun 24, 2025

bersprockets deleted the multi_path_issue_br35 branch July 7, 2025 20:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-52339][SQL][3.5] Fix comparison of `InMemoryFileIndex` instances #51256

[SPARK-52339][SQL][3.5] Fix comparison of `InMemoryFileIndex` instances #51256

Uh oh!

bersprockets commented Jun 24, 2025 •

edited

Loading

Uh oh!

bersprockets commented Jun 24, 2025

Uh oh!

bersprockets commented Jun 24, 2025

Uh oh!

yaooqinn commented Jun 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[SPARK-52339][SQL][3.5] Fix comparison of InMemoryFileIndex instances #51256

[SPARK-52339][SQL][3.5] Fix comparison of InMemoryFileIndex instances #51256

Uh oh!

Conversation

bersprockets commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

bersprockets commented Jun 24, 2025

Uh oh!

bersprockets commented Jun 24, 2025

Uh oh!

yaooqinn commented Jun 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[SPARK-52339][SQL][3.5] Fix comparison of `InMemoryFileIndex` instances #51256

[SPARK-52339][SQL][3.5] Fix comparison of `InMemoryFileIndex` instances #51256

bersprockets commented Jun 24, 2025 •

edited

Loading