[SPARK-39910][SQL] Delegate path qualification to filesystem during DataSource file path globbing #43463
Conversation
val jsonRelativeHarPath = new Path("/test.json")
val parquetRelativeHarPath = new Path("/test.parquet")
val orcRelativeHarPath = new Path("/test.orc")
val globeRelativeHarPath = new Path("/test.*")
Do we really need to test all the file formats?
They're needed to test the glob path. I decided to reuse the har archive from the `DataFrameReaderWriterSuite` tests instead of creating a new one.
@tigrulya-exe Please re-trigger GA tests.
Please fix the conflicts.
@beliefer Hi! I fixed the conflicts and rebased on master.
cc @cloud-fan
@cloud-fan Hi! Could you take a look please?
@cloud-fan Hi! I've rebased on master and fixed conflicts. Could you please take a look?
The fix is straightforward but the test is convoluted. How do you test
@cloud-fan we construct absolute file paths with We don't need to create or test the
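The globbing being exercised here can be illustrated without Hadoop. Below is a minimal, hypothetical Scala sketch — NOT Hadoop's actual `GlobFilter` implementation, and the `GlobSketch` name is invented — of how a pattern like the test's `/test.*` matches archive entries during path globbing:

```scala
// Hypothetical sketch of glob matching, for illustration only -- this is
// NOT Hadoop's GlobFilter; real globbing also handles {a,b}, [ranges], etc.
object GlobSketch {
  // Translate a simple glob (supporting * and ?) into a regular expression.
  def globToRegex(glob: String): String =
    glob.flatMap {
      case '*' => "[^/]*"                                 // * matches within one path segment
      case '?' => "[^/]"                                  // ? matches a single character
      case c if ".[]{}()+-^$|\\".contains(c) => "\\" + c  // escape regex metacharacters
      case c => c.toString
    }

  def matches(glob: String, path: String): Boolean =
    path.matches(globToRegex(glob))
}

object GlobSketchDemo {
  def main(args: Array[String]): Unit = {
    // A pattern like the test's "/test.*" matches all the archived files:
    println(GlobSketch.matches("/test.*", "/test.json"))    // true
    println(GlobSketch.matches("/test.*", "/test.parquet")) // true
    println(GlobSketch.matches("/test.*", "/other.csv"))    // false
  }
}
```

This is only meant to show why a single `/test.*` pattern covers all the per-format paths declared above.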
test("SPARK-39910: read files from Hadoop archives") {
  val fileSchema = new StructType().add("str", StringType)
  val harPath = testFile("test-data/test-archive.har")
    .replaceFirst("file:/", "har:/")
So Spark works with `har:/` paths out of the box? BTW, I think this test is good enough, we don't need to add more tests in `DataSourceSuite`.
Yes, the `HarFileSystem` support is included in the HDFS client by default. Ok, removed the tests from `DataSourceSuite`, left only the `MockFileSystem#getUri` method to correctly qualify paths with the `mockFs://` scheme.
val harPath = testFile("test-data/test-archive.har")
  .replaceFirst("file:/", "har:/")

testRead(spark.read.textFile(s"$harPath/test.txt").toDF(), data, textSchema)
Since we only want to test path globbing, I think testing one file format is sufficient.
Ok, removed file formats other than `csv`.
a005f65
to
8161581
Compare
@@ -214,4 +216,6 @@ class MockFileSystem extends RawLocalFileSystem {
  override def globStatus(pathPattern: Path): Array[FileStatus] = {
    mockGlobResults.getOrElse(pathPattern, Array())
  }

  override def getUri: URI = URI.create("mockFs://mockFs/")
is this change needed?
Yes, if we don't override this method, then the path check inside `fs.makeQualified(path)` will fail, because it expects a path with the `file://` scheme (`MockFileSystem` inherits from `RawLocalFileSystem`).
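To make the scheme mismatch concrete, here is a self-contained Scala sketch of what qualification conceptually does: resolve a possibly-relative path against the filesystem's default URI (scheme + authority) and working directory. It uses only `java.net.URI`, is NOT Hadoop's actual `FileSystem#makeQualified` (which also normalizes and validates paths), and the names `QualifySketch`, `qualify`, `defaultFsUri`, and `workingDir` are invented for this illustration:

```scala
import java.net.URI

// Conceptual sketch of path qualification -- NOT Hadoop's actual
// FileSystem#makeQualified; names here are hypothetical.
object QualifySketch {
  // Resolve `path` against the filesystem's default URI (scheme + authority)
  // and working directory, producing a fully qualified URI.
  def qualify(path: String, defaultFsUri: URI, workingDir: String): URI = {
    val absolute = if (path.startsWith("/")) path else s"$workingDir/$path"
    new URI(defaultFsUri.getScheme, defaultFsUri.getAuthority, absolute, null, null)
  }

  def main(args: Array[String]): Unit = {
    // A filesystem reporting this default URI (like the overridden getUri
    // above) yields mockFs://-qualified paths rather than file:// ones.
    val mockFsUri = URI.create("mockFs://mockFs/")
    println(qualify("/test.csv", mockFsUri, "/user/spark"))
    println(qualify("data/test.csv", mockFsUri, "/user/spark"))
  }
}
```

This shows why the overridden `getUri` matters: the scheme and authority of the qualified result come from the filesystem's reported URI, so without the override the qualification check would be performed against `file://`.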
thanks, merging to master/3.5!
Closes #43463 from tigrulya-exe/SPARK-39910-use-fs-path-qualification.
Authored-by: Tigran Manasyan &lt;t.manasyan@arenadata.io&gt;
Signed-off-by: Wenchen Fan &lt;wenchen@databricks.com&gt;
(cherry picked from commit b7edc5f)
Signed-off-by: Wenchen Fan &lt;wenchen@databricks.com&gt;
What changes were proposed in this pull request?
In the current version, `DataSource#checkAndGlobPathIfNecessary` qualifies paths via `Path#makeQualified`, while `PartitioningAwareFileIndex` qualifies them via `FileSystem#makeQualified`. Most `FileSystem` implementations simply delegate to `Path#makeQualified`, but others, like `HarFileSystem`, contain fs-specific logic that can produce a different result. Such inconsistencies can lead to a situation where Spark can't find the partitions of a source file, because the qualified paths built by `Path` and `FileSystem` differ. Therefore, for uniformity, the `FileSystem` path qualification should be used in `DataSource#checkAndGlobPathIfNecessary`.
Why are the changes needed?
Allow users to read files from Hadoop archives (.har) using the DataFrameReader API.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests were added in `DataSourceSuite` and `DataFrameReaderWriterSuite`.
Was this patch authored or co-authored using generative AI tooling?
No