[SPARK-39910][SQL] Delegate path qualification to filesystem during DataSource file path globbing #43463

Closed
1 change: 1 addition & 0 deletions dev/.rat-excludes
@@ -138,3 +138,4 @@ people.xml
 ui-test/package.json
 ui-test/package-lock.json
 core/src/main/resources/org/apache/spark/ui/static/package.json
+.*\.har
@@ -760,7 +760,7 @@ object DataSource extends Logging {
     val qualifiedPaths = pathStrings.map { pathString =>
       val path = new Path(pathString)
       val fs = path.getFileSystem(hadoopConf)
-      path.makeQualified(fs.getUri, fs.getWorkingDirectory)
+      fs.makeQualified(path)
     }

     // Split the paths into glob and non glob paths, because we don't need to do an existence check
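The point of this one-line change is dispatch: calling fs.makeQualified(path) lets a FileSystem subclass (such as HarFileSystem) apply its own qualification rules, instead of hard-coding the base behavior of filling in the filesystem's URI and working directory. An illustrative Python sketch, not Spark or Hadoop code (the class names and rules here are invented to model the idea):

```python
# Sketch: why delegating qualification to the filesystem object matters.
# BaseFs models the default behavior (fill in a missing scheme from the
# filesystem URI); ArchiveFs models a subclass that could override it.
from urllib.parse import urlparse

class BaseFs:
    uri = "file:///"

    def make_qualified(self, path: str) -> str:
        # Already qualified paths are returned unchanged.
        if urlparse(path).scheme:
            return path
        # Otherwise, borrow the scheme from the filesystem's own URI.
        scheme = urlparse(self.uri).scheme
        return f"{scheme}://{path}"

class ArchiveFs(BaseFs):
    # A subclass can report a different URI (or override make_qualified
    # entirely); callers that dispatch through the instance pick this up
    # automatically, callers that hard-code the base logic do not.
    uri = "har:///"

print(BaseFs().make_qualified("/data/x.csv"))     # file:///data/x.csv
print(ArchiveFs().make_qualified("/data/x.csv"))  # har:///data/x.csv
print(BaseFs().make_qualified("s3://bucket/x"))   # s3://bucket/x (unchanged)
```

Calling make_qualified on the instance mirrors fs.makeQualified(path) in the diff; computing the qualification inline from fs.getUri and fs.getWorkingDirectory mirrors the removed line.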
2 changes: 2 additions & 0 deletions sql/core/src/test/resources/test-data/test-archive.har/_index
@@ -0,0 +1,2 @@
+%2F dir 1707380620211+493+tigrulya+hadoop 0 0 test.csv
+%2Ftest.csv file part-0 0 6 1707380620197+420+tigrulya+hadoop
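The _index entries above store percent-encoded entry paths (%2F is "/"). A small Python sketch decoding the first two fields of the sample lines; the interpretation of the remaining fields is an assumption based on these samples, not on the HAR specification:

```python
# Decode the percent-encoded paths in the sample har _index lines above.
from urllib.parse import unquote

index_lines = [
    "%2F dir 1707380620211+493+tigrulya+hadoop 0 0 test.csv",
    "%2Ftest.csv file part-0 0 6 1707380620197+420+tigrulya+hadoop",
]

for line in index_lines:
    encoded_path, kind = line.split(" ")[:2]
    # e.g. "%2Ftest.csv" decodes to "/test.csv", marked as a "file" entry
    print(unquote(encoded_path), kind)
```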
@@ -0,0 +1,2 @@
+3
+0 1948547033 0 119
3 changes: 3 additions & 0 deletions sql/core/src/test/resources/test-data/test-archive.har/part-0
@@ -0,0 +1,3 @@
+1
+2
+3
@@ -17,6 +17,8 @@

 package org.apache.spark.sql.execution.datasources

+import java.net.URI
+
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{FileStatus, Path, RawLocalFileSystem}
 import org.scalatest.PrivateMethodTester
@@ -214,4 +216,6 @@ class MockFileSystem extends RawLocalFileSystem {
   override def globStatus(pathPattern: Path): Array[FileStatus] = {
     mockGlobResults.getOrElse(pathPattern, Array())
   }
+
+  override def getUri: URI = URI.create("mockFs://mockFs/")
Contributor: Is this change needed?
Contributor Author: Yes. If we don't override this method, the path check inside fs.makeQualified(path) will fail, because it expects a path with the file:// scheme (MockFileSystem inherits from RawLocalFileSystem).

}
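The failure the author describes comes from the filesystem checking that a path's scheme matches its own URI. A loose Python model of that check (this is not Hadoop's actual checkPath code, just an illustration of the mismatch):

```python
# Sketch of a scheme check, modeled loosely on the behavior described above:
# qualification fails when the path's scheme differs from the filesystem's
# URI scheme, which is why the mock must report mockFs:// instead of the
# inherited file:// URI.
from urllib.parse import urlparse

def check_path(fs_uri: str, path: str) -> None:
    fs_scheme = urlparse(fs_uri).scheme
    path_scheme = urlparse(path).scheme
    if path_scheme and path_scheme.lower() != fs_scheme.lower():
        raise ValueError(f"Wrong FS: {path}, expected: {fs_uri}")

check_path("file:///", "file:///tmp/data")           # ok
check_path("mockFs://mockFs/", "mockFs://mockFs/a")  # ok with getUri overridden
try:
    check_path("file:///", "mockFs://mockFs/a")      # the failing combination
except ValueError as e:
    print(e)
```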
@@ -1363,4 +1363,12 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSparkSession with
       }
     }
   }
+
+  test("SPARK-39910: read files from Hadoop archives") {
+    val fileSchema = new StructType().add("str", StringType)
+    val harPath = testFile("test-data/test-archive.har")
+      .replaceFirst("file:/", "har:/")
Contributor: So Spark works with har:/ paths out of the box? BTW, I think this test is good enough; we don't need to add more tests in DataSourceSuite.

Contributor Author: Yes, HarFileSystem support is included in the HDFS client by default. OK, I removed the tests from DataSourceSuite and left only the MockFileSystem#getUri override to correctly qualify paths with the mockFs:// scheme.


+    testRead(spark.read.schema(fileSchema).csv(s"$harPath/test.csv"), data, fileSchema)
+  }
 }
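The test rewrites the scheme of the resolved test-file URI so that the same local location is read through Hadoop's har:// filesystem. The rewrite itself can be sketched in Python (the concrete path below is a made-up example, not the path the suite actually resolves):

```python
# Sketch of the scheme rewrite the test performs: testFile(...) resolves to a
# file:/ URI, and replacing only the first occurrence of the scheme points the
# same location at the HarFileSystem. The path here is a hypothetical example.
local = "file:/tmp/spark/test-data/test-archive.har"
har = local.replace("file:/", "har:/", 1)
print(har)  # har:/tmp/spark/test-data/test-archive.har
```

With that path, the suite reads an individual archived file as s"$harPath/test.csv", i.e. a file inside the .har archive addressed through the har:/ scheme.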