[SPARK-32815][ML][3.0] Fix LibSVM data source loading error on file paths with glob metacharacters

### What changes were proposed in this pull request?
In the PR, I propose to fix an issue in the LibSVM data source that occurs when both of the following are true:
* no user-specified schema
* some file paths contain escaped glob metacharacters, such as `[`, `]`, `{`, `}`, `*`, etc.

The fix is a backport of #29670, and it is based on another bug fix for CSV/JSON datasources #29659.

### Why are the changes needed?
To fix the issue where the following query tries to read from the path `[abc]`:
```scala
spark.read.format("libsvm").load("""/tmp/\[abc\].csv""").show
```
but would end up hitting an exception:
```
Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
org.apache.spark.sql.AnalysisException: Path does not exist: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-6ef0ae5e-ff9f-4c4f-9ff4-0db3ee1f6a82/[abc]/part-00000-26406ab9-4e56-45fd-a25a-491c18a05e76-c000.libsvm;
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$3(DataSource.scala:770)
	at org.apache.spark.util.ThreadUtils$.$anonfun$parmap$2(ThreadUtils.scala:373)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
	at scala.util.Success.$anonfun$map$1(Try.scala:255)
	at scala.util.Success.map(Try.scala:213)
```
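
The escaped form in the query above matters because `load()` interprets its path argument as a glob pattern, so literal metacharacters must be backslash-escaped. As a minimal, hypothetical sketch (this helper is illustrative and not part of Spark's API), such escaping could look like:

```scala
// Hypothetical helper (not Spark code): backslash-escape glob metacharacters
// so a literal path like "/tmp/[abc].csv" is not treated as a pattern.
object GlobEscape {
  private val metachars: Set[Char] = Set('\\', '{', '}', '[', ']', '*', '?')

  def escape(path: String): String =
    path.flatMap { c =>
      if (metachars.contains(c)) Seq('\\', c) else Seq(c)
    }
}

println(GlobEscape.escape("/tmp/[abc].csv")) // prints /tmp/\[abc\].csv
```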

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `LibSVMRelationSuite`.

Closes #29675 from MaxGekk/globbing-paths-when-inferring-schema-ml-3.0.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
MaxGekk authored and cloud-fan committed Sep 8, 2020
1 parent 9b39e4b commit 8c0b9cb
Showing 3 changed files with 23 additions and 2 deletions.
```diff
@@ -100,7 +100,7 @@ private[libsvm] class LibSVMFileFormat
       "though the input. If you know the number in advance, please specify it via " +
       "'numFeatures' option to avoid the extra scan.")

-    val paths = files.map(_.getPath.toUri.toString)
+    val paths = files.map(_.getPath.toString)
     val parsed = MLUtils.parseLibSVMFile(sparkSession, paths)
     MLUtils.computeNumFeatures(parsed)
   }
```
```diff
@@ -110,7 +110,8 @@ object MLUtils extends Logging {
       DataSource.apply(
         sparkSession,
         paths = paths,
-        className = classOf[TextFileFormat].getName
+        className = classOf[TextFileFormat].getName,
+        options = Map(DataSource.GLOB_PATHS_KEY -> "false")
       ).resolveRelation(checkFilesExist = false))
       .select("value")
```
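
The added `GLOB_PATHS_KEY -> "false"` option is the crux of the fix: by the time `MLUtils` re-creates a `DataSource` for schema inference, the file paths are already fully resolved, so re-expanding them as glob patterns would break names containing metacharacters. A simplified sketch of that pattern (not Spark's actual resolution code; `resolvePaths` and the `"globPaths"` key here are illustrative assumptions):

```scala
// Simplified sketch (not Spark's actual code): when glob expansion is
// disabled via an option, already-resolved paths pass through verbatim
// instead of being re-interpreted as glob patterns.
def resolvePaths(
    paths: Seq[String],
    options: Map[String, String],
    expandGlob: String => Seq[String]): Seq[String] = {
  val globEnabled = options.getOrElse("globPaths", "true").toBoolean
  if (globEnabled) paths.flatMap(expandGlob) else paths
}

// "[abc]" survives as a literal directory name when globbing is off,
// even though globbing it would match nothing here:
val literal = resolvePaths(
  Seq("/tmp/[abc]"),
  Map("globPaths" -> "false"),
  _ => Seq.empty)
println(literal) // prints List(/tmp/[abc])
```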
```diff
@@ -191,4 +191,24 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext {
       spark.sql("DROP TABLE IF EXISTS libsvmTable")
     }
   }
+
+  test("SPARK-32815: Test LibSVM data source on file paths with glob metacharacters") {
+    withTempDir { dir =>
+      val basePath = dir.getCanonicalPath
+      // test libsvm writer / reader without specifying schema
+      val svmFileName = "[abc]"
+      val escapedSvmFileName = "\\[abc\\]"
+      val rawData = new java.util.ArrayList[Row]()
+      rawData.add(Row(1.0, Vectors.sparse(2, Seq((0, 2.0), (1, 3.0)))))
+      val struct = new StructType()
+        .add("labelFoo", DoubleType, false)
+        .add("featuresBar", VectorType, false)
+      val df = spark.createDataFrame(rawData, struct)
+      df.write.format("libsvm").save(s"$basePath/$svmFileName")
+      val df2 = spark.read.format("libsvm").load(s"$basePath/$escapedSvmFileName")
+      val row1 = df2.first()
+      val v = row1.getAs[SparseVector](1)
+      assert(v == Vectors.sparse(2, Seq((0, 2.0), (1, 3.0))))
+    }
+  }
 }
```