Skip to content

Commit

Permalink
[SPARK-35045][SQL] Add an internal option to control input buffer in …
Browse files Browse the repository at this point in the history
…univocity

### What changes were proposed in this pull request?

This PR makes the input buffer configurable (as an internal option). This is mainly to work around uniVocity/univocity-parsers#449.

### Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

### Does this PR introduce _any_ user-facing change?

No, it's only internal option.

### How was this patch tested?

Manually tested by modifying the unittest added in #31858 as below:

```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79619d..b58f0bd3661 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
 -2460,6 +2460,7  abstract class CSVSuite
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
         .option("delimiter", "|")
+        .option("inputBufferSize", "128")
         .option("ignoreTrailingWhiteSpace", "true").load(path.getAbsolutePath).count() == 1)
     }
   }
```

Closes #32145 from HyukjinKwon/SPARK-35045.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 1f56215)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
  • Loading branch information
HyukjinKwon authored and MaxGekk committed Apr 13, 2021
1 parent c71126a commit bd03050
Showing 1 changed file with 3 additions and 0 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -211,6 +211,8 @@ class CSVOptions(
}
val lineSeparatorInWrite: Option[String] = lineSeparator

val inputBufferSize: Option[Int] = parameters.get("inputBufferSize").map(_.toInt)

/**
* The handling method to be used when unescaped quotes are found in the input.
*/
Expand Down Expand Up @@ -257,6 +259,7 @@ class CSVOptions(
settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpaceInRead)
settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpaceInRead)
settings.setReadInputOnSeparateThread(false)
inputBufferSize.foreach(settings.setInputBufferSize)
settings.setMaxColumns(maxColumns)
settings.setNullValue(nullValue)
settings.setEmptyValue(emptyValueInRead)
Expand Down

0 comments on commit bd03050

Please sign in to comment.