[SPARK-35045][SQL] Add an internal option to control input buffer in univocity #32145

HyukjinKwon · 2021-04-13T06:43:59Z

What changes were proposed in this pull request?

This PR makes the input buffer configurable (as an internal option). This is mainly to work around uniVocity/univocity-parsers#449.
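
As a rough illustration of what "configurable input buffer" means here, the sketch below reads an optional `inputBufferSize` entry from a map of reader options and applies it only when it is present. It is not the code from this PR: `ParserSettings`, `InputBufferOption`, and the default value are hypothetical stand-ins made up for this example, and only the option name `inputBufferSize` comes from the test diff in this description.

```scala
// Hypothetical stand-in for a parser settings object (NOT the univocity class).
final class ParserSettings {
  // Assumed default, chosen only for illustration.
  private var inputBufferSize: Int = 1024 * 1024

  def setInputBufferSize(size: Int): Unit = {
    require(size > 0, s"inputBufferSize must be positive, got $size")
    inputBufferSize = size
  }

  def getInputBufferSize: Int = inputBufferSize
}

// Hypothetical helper showing the "optional override" pattern.
object InputBufferOption {
  // Returns None when the option is absent, meaning "keep the parser default".
  def parse(options: Map[String, String]): Option[Int] =
    options.get("inputBufferSize").map(_.trim.toInt)

  // Builds settings and overrides the buffer size only if the option was given.
  def configure(options: Map[String, String]): ParserSettings = {
    val settings = new ParserSettings
    parse(options).foreach(settings.setInputBufferSize)
    settings
  }
}
```

With this shape, `InputBufferOption.configure(Map("inputBufferSize" -> "128"))` yields settings with a 128-character buffer, while an empty map leaves the default untouched, which is why the option can stay internal and undocumented.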

Why are the changes needed?

To work around uniVocity/univocity-parsers#449.

Does this PR introduce any user-facing change?

No, it is only an internal option.

How was this patch tested?

Manually tested by modifying the unit test added in #31858 as below:

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index fd25a79619d..b58f0bd3661 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -2460,6 +2460,7 @@ abstract class CSVSuite
       Seq(line).toDF.write.text(path.getAbsolutePath)
       assert(spark.read.format("csv")
         .option("delimiter", "|")
+        .option("inputBufferSize", "128")
         .option("ignoreTrailingWhiteSpace", "true").load(path.getAbsolutePath).count() == 1)
     }
   }

SparkQA · 2021-04-13T08:18:37Z

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41860/

SparkQA · 2021-04-13T08:42:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41861/

SparkQA · 2021-04-13T08:48:40Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41861/

SparkQA · 2021-04-13T11:56:19Z

Test build #137281 has finished for PR 32145 at commit 2420be1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2021-04-13T12:07:31Z

+1, LGTM. Merging to master/3.1/3.0.
Thank you @HyukjinKwon.

…univocity

Closes #32145 from HyukjinKwon/SPARK-35045.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 1f56215)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

SparkQA · 2021-04-13T13:15:56Z

Test build #137282 has finished for PR 32145 at commit f1f92fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-04-14T00:57:01Z

Thx Max.

Add an internal option to control input buffer in univocity

2420be1

HyukjinKwon requested a review from MaxGekk April 13, 2021 06:44

MaxGekk approved these changes Apr 13, 2021

HyukjinKwon added 3 commits April 13, 2021 15:55

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/…

8ef9121

…CSVOptions.scala

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/…

0ac48e6

…CSVOptions.scala

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/…

f1f92fb

…CSVOptions.scala

MaxGekk closed this in 1f56215 Apr 13, 2021

github-actions bot added the SQL label Apr 13, 2021

HyukjinKwon deleted the SPARK-35045 branch January 4, 2022 00:54
