[SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode #44872
Conversation
cc @cloud-fan

do we have a test to show the data correctness issue?
```scala
.option("header", "true")
.option("escape", "\"")
.csv(path.getCanonicalPath)
assert(df.count() === 5)
```
`count()` returns 4 without the changes.
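To illustrate why `multiLine` input is tricky, here is a minimal sketch (with hypothetical data, not the test's actual fixture): a quoted field containing an embedded newline makes the physical line count disagree with the logical record count, which is exactly the situation where column pruning in the uniVocity parser could drop a record.

```python
import csv
import io

# Hypothetical CSV: 5 data rows, one of which has a quoted,
# multi-line field. Physical lines != logical records.
data = (
    'id,comment\n'
    '1,"first"\n'
    '2,"spans\ntwo lines"\n'
    '3,"third"\n'
    '4,"fourth"\n'
    '5,"fifth"\n'
)

# A quote-aware parser sees the header plus 5 records ...
records = list(csv.reader(io.StringIO(data)))
print(len(records) - 1)         # 5 data rows

# ... while naive newline splitting over-counts, because the
# embedded newline in row 2 looks like a record boundary.
physical_lines = data.rstrip('\n').split('\n')
print(len(physical_lines) - 1)  # 6 "rows"
```

Correctly counting records in this mode requires quote-aware scanning of the whole file, which is why pruning away columns mid-parse is fragile there.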
@cloud-fan I added a test for the issue.

@cloud-fan @HyukjinKwon FYI, since 3.5.x and 3.4.x suffer from the same issue, I am going to backport this to those branches. Thanks for the review.

Merging to master/3.5/3.4. Thank you, @HyukjinKwon @cloud-fan, for the review.
### What changes were proposed in this pull request?

In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode.

### Why are the changes needed?

To work around the issue in the `uniVocity` parser used by the CSV datasource: uniVocity/univocity-parsers#529

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the affected test suites:

```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44872 from MaxGekk/csv-disable-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 829e742)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
```diff
@@ -58,7 +58,7 @@ case class CSVPartitionReaderFactory(
       actualReadDataSchema,
       options,
       filters)
-    val schema = if (options.columnPruning) actualReadDataSchema else actualDataSchema
+    val schema = if (options.isColumnPruningEnabled) actualReadDataSchema else actualDataSchema
```
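The definition of `isColumnPruningEnabled` is not shown in this hunk. Based on the PR's stated goal, it presumably gates the existing `columnPruning` setting on `multiLine` being off; the following is a hedged sketch of that logic (the Python function name and signature are illustrative, not Spark's API):

```python
def is_column_pruning_enabled(column_pruning: bool, multi_line: bool) -> bool:
    # Sketch: column pruning applies only when multiLine mode is off,
    # working around the uniVocity parser issue in multi-line parsing.
    return column_pruning and not multi_line

# Default single-line mode: pruning stays on.
print(is_column_pruning_enabled(True, False))  # True
# multiLine mode: pruning is forced off regardless of the setting.
print(is_column_pruning_enabled(True, True))   # False
```

The point of routing all call sites through one predicate is that the multiLine exception lives in exactly one place instead of being re-derived at each use of the raw `columnPruning` flag.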
From `spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala` at commit 829e742, line 103:

```scala
val columnPruning = sparkSession.sessionState.conf.csvColumnPruning
```

and line 128:

```scala
val schema = if (columnPruning) actualRequiredSchema else actualDataSchema
```
@MaxGekk Should the check in `CSVFileFormat` be changed to `parsedOptions.isColumnPruningEnabled` too?
The `schema` is used only in `CSVHeaderChecker`, which is supposed to check the column names in the CSV against the provided schema fields. It shouldn't depend on the column pruning feature at all, from my point of view.
```scala
private def checkHeaderColumnNames(columnNames: Array[String]): Unit = {
  ...
  if (headerLen == schemaSize) {
    ...
  } else {
    errorMessage = Some(
      s"""|Number of column in CSV header is not equal to number of fields in the schema:
          | Header length: $headerLen, schema size: $schemaSize
          |$source""".stripMargin)
  }
```
`schemaSize` must be the size of the full data schema of the CSV files, not of the required (pruned) schema.
Let me re-think it and avoid the dependency on column pruning in `CSVHeaderChecker`.
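The shape of that check can be sketched as follows — a simplified, hypothetical Python rendering of the Scala branch above, showing why handing the checker a pruned required schema (size 0, as happens for `count()`) makes the length check fail spuriously:

```python
def check_header_column_names(column_names, schema_size, source):
    """Simplified sketch of the CSVHeaderChecker length check:
    the header must be compared with the FULL data schema, so
    passing a pruned (possibly empty) required schema raises."""
    header_len = len(column_names)
    if header_len != schema_size:
        raise ValueError(
            'Number of column in CSV header is not equal to '
            'number of fields in the schema:\n'
            f' Header length: {header_len}, schema size: {schema_size}\n'
            f'{source}'
        )

header = ['a', 'b', 'c', 'd']

# Checked against the full data schema (size 4): passes.
check_header_column_names(header, 4, 'file:///tmp/data.csv')

# Checked against a pruned schema (size 0, e.g. count() with
# enforceSchema=false): fails, which is the follow-up's bug.
try:
    check_header_column_names(header, 0, 'file:///tmp/data.csv')
except ValueError as e:
    print('Header length: 4, schema size: 0' in str(e))  # True
```

The file path, function name, and data here are illustrative; only the length-comparison logic and error-message shape are taken from the snippet above.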
@LuciferYang Actually, you are right. Please review this follow-up PR: #44910
…ing in V1 CSV datasource

### What changes were proposed in this pull request?

In the PR, I propose to invoke `CSVOptions.isColumnPruningEnabled`, introduced by #44872, while matching the CSV header to a schema in the V1 CSV datasource.

### Why are the changes needed?

To fix the failure when column pruning happens and a schema is not enforced:

```scala
scala> spark.read.
     |   option("multiLine", true).
     |   option("header", true).
     |   option("escape", "\"").
     |   option("enforceSchema", false).
     |   csv("/Users/maximgekk/tmp/es-939111-data.csv").
     |   count()
24/01/27 12:43:14 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.IllegalArgumentException: Number of column in CSV header is not equal to number of fields in the schema:
 Header length: 4, schema size: 0
CSV file: file:///Users/maximgekk/tmp/es-939111-data.csv
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By running the affected test suites:

```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44910 from MaxGekk/check-header-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
A bit off-topic for this PR, but is uniVocity even maintained anymore?

Are there any other popular Java libraries for parsing CSV?
There are a bunch of libraries listed here, but I don't have experience with any of them. jackson-dataformats-text looks interesting. I know we already use FasterXML to parse JSON. Perhaps we should use them to parse CSV as well.
I've filed SPARK-47180 to track potentially migrating off of Univocity to something else. |