[SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources #18581
Conversation
Test build #79442 has finished for PR 18581 at commit
Test build #79441 has finished for PR 18581 at commit
Test build #79440 has finished for PR 18581 at commit
retest this please
Let me leave this as a WIP. I will be back after talking with the Univocity author.
I guess we need a few more tests about CSV and related changes. Let me leave this out of the documentation here for now. A few more test cases should cover writing out with a custom line separator, and reading in values without quotes with a custom line separator, which currently do not work as expected. I can make a follow-up later if we are all fine with leaving this documentation out.
cc @cloud-fan, could you take a look please when you have some time?
Let me cc @gatorsmile and @maropu, who I believe are interested in this.
Test build #79446 has finished for PR 18581 at commit
Test build #79448 has finished for PR 18581 at commit
Test build #79455 has finished for PR 18581 at commit
Will clean up soon.
Yea, thanks for pinging me! (Sorry, I didn't notice your JIRA ping.) I'll check and comment on it later.
Test build #79465 has finished for PR 18581 at commit
@@ -41,11 +41,15 @@ private[libsvm] class LibSVMOptions(@transient private val parameters: CaseInsen
      case o => throw new IllegalArgumentException(s"Invalid value `$o` for parameter " +
        s"`$VECTOR_TYPE`. Expected types are `sparse` and `dense`.")
    }

    val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR)
Do we need multiple characters for the separator? Hive assumes a single character in `LINES TERMINATED BY`: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
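For reference, a minimal sketch of Hive's constraint (the table name is hypothetical, shown via `spark.sql` purely for illustration):

// Hive's DDL assumes a single character for the line terminator,
// so a multi-character separator cannot be expressed here.
spark.sql(
  """CREATE TABLE t (c STRING)
    |ROW FORMAT DELIMITED
    |LINES TERMINATED BY '\n'""".stripMargin)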
Yea, actually in the case of Univocity, it allows at most 2 characters:

CsvWriterSettings settings = new CsvWriterSettings();
settings.getFormat().setLineSeparator("aaa");

Exception in thread "main" java.lang.IllegalArgumentException: Invalid line separator. Up to 2 characters are expected. Got 3 characters.
	at com.univocity.parsers.common.Format.setLineSeparator(Format.java:121)
	at com.univocity.parsers.common.Format.setLineSeparator(Format.java:109)

I don't see a reason to restrict this for now. At least, I can provide a use case with Windows: `\r\n`.
Yea, ok. But I can't imagine a use case for more than two characters. I'm okay with following the committers' decision on this. Thanks!
I've seen datasets that have multi-character delimiters of more than 2 characters, specifically `|~|`. So yes, there is a use case, but it's a long-tail one. I'd be happy to see this progress with up to 2 characters and work towards 3+ in a future PR.
Could we put this option in a single place for these formats? I feel putting this option in each format looks a little messy...
Also, if we support only one or two characters, I feel we had better explicitly throw an exception for more than two characters here.
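A minimal sketch of such an explicit check (not from the PR; the option plumbing is illustrative):

val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR).map { sep =>
  // Fail fast with a clear message instead of deferring to the underlying parser.
  require(sep.nonEmpty && sep.length <= 2,
    s"'$LINE_SEPARATOR' must be one or two characters, but got ${sep.length}.")
  sep
}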
`compression` is also there for many datasources. Let me try to open up a discussion about unifying those later.
ok, thanks!
@@ -628,6 +630,12 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
 * spark.read().text("/path/to/spark/README.md")
 * }}}
 *
 * You can set the following text-specific options to deal with text files:
 * <ul>
 * <li>`lineSep` (default is `\r\n` or `\n`): defines the line separator that should
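For illustration, the documented option would be used like this (the path is hypothetical):

// Read a text file whose records are terminated by Windows-style line endings.
val df = spark.read
  .option("lineSep", "\r\n")
  .text("/path/to/file.txt")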
How about explicitly setting `\n` by default, along with the writing case?
I was thinking that way. However, this appears to be the default behaviour from Hadoop's `LineRecordReader`. I am worried about a behaviour change (e.g., I guess Windows users would get affected by this).
Or, how about using a platform-dependent separator when writing? If we keep the existing behaviour, I feel we had at least better document more about it here.
It sounds tricky, but I guess we don't want a platform-dependent one; see SPARK-18076. To my knowledge, `LineRecordReader` should cover both `\r\n` and `\n`. I will try to improve this documentation when addressing the other comments, if we are okay with this.
ok, thanks!
Looks good to support a user-specified separator for CSV. As you suggested in the JIRA, I also feel we had better file a JIRA for the deprecation.
I will definitely open a JIRA to discuss further. Thank you for the heads-up.
python/pyspark/sql/readwriter.py (outdated)
@@ -234,6 +234,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
        default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
    :param multiLine: parse one record, which may span multiple lines, per file. If None is
        set, it uses the default value, ``false``.
    :param lineSep: defines the line separator that should be used for parsing. If None is
        set, it uses the default value, ``\\r\\n`` or ``\\n``.
Seems the default value is always `\n`? https://github.com/apache/spark/pull/18581/files#diff-059fbd7487f6bec7cbd8a57f21f8c8c5R48
It has different values for reading and writing. I tried not to change either. So, the related pointers for the JSON read path are:
https://github.com/apache/spark/pull/18581/files#diff-5ac20b8d75a20117deaa9ba4af814090R142 for data
https://github.com/apache/spark/pull/18581/files#diff-5ac20b8d75a20117deaa9ba4af814090R132 for schema inference.
Test build #79522 has finished for PR 18581 at commit
Test build #81034 has finished for PR 18581 at commit
Test build #81036 has finished for PR 18581 at commit
  }

private[libsvm] object LibSVMOptions {
  val NUM_FEATURES = "numFeatures"
  val VECTOR_TYPE = "vectorType"
  val DENSE_VECTOR_TYPE = "dense"
  val SPARSE_VECTOR_TYPE = "sparse"
  val LINE_SEPARATOR = "lineSep"
Please use the full name.
This name came after `sep` in CSV, which resembled R. Do you prefer `separator` and `lineSeparator`?
Either `lineDelimiter` or `lineSeparator` looks fine. In the future, we could also support field delimiters.
I actually meant `sep` here:

spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala, lines 488 to 489 in 3c0c2d0:

 * <li>`sep` (default `,`): sets the single character as a separator for each
 * field and value.</li>

and was thinking of matching the name.
Just for history: it was `delimiter` but was renamed to `sep`.
@@ -32,7 +32,9 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 * in that file.
 */
 class HadoopFileLinesReader(
-    file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
+    file: PartitionedFile,
+    lineSeparator: Option[String],
This should not be optional.
Could you elaborate why? `LineRecordReader` without this works differently, covering newline variants by default. I don't know which string I should give to the `LineRecordReader` constructor if this is not optional.
Do we not know what the default line separator is?
The problem is `LineRecordReader()` here. I think we could probably do `LineRecordReader(null)` to express both `\r\n` and `\n`, but I am not sure we should use `null` to express these.
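For context, a rough sketch of how the optional separator could reach Hadoop (assuming the `mapreduce` `LineRecordReader`, whose constructor takes the record delimiter as bytes and falls back to default newline handling when given `null`):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader

val lineSeparator: Option[String] = Some("|")
// None becomes null, which makes LineRecordReader accept \n, \r\n and \r as delimiters.
val reader = new LineRecordReader(
  lineSeparator.map(_.getBytes(StandardCharsets.UTF_8)).orNull)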
"When the line delimiter is '\n', any of the following sequences will count as a delimiter: "\n", "\r\n", or "\r"." Could you check whether this is true here?
\n = LF: Multics, Unix and Unix-like systems (Linux, macOS, FreeBSD, AIX, Xenix, etc.), BeOS, Amiga, RISC OS, and others
\r\n = CR+LF: Microsoft Windows, DOS (MS-DOS, PC DOS, etc.), DEC TOPS-10, RT-11, CP/M, MP/M, Atari TOS, OS/2, Symbian OS, Palm OS, Amstrad CPC, and most other early non-Unix and non-IBM operating systems
\r = CR: Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9
It sounds like Hive's behavior is reasonable. For most external users, `\n` means a new line. Thus, it should match any of these three options: LF, CR+LF, and CR. WDYT?
I was thinking `\n` means the first case in your comment above, as it is set explicitly by the user. So I thought that if it is not given, it covers the three newline cases by default. If we use `\n` to deal with the three cases above, wouldn't we be unable to cover the (arguably corner) case of handling only `\n`?
So far, following Hive is the safest. If users complain about it, we can behave differently from Hive by using a new SQLConf.
OK, I will change this, but I should say this approach looks incorrect to me, and this behaviour should be discussed and possibly updated in the near future.
Files.write(path.toPath, lines.getBytes(StandardCharsets.UTF_8))
val df = spark.read
  .option("multiLine", multiLine)
  .option("lineSep", "|")
The test case should be a function. We can pass different separators and verify whether it returns the expected results.
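A minimal sketch of such a parameterized test (helper names like `withTempDir`, `testImplicits` and `checkAnswer` follow Spark's test utilities; the data and separators are illustrative):

import java.io.File

def testLineSeparator(lineSep: String): Unit = {
  test(s"SPARK-21289: read and write with line separator '$lineSep'") {
    withTempDir { dir =>
      import testImplicits._
      val path = new File(dir, "lineSep").getAbsolutePath
      val expected = Seq("a", "b", "c").toDF("value")
      // Write with the custom separator, then read it back and compare.
      expected.write.option("lineSep", lineSep).text(path)
      val readBack = spark.read.option("lineSep", lineSep).text(path)
      checkAnswer(readBack, expected)
    }
  }
}

Seq("|", "^", "\u0001").foreach(testLineSeparator)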
Sure, let me give it a try.
Test build #83168 has finished for PR 18581 at commit
gentle ping @gatorsmile
val path1 = new File(tempDir.getCanonicalPath, "write1")
try {
  // Read
  java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
Why not import this? `java.nio.file.Files`
To differentiate it explicitly from Google's `Files` above. Not a big deal.
val row1 = df.first()
assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
Use `===` instead of `==`.
I'd like to ask why, actually. I had some discussion about this and it ended without a conclusion. The doc says `===` is preferred, but the actual error messages are sometimes even clearer with `==`.
So I prefer to keep it consistent with the others.
OK, but let me use `==`. It seems that's what's used in the test cases of this file.
assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
So here you only test the first line? Why not use `df.collect()` to test every line?
Why not just test the first line?
The following test only checks equality between `df` and `readbackDF`. But it seems we also need to test equality between the whole loaded `df` and the raw file content.
Here we change how each line is handled during iteration. I think comparing either a single line or multiple lines is fine. I think many tests here already test only the first line?
OK, let me update it. It's easy to change anyway.
Test build #84531 has finished for PR 18581 at commit
retest this please
Test build #84538 has finished for PR 18581 at commit
It looks like this line separator has to be handled by each data source individually. Can we start with, e.g., json, and then csv, text, etc.? Then we can have smaller PRs that would be easier to review.
Sure, I will try to separate this and will update my PRs soon, roughly within this week.
Thanks!
I opened #20727 for the text datasource. @cloud-fan, other text-based sources depend on the text datasource in the schema inference path, so I made a fix for the text datasource first. Please check whether that makes sense when you are available.
@HyukjinKwon, is there another PR to handle CSV?
Nope, not yet. I will try to make it within the next release soon.
Was this finished and merged in? I see https://issues.apache.org/jira/browse/SPARK-21289 is still open.
What you see is what you get. It's not yet finished. See also #20877 (comment)
What changes were proposed in this pull request?

This PR proposes to add a `lineSep` option for a configurable line separator in the text-based datasources: LibSVM, JSON, CSV and Text. Note that this PR follows Hive's default behaviour for `\n`: it also covers the other newline variants.

How was this patch tested?

Unit tests in `LibSVMRelationSuite.scala`, `CSVSuite.scala`, `JsonSuite.scala`, `TextSuite.scala` and `python/pyspark/sql/tests.py`.
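For illustration, the option proposed here would be set the same way across the affected readers (the paths are hypothetical):

// The same option name applies to each text-based datasource touched by this PR.
val text = spark.read.option("lineSep", "|").text("/data/records.txt")
val json = spark.read.option("lineSep", "\n").json("/data/records.jsonl")
val csv  = spark.read.option("lineSep", "\r\n").csv("/data/records.csv")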