
[SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources #18581

Closed
wants to merge 9 commits

Conversation

@HyukjinKwon (Member) commented Jul 10, 2017

What changes were proposed in this pull request?

This PR proposes to add a lineSep option for a configurable line separator in the text-based datasources: LibSVM, JSON, CSV and Text.

Note that, following Hive's default behaviour, \n also covers the other newline variants.

How was this patch tested?

Unit tests in LibSVMRelationSuite.scala, CSVSuite.scala, JsonSuite.scala, TextSuite.scala and python/pyspark/sql/tests.py
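As a rough illustration of the intended lineSep semantics (a plain-Python sketch, not Spark code; the function name is illustrative): with no explicit separator, any newline variant ends a record, while an explicit lineSep splits on that exact string only.

```python
from typing import List, Optional

def split_records(data: str, line_sep: Optional[str] = None) -> List[str]:
    """Sketch of the lineSep semantics.

    line_sep=None mimics the default behaviour (any of \n, \r\n, \r
    terminates a record); an explicit line_sep splits on that string only.
    """
    if line_sep is None:
        return data.splitlines()  # universal-newline treatment
    return data.split(line_sep)

# Default: every newline variant is a record boundary.
print(split_records("a\r\nb\rc\nd"))             # ['a', 'b', 'c', 'd']

# Custom separator: embedded \n stays inside a record.
print(split_records("a\n1|b\n2", line_sep="|"))  # ['a\n1', 'b\n2']
```

The second call is the new capability this PR adds: records may legitimately contain raw newlines when a custom separator delimits them.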

@HyukjinKwon HyukjinKwon changed the title [SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources [WIP][SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources Jul 10, 2017
@SparkQA commented Jul 10, 2017

Test build #79442 has finished for PR 18581 at commit eb77c98.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2017

Test build #79441 has finished for PR 18581 at commit a5410e5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2017

Test build #79440 has finished for PR 18581 at commit 585a399.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

retest this please

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources [SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources Jul 10, 2017
@HyukjinKwon HyukjinKwon changed the title [SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources [WIP][SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources Jul 10, 2017
@HyukjinKwon (Member Author)

Let me leave this as a WIP. I'll be back after talking with the Univocity author.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources [SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources Jul 10, 2017
@HyukjinKwon (Member Author) commented Jul 10, 2017

I guess we need a few more tests for CSV and some changes. Let me leave this out of the documentation here for now.

The additional test cases should cover writing out with a custom line separator

["a\n\na", "\nb"]

and reading in unquoted values with a custom line separator

a\n\na,\nb

which currently do not work as expected. We need setNormalizedNewline in this case, but it looks like we need more decisions about which value to set (this is a single character) and about the behaviour.

I can make a follow-up later if we are all fine with leaving this out of the documentation.
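To make the failure mode concrete, here is a plain-Python sketch (not Spark or Univocity code) of why unquoted values containing newlines only round-trip when records are split on the custom separator itself rather than on normalized newlines:

```python
import re

# The example values from the comment above: fields with embedded newlines,
# written out with a custom record separator "|".
rows = ["a\n\na", "\nb"]
line_sep = "|"

written = line_sep.join(rows)  # -> "a\n\na|\nb"

# Reading by splitting on the custom separator round-trips correctly.
assert written.split(line_sep) == rows

# A reader that normalizes newlines into record boundaries does not:
# the embedded "\n\n" turns into spurious empty records.
normalized = re.split(r"\r\n|\r|\n", written.replace(line_sep, "\n"))
assert normalized == ["a", "", "a", "", "b"]
```

This is the kind of case that would need newline normalization to be disabled (or configured) in the parser when a custom separator is in effect.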

@HyukjinKwon (Member Author)

cc @cloud-fan, could you take a look please when you have some time?

@HyukjinKwon (Member Author)

Let me cc @gatorsmile and @maropu who I believe are interested in this.

@SparkQA commented Jul 10, 2017

Test build #79446 has finished for PR 18581 at commit eb77c98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2017

Test build #79448 has finished for PR 18581 at commit eb77c98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 10, 2017

Test build #79455 has finished for PR 18581 at commit fdd624f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

Will clean up soon.

@maropu (Member) commented Jul 10, 2017

Yea, thanks for pinging me! (Sorry, I didn't notice your JIRA ping.) I'll check and comment on it later.

@SparkQA commented Jul 10, 2017

Test build #79465 has finished for PR 18581 at commit 4592b5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -41,11 +41,15 @@ private[libsvm] class LibSVMOptions(@transient private val parameters: CaseInsen
case o => throw new IllegalArgumentException(s"Invalid value `$o` for parameter " +
s"`$VECTOR_TYPE`. Expected types are `sparse` and `dense`.")
}

val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR)
Member

Do we need multiple characters for the separator? Hive assumes a single character in LINES TERMINATED BY: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable

Member Author

Yea, actually in the case of Univocity, it allows at most 2 characters:

CsvWriterSettings settings = new CsvWriterSettings();
settings.getFormat().setLineSeparator("aaa");
Exception in thread "main" java.lang.IllegalArgumentException: Invalid line separator. Up to 2 characters are expected. Got 3 characters.
	at com.univocity.parsers.common.Format.setLineSeparator(Format.java:121)
	at com.univocity.parsers.common.Format.setLineSeparator(Format.java:109)

I don't see a reason to restrict this for now. At least, I can provide a use case with Windows' \r\n.

@maropu (Member) commented Jul 10, 2017

Yea, ok. But I can't quite imagine a use case for more than two characters. I'm okay with following the committers' decision on this. Thanks!

Contributor

I've seen datasets that have multi-character delimiters of more than 2 characters, specifically |~|.

So yes, there is a use case, but it's a long-tail one. I'd be happy to see this progress with up to 2 characters and work towards 3+ in a future PR.

Member

Could we put this option in a single place for these formats? Putting it in each format separately looks a little messy...

Member

Also, if we support only one or two characters, I feel we'd better explicitly throw an exception for more than two characters here.
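The suggested check could be sketched like this (plain Python, hypothetical helper name; Spark's actual validation may differ), mirroring Univocity's "Up to 2 characters are expected" restriction:

```python
def validate_line_sep(sep: str) -> str:
    """Reject separators outside the 1-2 character range that the
    underlying Univocity parser supports."""
    if not 1 <= len(sep) <= 2:
        raise ValueError(
            "Invalid line separator. Up to 2 characters are expected. "
            "Got %d characters." % len(sep))
    return sep

validate_line_sep("\n")    # ok: single character
validate_line_sep("\r\n")  # ok: two characters
try:
    validate_line_sep("|~|")  # three characters, like the |~| case above
except ValueError as e:
    print(e)
```

Throwing eagerly at option-parsing time gives users a clear message instead of a deeper parser error.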

Member Author

compression is also there for many datasources. Let me try to open up a discussion about tying those up together later.

Member

ok, thanks!

@@ -628,6 +630,12 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* spark.read().text("/path/to/spark/README.md")
* }}}
*
* You can set the following text-specific options to deal with text files:
* <ul>
* <li>`lineSep` (default is `\r\n` or `\n`): defines the line separator that should
Member

How about explicitly setting \n as the default, for writing as well?

Member Author

I was thinking that way. However, this appears to be the default behaviour of Hadoop's LineRecordReader. I am worried about a behaviour change (e.g., I guess Windows users would be affected by this).

Member

Or, how about using a platform-dependent separator when writing? If we keep the existing behaviour, I feel we should at least document it better here.

Member Author

It sounds tricky, but I guess we don't want a platform-dependent one - see SPARK-18076. To my knowledge, LineRecordReader covers both \r\n and \n. I will try to improve this documentation when cleaning up the other comments, if we are okay with this.
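For reference, the "covers both \r\n and \n" default is the usual universal-newline treatment; a quick plain-Python analogue (not LineRecordReader itself) shows what the default reader accepts:

```python
# Universal-newline analogue of Hadoop's default LineRecordReader behaviour:
# with no explicit delimiter, \n, \r\n and \r all terminate a line, so a
# file mixing conventions still reads back as the expected lines.
data = "unix\nwindows\r\nold_mac\rlast"
print(data.splitlines())  # ['unix', 'windows', 'old_mac', 'last']
```

This is why keeping the default unset avoids a behaviour change for, e.g., Windows-produced files read on Linux.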

Member

ok, thanks!

@maropu (Member) commented Jul 10, 2017

Looks good to support a user-specified separator for CSV. As you suggested in the JIRA, I also feel we'd better file a JIRA to deprecate encoding support in CSV and start the discussion there.

@HyukjinKwon (Member Author)

I will definitely open a JIRA to discuss this further. Thank you for your heads-up.

@@ -234,6 +234,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
:param multiLine: parse one record, which may span multiple lines, per file. If None is
set, it uses the default value, ``false``.
:param lineSep: defines the line separator that should be used for parsing. If None is
set, it uses the default value, ``\\r\\n`` or ``\\n``.

Member Author

It has different values for reading and writing. I tried not to change either. So, the related pointers for the JSON read path are ...

https://github.com/apache/spark/pull/18581/files#diff-5ac20b8d75a20117deaa9ba4af814090R142 for data

https://github.com/apache/spark/pull/18581/files#diff-5ac20b8d75a20117deaa9ba4af814090R132 for schema inference.

@SparkQA commented Jul 11, 2017

Test build #79522 has finished for PR 18581 at commit 974357b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81034 has finished for PR 18581 at commit f357573.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 23, 2017

Test build #81036 has finished for PR 18581 at commit 87bb599.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

private[libsvm] object LibSVMOptions {
val NUM_FEATURES = "numFeatures"
val VECTOR_TYPE = "vectorType"
val DENSE_VECTOR_TYPE = "dense"
val SPARSE_VECTOR_TYPE = "sparse"
val LINE_SEPARATOR = "lineSep"
Member

Please use the full name.

Member Author

This name came from sep in CSV, which resembled R's. Do you prefer separator and lineSeparator?

Member

Either lineDelimiter or lineSeparator looks fine.

In the future, we could also support field delimiters.

Member Author

I actually meant sep here:

* <li>`sep` (default `,`): sets the single character as a separator for each
* field and value.</li>

and was thinking of matching the name.

Member Author

Just for history, it was delimiter but renamed to sep.

@@ -32,7 +32,9 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
* in that file.
*/
class HadoopFileLinesReader(
file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
file: PartitionedFile,
lineSeparator: Option[String],
Member

This should not be optional.

Member Author

Could you elaborate why? LineRecordReader without this behaves differently, covering the newline variants by default. I don't know which string I should pass to the LineRecordReader constructor if this is not optional.

Member

Do we not know what the default line separator is?

Member Author

The problem is the LineRecordReader() call here. I think we could probably do LineRecordReader(null) to express both \r\n and \n, but I am not sure we should use null to express these.

Member

When the line delimiter is '\n', any of the follow sequences will count as a delimiter: "\n", "\r\n", or "\r".

Could you check whether this is true here?

Member

\n = LF:

Multics, Unix and Unix-like systems (Linux, macOS, FreeBSD, AIX, Xenix, etc.), BeOS, Amiga, RISC OS, and others[1]

\r\n CR+LF:

Microsoft Windows, DOS (MS-DOS, PC DOS, etc.), DEC TOPS-10, RT-11, CP/M, MP/M, Atari TOS, OS/2, Symbian OS, Palm OS, Amstrad CPC, and most other early non-Unix and non-IBM operating systems

\r = CR:

Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9

Member

It sounds like Hive's behavior is reasonable. For most external users, \n means a new line. Thus, it should match any of these three options: LF, CR+LF and CR. WDYT?

Member Author

I was thinking \n meant the first case in your comment above, since it is set explicitly by the user. So I thought that, if it is not given, we cover the three newline cases by default. If we use \n to cover the three cases above, wouldn't we be unable to handle the (arguably corner) case of matching only \n?

@gatorsmile (Member) commented Aug 27, 2017

So far, following Hive is the safest. If users complain about it, we can behave differently from Hive using a new SQLConf.

Member Author

OK. I will change this, but I should say this approach looks incorrect to me, and this behaviour should be discussed and possibly updated in the near future.
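Under the Hive-style semantics being agreed on here, a user-supplied \n still matches all three newline variants; a plain-Python sketch of that matching (illustrative only, not the actual implementation):

```python
import re

def split_hive_style(data, line_sep):
    """Hive-style line splitting: a user-specified "\n" matches LF,
    CR+LF and CR alike; any other separator matches literally."""
    if line_sep == "\n":
        return re.split(r"\r\n|\r|\n", data)
    return data.split(line_sep)

print(split_hive_style("a\r\nb\rc\nd", "\n"))  # ['a', 'b', 'c', 'd']
# Note the trade-off raised above: with these semantics there is no way
# to express "match only LF", which is the corner case being given up.
```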

Files.write(path.toPath, lines.getBytes(StandardCharsets.UTF_8))
val df = spark.read
.option("multiLine", multiLine)
.option("lineSep", "|")
Member

The test case should be a function. We can pass different separators and verify whether it returns the expected results.

Member Author

Sure, let me give it a try.
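The suggested refactor, sketched in plain Python (the real test lives in a Scala suite; names here are hypothetical): a single helper takes the separator as a parameter and verifies the round trip for each one.

```python
def check_roundtrip(records, line_sep):
    """Parameterized round-trip check in the spirit of the suggestion
    above: 'write' with a separator, 'read' back, compare."""
    written = line_sep.join(records)
    read_back = written.split(line_sep)
    assert read_back == records, (line_sep, read_back)

# One function body, many separators under test.
for sep in ["|", "\n", "^^", "!!@@"]:
    check_roundtrip(["x", "y", "z"], sep)
print("all separators round-trip")
```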

@SparkQA commented Oct 28, 2017

Test build #83168 has finished for PR 18581 at commit a240973.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

gentle ping @gatorsmile

val path1 = new File(tempDir.getCanonicalPath, "write1")
try {
// Read
java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
Contributor

Why not import java.nio.file.Files?

@HyukjinKwon (Member Author) commented Dec 4, 2017

To explicitly differentiate it from Google's Files used above. Not a big deal.

val row1 = df.first()
assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
Contributor

Use === instead of ==

Member Author

I'd like to ask why, actually. I had some discussion about this before and it ended without a conclusion. The doc says === is preferred, but the actual error messages are sometimes even clearer with ==.

Contributor

I prefer to keep it consistent with the others.

Member Author

OK, but let me use ==. It seems that's what is used in the test cases in this file.

assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))

Contributor

So here you only test the first line? Why not use df.collect() to test every line?

Member Author

Why not just test the first line?

Contributor

The following test only checks df and readbackDF for equality. But it seems we also need to test the whole loaded df against the raw file content.

Member Author

Here we change how each line is handled in the iteration. I think both comparing a single line and comparing multiple lines are fine. Many tests here already check only the first line, I think?

Member Author

OK, let me update it. It's easy to change anyway.

@SparkQA commented Dec 6, 2017

Test build #84531 has finished for PR 18581 at commit 265dd48.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented Dec 6, 2017

Test build #84538 has finished for PR 18581 at commit 265dd48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

It looks like this line separator has to be handled by each data source individually. Can we start with, e.g., JSON, and then CSV, text, etc.? Then we can have smaller PRs that are easier to review.

@HyukjinKwon (Member Author)

Sure, I will try to separate this and update my PRs roughly within this week.

@gatorsmile (Member)

Thanks!

@HyukjinKwon (Member Author)

I opened #20727 for text datasource.

@cloud-fan, the other text-based sources depend on the text datasource in the schema inference path, so I made a fix for the text datasource first. Please check whether that makes sense when you are available.

@beickhoff

@HyukjinKwon, is there another PR to handle CSV?

@HyukjinKwon (Member Author)

Nope, not yet. I will try to make it in time for the next release.

@don4of4 commented Nov 4, 2018

Was this finished and merged in? I see https://issues.apache.org/jira/browse/SPARK-21289 is still open.

@HyukjinKwon (Member Author)

What you see is what you get. It's not yet finished. See also #20877 (comment)

9 participants