[SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources #18581
Conversation
Test build #79442 has finished for PR 18581 at commit
Test build #79441 has finished for PR 18581 at commit
Test build #79440 has finished for PR 18581 at commit
retest this please
Let me leave this as a WIP. I will be back after talking with the Univocity author.
I guess we need a few more tests about CSV and related changes. Let me leave this out of the documentation here for now. A few more test cases should cover writing out with a custom line separator, and reading in values without quotes with a custom line separator, which currently do not work as expected. I can make a follow-up later if we are all fine with leaving this documentation out.
cc @cloud-fan, could you take a look please when you have some time?
Let me cc @gatorsmile and @maropu, who I believe are interested in this.
Test build #79446 has finished for PR 18581 at commit
Test build #79448 has finished for PR 18581 at commit
Test build #79455 has finished for PR 18581 at commit
Will clean up soon.
Yea, thanks for pinging me! (Sorry, I didn't notice your JIRA ping.) I'll check and comment on it later.
Test build #79465 has finished for PR 18581 at commit
@@ -41,11 +41,15 @@ private[libsvm] class LibSVMOptions(@transient private val parameters: CaseInsen
      case o => throw new IllegalArgumentException(s"Invalid value `$o` for parameter " +
        s"`$VECTOR_TYPE`. Expected types are `sparse` and `dense`.")
    }

    val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR)
Do we need multiple characters for the separator? Hive assumes a single character in `LINES TERMINATED BY`: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
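For reference, a minimal sketch of Hive's constraint (the table name is hypothetical, shown via `spark.sql` purely for illustration):

// Hive's DDL assumes a single character for the line terminator,
// so a multi-character separator cannot be expressed here.
spark.sql(
  """CREATE TABLE t (c STRING)
    |ROW FORMAT DELIMITED
    |LINES TERMINATED BY '\n'""".stripMargin)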
Yea, actually in the case of Univocity, it allows at most 2 characters:

CsvWriterSettings settings = new CsvWriterSettings();
settings.getFormat().setLineSeparator("aaa");

Exception in thread "main" java.lang.IllegalArgumentException: Invalid line separator. Up to 2 characters are expected. Got 3 characters.
	at com.univocity.parsers.common.Format.setLineSeparator(Format.java:121)
	at com.univocity.parsers.common.Format.setLineSeparator(Format.java:109)

I don't see a reason to restrict this for now. At least, I can provide a use case with Windows: `\r\n`.
Yea, ok. But I can't imagine a use case for more than two characters. I'm okay with following the committers' decision on this. Thanks!
I've seen datasets that have multi-character delimiters of more than 2 characters, specifically `|~|`. So yes, there is a use case, but it's a long-tail one. I'd be happy to see this progress with up to 2 characters and work towards 3+ in a future PR.
Could we put this option in a single place for these formats? I feel putting this option in each format looks a little messy...
Also, if we support only one or two characters, I feel we had better explicitly throw an exception for more than two characters here.
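A minimal sketch of such an explicit check (not from the PR; the option plumbing is illustrative):

val lineSeparator: Option[String] = parameters.get(LINE_SEPARATOR).map { sep =>
  // Fail fast with a clear message instead of deferring to the underlying parser.
  require(sep.nonEmpty && sep.length <= 2,
    s"'$LINE_SEPARATOR' must be one or two characters, but got ${sep.length}.")
  sep
}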
`compression` is also there for many datasources. Let me try to open up a discussion about unifying those later.
ok, thanks!
@@ -628,6 +630,12 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
 * spark.read().text("/path/to/spark/README.md")
 * }}}
 *
 * You can set the following text-specific options to deal with text files:
 * <ul>
 * <li>`lineSep` (default is `\r\n` or `\n`): defines the line separator that should
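For illustration, the documented option would be used like this (the path is hypothetical):

// Read a text file whose records are terminated by Windows-style line endings.
val df = spark.read
  .option("lineSep", "\r\n")
  .text("/path/to/file.txt")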
How about explicitly setting `\n` by default, along with the writing case?
I was thinking that way. However, this appears to be the default behaviour from Hadoop's `LineRecordReader`. I am worried about a behaviour change (e.g., I guess Windows users would get affected by this).
Or, how about using a platform-dependent separator when writing? If we keep the existing behaviour, I feel we had at least better document more about it here.
It sounds tricky, but I guess we don't want a platform-dependent one; see SPARK-18076. To my knowledge, `LineRecordReader` should cover both `\r\n` and `\n`. I will try to improve this documentation when addressing the other comments, if we are okay with this.
ok, thanks!
Looks good to support a user-specified separator for CSV. As you suggested in the JIRA, I also feel we had better file a JIRA for the deprecation.
I will definitely open a JIRA to discuss further. Thank you for the heads-up.
python/pyspark/sql/readwriter.py (outdated)
@@ -234,6 +234,8 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
        default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
    :param multiLine: parse one record, which may span multiple lines, per file. If None is
        set, it uses the default value, ``false``.
    :param lineSep: defines the line separator that should be used for parsing. If None is
        set, it uses the default value, ``\\r\\n`` or ``\\n``.
Seems the default value is always `\n`? https://github.com/apache/spark/pull/18581/files#diff-059fbd7487f6bec7cbd8a57f21f8c8c5R48
It has different values for reading and writing. I tried not to change either. So, the related pointers for the JSON read path are:
https://github.com/apache/spark/pull/18581/files#diff-5ac20b8d75a20117deaa9ba4af814090R142 for data
https://github.com/apache/spark/pull/18581/files#diff-5ac20b8d75a20117deaa9ba4af814090R132 for schema inference.
Test build #79522 has finished for PR 18581 at commit
Test build #81034 has finished for PR 18581 at commit
Test build #81036 has finished for PR 18581 at commit
  }

private[libsvm] object LibSVMOptions {
  val NUM_FEATURES = "numFeatures"
  val VECTOR_TYPE = "vectorType"
  val DENSE_VECTOR_TYPE = "dense"
  val SPARSE_VECTOR_TYPE = "sparse"
  val LINE_SEPARATOR = "lineSep"
Please use the full name.
This name came after `sep` in CSV, which resembled R. Do you prefer `separator` and `lineSeparator`?
Either `lineDelimiter` or `lineSeparator` looks fine. In the future, we could also support field delimiters.
I actually meant `sep` here:

spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala, lines 488 to 489 in 3c0c2d0:

 * <li>`sep` (default `,`): sets the single character as a separator for each
 * field and value.</li>

and was thinking of matching the name.
Just for history: it was `delimiter` but was renamed to `sep`.
@@ -32,7 +32,9 @@ import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
 * in that file.
 */
 class HadoopFileLinesReader(
-    file: PartitionedFile, conf: Configuration) extends Iterator[Text] with Closeable {
+    file: PartitionedFile,
+    lineSeparator: Option[String],
This should not be optional.
Could you elaborate why? `LineRecordReader` without this works differently, covering newline variants by default. I don't know which string I should give to the `LineRecordReader` constructor if this is not optional.
Do we not know what the default line separator is?
The problem is `LineRecordReader()` here. I think we could probably do `LineRecordReader(null)` to express both `\r\n` and `\n`, but I am not sure we should use `null` to express these.
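For context, a rough sketch of how the optional separator could reach Hadoop (assuming the `mapreduce` `LineRecordReader`, whose constructor takes the record delimiter as bytes and falls back to default newline handling when given `null`):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader

val lineSeparator: Option[String] = Some("|")
// None becomes null, which makes LineRecordReader accept \n, \r\n and \r as delimiters.
val reader = new LineRecordReader(
  lineSeparator.map(_.getBytes(StandardCharsets.UTF_8)).orNull)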
"When the line delimiter is '\n', any of the following sequences will count as a delimiter: "\n", "\r\n", or "\r"." Could you check whether this is true here?
\n = LF: Multics, Unix and Unix-like systems (Linux, macOS, FreeBSD, AIX, Xenix, etc.), BeOS, Amiga, RISC OS, and others
\r\n = CR+LF: Microsoft Windows, DOS (MS-DOS, PC DOS, etc.), DEC TOPS-10, RT-11, CP/M, MP/M, Atari TOS, OS/2, Symbian OS, Palm OS, Amstrad CPC, and most other early non-Unix and non-IBM operating systems
\r = CR: Commodore 8-bit machines, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, Oberon, the classic Mac OS, MIT Lisp Machine and OS-9
It sounds like Hive's behavior is reasonable. For most external users, `\n` means a new line. Thus, it should match any of these three options: LF, CR+LF, and CR. WDYT?
I was thinking `\n` means the first case in your comment above, as it is set explicitly by the user. So I thought that if it is not given, it covers the three newline cases by default. If we use `\n` to deal with the three cases above, wouldn't we be unable to cover the (arguably corner) case of handling only `\n`?
So far, following Hive is the safest. If users complain about it, we can behave differently from Hive by using a new SQLConf.
OK, I will change this, but I should say this approach looks incorrect to me, and this behaviour should be discussed and possibly updated in the near future.
Files.write(path.toPath, lines.getBytes(StandardCharsets.UTF_8))
val df = spark.read
  .option("multiLine", multiLine)
  .option("lineSep", "|")
The test case should be a function. We can pass different separators and verify whether it returns the expected results.
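A minimal sketch of such a parameterized test (helper names like `withTempDir`, `testImplicits` and `checkAnswer` follow Spark's test utilities; the data and separators are illustrative):

import java.io.File

def testLineSeparator(lineSep: String): Unit = {
  test(s"SPARK-21289: read and write with line separator '$lineSep'") {
    withTempDir { dir =>
      import testImplicits._
      val path = new File(dir, "lineSep").getAbsolutePath
      val expected = Seq("a", "b", "c").toDF("value")
      // Write with the custom separator, then read it back and compare.
      expected.write.option("lineSep", lineSep).text(path)
      val readBack = spark.read.option("lineSep", lineSep).text(path)
      checkAnswer(readBack, expected)
    }
  }
}

Seq("|", "^", "\u0001").foreach(testLineSeparator)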
Sure, let me give it a try.
Test build #83168 has finished for PR 18581 at commit
gentle ping @gatorsmile
val path1 = new File(tempDir.getCanonicalPath, "write1")
try {
  // Read
  java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8))
Why not import this? `java.nio.file.Files`
To differentiate it explicitly from Google's `Files` above. Not a big deal.
val row1 = df.first()
assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
Use `===` instead of `==`.
I'd like to ask why, actually. I had some discussion about this and it ended without a conclusion. The doc says `===` is preferred, but the actual error messages are sometimes even clearer with `==`.
So I prefer to keep it consistent with the others.
OK, but let me use `==`. It seems that's what's used in the test cases of this file.
assert(row1.getDouble(0) == 1.0)
val v = row1.getAs[SparseVector](1)
assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0))))
So here you only test the first line? Why not use `df.collect()` to test every line?
Why not just test the first line?
The following test only checks equality between `df` and `readbackDF`. But it seems we also need to test equality between the whole loaded `df` and the raw file content.
Here we change how each line is handled during iteration. I think comparing either a single line or multiple lines is fine. I think many tests here already test only the first line?
OK, let me update it. It's easy to change anyway.
Test build #84531 has finished for PR 18581 at commit
retest this please
Test build #84538 has finished for PR 18581 at commit
It looks like this line separator has to be handled by each data source individually. Can we start with, e.g., json, and then csv, text, etc.? Then we can have smaller PRs that would be easier to review.
Sure, I will try to separate this and will update my PRs soon, roughly within this week.
Thanks!
I opened #20727 for the text datasource. @cloud-fan, other text-based sources depend on the text datasource in the schema inference path, so I made a fix for the text datasource first. Please check whether that makes sense when you are available.
@HyukjinKwon, is there another PR to handle CSV?
Nope, not yet. I will try to make it within the next release soon.
Was this finished and merged in? I see https://issues.apache.org/jira/browse/SPARK-21289 is still open.
What you see is what you get. It's not yet finished. See also #20877 (comment)
What changes were proposed in this pull request?

This PR proposes to add a `lineSep` option for a configurable line separator in the text-based datasources: LibSVM, JSON, CSV and Text. Note that this PR follows Hive's default behaviour for `\n`: it also covers the other newline variants.

How was this patch tested?

Unit tests in `LibSVMRelationSuite.scala`, `CSVSuite.scala`, `JsonSuite.scala`, `TextSuite.scala` and `python/pyspark/sql/tests.py`.
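For illustration, the option proposed here would be set the same way across the affected readers (the paths are hypothetical):

// The same option name applies to each text-based datasource touched by this PR.
val text = spark.read.option("lineSep", "|").text("/data/records.txt")
val json = spark.read.option("lineSep", "\n").json("/data/records.jsonl")
val csv  = spark.read.option("lineSep", "\r\n").csv("/data/records.csv")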