[SPARK-21098] Set lineseparator csv multiline and csv write to \n #18304

Closed

Conversation


@danielvdende danielvdende commented Jun 14, 2017

This commit sets the line separator for reading a multiline CSV file
or writing a CSV file. We cannot make this configurable for reading,
as it depends on Hadoop's LineReader, which has a hardcoded
\n as the line ending.

What changes were proposed in this pull request?

The univocity-parsers library uses the system line ending as its default line ending character. Rather than remain dependent on that library's setting, we set the default to \n. We cannot make this configurable for reading, as reading depends on Hadoop's LineReader, which has a hardcoded \n as the line ending.

How was this patch tested?

Existing tests all pass.

@@ -90,6 +90,7 @@ class CSVOptions(
val quote = getChar("quote", '\"')
val escape = getChar("escape", '\\')
val comment = getChar("comment", '\u0000')
val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
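
For context, a minimal sketch (not part of the diff) of how this value could be applied when building the Univocity writer settings; the helper name and wiring below are illustrative rather than Spark's actual code:

```scala
import com.univocity.parsers.csv.CsvWriterSettings

// Illustrative: pin the writer's line separator to the CSVOptions default
// ("\n") instead of letting Univocity fall back to the OS-dependent one.
def asWriterSettings(lineSeparator: String): CsvWriterSettings = {
  val settings = new CsvWriterSettings()
  settings.getFormat.setLineSeparator(lineSeparator)
  settings
}
```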
@HyukjinKwon (Member)
Could we read back with this custom line separator? For the write path, the hardcoded \n was removed before because it had no effect, but the behavior now looks dependent on the univocity library -
#16428 (comment)

@danielvdende (Contributor Author)
Hi, thanks for pointing this out; I wasn't aware of it. Looking at it, though, how would you propose fixing this? We would need a writerSettings instance to fetch the univocity default, I think. Also, looking at the other parameters, I see hardcoded values there as well (for instance, val quote = getChar("quote", '\"')). Personally, I like being able to see the default directly in Spark, so that changes in the univocity lib cannot silently affect Spark's behavior (and cause confusion for users).

@HyukjinKwon (Member)
Yea, that's why I haven't proposed this change so far (for both the hardcoded \n and a configurable line separator).

For the hardcoded \n: IMHO, reviving the hardcoded \n is probably fine for now, and consistent with the JSON/TEXT datasources, because after the change I described above, Univocity uses an OS-dependent newline by default - https://github.com/uniVocity/univocity-parsers/blob/e34a149077147de7cafcafbbaf8a4310e6297584/src/main/java/com/univocity/parsers/common/Format.java#L78 - which is machine-dependent. If we are worried about changing this, we could use System.getProperty("line.separator") or simply not set it via Option. This probably does not block this PR.
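
As a quick illustration of that default (an aside, not from the original comment), Univocity's Format falls back to the platform separator unless one is set explicitly:

```scala
import com.univocity.parsers.csv.CsvWriterSettings

val settings = new CsvWriterSettings()
// Univocity uses the platform line separator by default, matching
// System.getProperty("line.separator") on typical JVMs:
println(settings.getFormat.getLineSeparatorString) // "\n" on Linux/macOS, "\r\n" on Windows
println(System.getProperty("line.separator"))
```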

For the configurable line separator: we are not able to read it back when the wholeFile (or multiLine, now) option is disabled (which is the default and should be the primary path), because we depend on LineRecordReader, which only deals with \n and \r\n. We really should be able to read back what we write out.

@HyukjinKwon (Member)
To cut this short: to my knowledge, the current status is as follows (please correct me if I am wrong):

CSV read -> covers \n and \r\n
CSV read with wholeFile (multiLine) -> OS-dependent newline (by Univocity)
CSV write -> OS-dependent newline (by Univocity)

JSON read -> covers \n and \r\n
JSON read with wholeFile (multiLine) -> N/A (it currently reads the whole file as a single record)
JSON write -> \n

TEXT read -> covers \n and \r\n
TEXT write -> \n

For the hardcoded newline, fixing it to \n is probably better. I wouldn't mind if this PR were turned into fixing that for 'CSV read with wholeFile (multiLine)' and 'CSV write'.

For a configurable line separator, it would not be configurable this way for 'CSV read'. We should be able to read back what we write out.

@danielvdende (Contributor Author)
Ah, I think I understand now, thanks.

Maybe I'm overlooking it, but I don't see the wholeFile option (or multiLine) being used in CSVOptions at all at the moment?

I'll take a further look at fixing this up, considering the dependency on LineRecordReader (which only supports the two line separators you mentioned).

@HyukjinKwon (Member)
Ah, it was wholeFile but was renamed to multiLine in 2051428 four hours ago.

Yes, it would be wonderful if we could support this in reading too. If that is difficult, it is fine (to me) to turn this PR into the hardcoded fix.

@danielvdende (Contributor Author)
Yeah, I did some digging, but that will probably be a lot trickier, because it would involve making changes in the Hadoop repo as well. For now, I've changed this PR to only contain the hardcoding to \n, as you suggested. I may also take a look at the Hadoop code that would need changing, but that would be a lot more work than the scope of this story suggests.

@HyukjinKwon HyukjinKwon (Member) commented Jun 15, 2017
(Just FYI: about one and a half years ago, I proposed this - https://github.com/apache/spark/pull/11016/files#diff-6a14f6bb643b1474139027d72a17f41aR141. It might be helpful if you are willing to take a look further.)

@danielvdende (Contributor Author)
Updated the PR (I had already updated the Jira and the commit).

I'll definitely take a look at that proposal. It seems like a very valuable addition to Spark (I've heard several people around me want the ability to set the line separator themselves).

@danielvdende force-pushed the add-csv-line-separator-option branch from 9684a39 to f4d2ce5 on June 15, 2017 at 12:19
@danielvdende force-pushed the add-csv-line-separator-option branch from f4d2ce5 to 8bf5ead on June 15, 2017 at 12:21
@HyukjinKwon (Member)

BTW, let's update JIRA/PR too when you have some time.

@danielvdende danielvdende changed the title [SPARK-21098] Add lineseparator parameter to csv options [SPARK-21098] Set lineseparator csv multiline and csv write to \n Jun 15, 2017
@danielvdende (Contributor Author)

@HyukjinKwon any further update on this?

@HyukjinKwon (Member)

Would you mind waiting for the resolution of https://github.com/apache/spark/pull/18581? Strictly speaking, they are orthogonal, as that PR tries not to change the default line separator, but I would like to wait and see other opinions on this.

@cse68197 commented May 24, 2018

Could you please verify if this has been fixed or not?

@cse68197

I am writing data to a file like below:
allDF.rdd.map(rec => rec.mkString("|")).repartition(1).saveAsTextFile("location for file")

but when I open that file in Notepad, everything appears on a single line, while the same file opens fine in Notepad++ and I can see all the data on new lines.

I also tried the below options (one by one) before saving, but those did not work either:
spark.conf.set("textinputformat.record.delimeter","\r\n")
spark.conf.set("textinputformat.record.delimeter","\n")

So could you please help me understand an alternative way to fix this?
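
For reference, a hedged workaround sketch: classic Notepad cannot render bare \n line endings, saveAsTextFile always terminates records with \n, and the record-delimiter property (correctly spelled "delimiter") only affects reads and belongs on the Hadoop configuration rather than spark.conf. The output path below is hypothetical:

```scala
// Sketch: force Windows-style "\r\n" endings so classic Notepad shows the
// line breaks. saveAsTextFile appends "\n" after each record, so appending
// "\r" per record yields "\r\n".
allDF.rdd
  .map(rec => rec.mkString("|") + "\r")
  .repartition(1)
  .saveAsTextFile("/path/to/output")

// For reads, set the correctly spelled property on the Hadoop configuration:
spark.sparkContext.hadoopConfiguration
  .set("textinputformat.record.delimiter", "\r\n")
```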

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun (Member)

Is this still valid, @cse68197 and @HyukjinKwon?
#18581 is superseded by #20727, and #20727 seems to be merged. Are we waiting on the others?

@dongjoon-hyun (Member)

Gentle ping. If the issue is resolved, please close this PR, @cse68197 .

@HyukjinKwon (Member)

CSV's lineSep is not added yet. The problem here is specific to CSV: the newline separator is OS-dependent because of Univocity's default, which is not the case with Jackson, and it can be worked around once CSV's newline option is added. I was working on this feature but hit some problems handling multiLine in CSV. I will make a PR when I'm available.

@danielvdende, let's leave this closed for now. I will ping you in the PR I open later.

@danielvdende (Contributor Author)

Sounds good to me 👍 @HyukjinKwon

@dongjoon-hyun (Member) commented Sep 17, 2018

Thank you so much for the new decision, @HyukjinKwon and @danielvdende !

Then, please close this PR for now, @danielvdende .

@dongjoon-hyun (Member)

Ping, @danielvdende .

@dongjoon-hyun (Member)

Thanks!

@SAKSHIC commented Oct 7, 2018

Setting a custom record delimiter to read a file via the textinputformat.record.delimiter parameter works. How do we set a custom record delimiter for writing a file?

@HyukjinKwon (Member)

@danielvdende, the newlines are now automatically detected, so this should not be an issue anymore.

@fusionet24

To echo @SAKSHIC: how do we set custom record delimiters when writing files, please?

@HyukjinKwon (Member)

It will be possible in Spark 3.0, as of #23080 (thanks, @MaxGekk).
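
A hedged usage sketch of that option (df and the paths are hypothetical; per the Spark 3.0 docs, CSV's lineSep is limited to a single character):

```scala
// Sketch: write CSV with a custom single-character record delimiter via the
// lineSep option (#23080, Spark 3.0+), then read it back with the same one.
df.write
  .option("lineSep", "\u0001")
  .csv("/path/to/out")

val roundTrip = spark.read
  .option("lineSep", "\u0001")
  .csv("/path/to/out")
```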
