[SPARK-21098] Set lineseparator csv multiline and csv write to \n #18304

Closed

Conversation


@danielvdende danielvdende commented Jun 14, 2017

This commit sets the line separator for reading a multiline CSV file
or writing a CSV file. We cannot make this configurable for reading,
as it depends on Hadoop's LineReader, which has a hardcoded
\n as the line ending.

What changes were proposed in this pull request?

The univocity-parsers library uses the system line ending as its default line ending character. Rather than remain dependent on that library's setting, we set the default to \n. We cannot make this configurable for reading, as reading depends on Hadoop's LineReader, which has a hardcoded \n as the line ending.

How was this patch tested?

Existing tests all pass.

@@ -90,6 +90,7 @@ class CSVOptions(
val quote = getChar("quote", '\"')
val escape = getChar("escape", '\\')
val comment = getChar("comment", '\u0000')
val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
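
For context, a minimal sketch (not part of the diff) of how this value could be applied when building the Univocity writer settings; the helper name and wiring below are illustrative rather than Spark's actual code:

```scala
import com.univocity.parsers.csv.CsvWriterSettings

// Illustrative: pin the writer's line separator to the CSVOptions default
// ("\n") instead of letting Univocity fall back to the OS-dependent one.
def asWriterSettings(lineSeparator: String): CsvWriterSettings = {
  val settings = new CsvWriterSettings()
  settings.getFormat.setLineSeparator(lineSeparator)
  settings
}
```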
@HyukjinKwon (Member)
Could we read back with this custom line separator? For the write path, the hardcoded \n was removed before because it had no effect, but the behavior now looks dependent on the univocity library -
#16428 (comment)

@danielvdende (Contributor Author)
Hi, thanks for pointing this out; I wasn't aware of it. Looking at it, though, how would you propose fixing this? We would need a writerSettings instance to fetch the univocity default, I think. Also, looking at the other parameters, I see hardcoded values there as well (for instance, val quote = getChar("quote", '\"')). Personally, I like being able to see the default directly in Spark, so that changes in the univocity lib cannot silently affect Spark's behavior (and cause confusion for users).

@HyukjinKwon (Member)
Yea, that's why I haven't proposed this change so far (for both the hardcoded \n and a configurable line separator).

For the hardcoded \n: IMHO, reviving the hardcoded \n is probably fine for now, and consistent with the JSON/TEXT datasources, because after the change I described above, Univocity uses an OS-dependent newline by default - https://github.com/uniVocity/univocity-parsers/blob/e34a149077147de7cafcafbbaf8a4310e6297584/src/main/java/com/univocity/parsers/common/Format.java#L78 - which is machine-dependent. If we are worried about changing this, we could use System.getProperty("line.separator") or simply not set it via Option. This probably does not block this PR.
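
As a quick illustration of that default (an aside, not from the original comment), Univocity's Format falls back to the platform separator unless one is set explicitly:

```scala
import com.univocity.parsers.csv.CsvWriterSettings

val settings = new CsvWriterSettings()
// Univocity uses the platform line separator by default, matching
// System.getProperty("line.separator") on typical JVMs:
println(settings.getFormat.getLineSeparatorString) // "\n" on Linux/macOS, "\r\n" on Windows
println(System.getProperty("line.separator"))
```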

For the configurable line separator: we are not able to read it back when the wholeFile (or multiLine, now) option is disabled (which is the default and should be the primary path), because we depend on LineRecordReader, which only deals with \n and \r\n. We really should be able to read back what we write out.

@HyukjinKwon (Member)
To cut this short: to my knowledge, the current status is as follows (please correct me if I am wrong):

CSV read -> covers \n and \r\n
CSV read with wholeFile (multiLine) -> OS-dependent newline (by Univocity)
CSV write -> OS-dependent newline (by Univocity)

JSON read -> covers \n and \r\n
JSON read with wholeFile (multiLine) -> N/A (it currently reads the whole file as a single record)
JSON write -> \n

TEXT read -> covers \n and \r\n
TEXT write -> \n

For the hardcoded newline, fixing it to \n is probably better. I wouldn't mind if this PR were turned into fixing that for 'CSV read with wholeFile (multiLine)' and 'CSV write'.

For a configurable line separator, it would not be configurable this way for 'CSV read'. We should be able to read back what we write out.

@danielvdende (Contributor Author)
Ah, I think I understand now, thanks.

Maybe I'm overlooking it, but I don't see the wholeFile option (or multiLine) being used in CSVOptions at all at the moment?

I'll take a further look at fixing this up, considering the dependency on LineRecordReader (which only supports the two line separators you mentioned).

@HyukjinKwon (Member)
Ah, it was wholeFile but was renamed to multiLine in 2051428 four hours ago.

Yes, it would be wonderful if we could support this in reading too. If that is difficult, it is fine (to me) to turn this PR into the hardcoded fix.

@danielvdende (Contributor Author)
Yeah, I did some digging, but that will probably be a lot trickier, because it would involve making changes in the Hadoop repo as well. For now, I've changed this PR to only contain the hardcoding to \n, as you suggested. I may also take a look at the Hadoop code that would need changing, but that would be a lot more work than the scope of this story suggests.

@HyukjinKwon HyukjinKwon (Member) commented Jun 15, 2017
(Just FYI: about one and a half years ago, I proposed this - https://github.com/apache/spark/pull/11016/files#diff-6a14f6bb643b1474139027d72a17f41aR141. It might be helpful if you are willing to take a look further.)

@danielvdende (Contributor Author)
Updated the PR (I had already updated the Jira and the commit).

I'll definitely take a look at that proposal. It seems like a very valuable addition to Spark (I've heard several people around me want the ability to set the line separator themselves).

@danielvdende force-pushed the add-csv-line-separator-option branch from 9684a39 to f4d2ce5 on June 15, 2017 at 12:19
@danielvdende force-pushed the add-csv-line-separator-option branch from f4d2ce5 to 8bf5ead on June 15, 2017 at 12:21
@HyukjinKwon (Member)

BTW, let's update JIRA/PR too when you have some time.

@danielvdende danielvdende changed the title [SPARK-21098] Add lineseparator parameter to csv options [SPARK-21098] Set lineseparator csv multiline and csv write to \n Jun 15, 2017
@danielvdende (Contributor Author)

@HyukjinKwon any further update on this?

@HyukjinKwon (Member)

Would you mind waiting for the resolution of https://github.com/apache/spark/pull/18581? Strictly speaking, they are orthogonal, as that PR tries not to change the default line separator, but I would like to wait and see other opinions on this.

@cse68197 commented May 24, 2018

Could you please verify if this has been fixed or not?

@cse68197

I am writing data to a file like below:
allDF.rdd.map(rec => rec.mkString("|")).repartition(1).saveAsTextFile("location for file")

but when I open that file in Notepad, everything appears on a single line, while the same file opens fine in Notepad++ and I can see all the data on new lines.

I also tried the below options (one by one) before saving, but those did not work either:
spark.conf.set("textinputformat.record.delimeter","\r\n")
spark.conf.set("textinputformat.record.delimeter","\n")

So could you please help me understand an alternative way to fix this?
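
For reference, a hedged workaround sketch: classic Notepad cannot render bare \n line endings, saveAsTextFile always terminates records with \n, and the record-delimiter property (correctly spelled "delimiter") only affects reads and belongs on the Hadoop configuration rather than spark.conf. The output path below is hypothetical:

```scala
// Sketch: force Windows-style "\r\n" endings so classic Notepad shows the
// line breaks. saveAsTextFile appends "\n" after each record, so appending
// "\r" per record yields "\r\n".
allDF.rdd
  .map(rec => rec.mkString("|") + "\r")
  .repartition(1)
  .saveAsTextFile("/path/to/output")

// For reads, set the correctly spelled property on the Hadoop configuration:
spark.sparkContext.hadoopConfiguration
  .set("textinputformat.record.delimiter", "\r\n")
```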

@AmplabJenkins

Can one of the admins verify this patch?

@dongjoon-hyun (Member)

Is this still valid, @cse68197 and @HyukjinKwon?
#18581 is superseded by #20727, and #20727 seems to be merged. Are we waiting on the others?

@dongjoon-hyun (Member)

Gentle ping. If the issue is resolved, please close this PR, @cse68197 .

@HyukjinKwon (Member)

CSV's lineSep is not added yet. The problem here is specific to CSV: the newline separator is OS-dependent because of Univocity's default, which is not the case with Jackson, and it can be worked around once CSV's newline option is added. I was working on this feature but hit some problems handling multiLine in CSV. I will make a PR when I'm available.

@danielvdende, let's leave this closed for now. I will ping you in the PR I open later.

@danielvdende (Contributor Author)

Sounds good to me 👍 @HyukjinKwon

@dongjoon-hyun (Member) commented Sep 17, 2018

Thank you so much for the new decision, @HyukjinKwon and @danielvdende !

Then, please close this PR for now, @danielvdende .

@dongjoon-hyun (Member)

Ping, @danielvdende .

@dongjoon-hyun (Member)

Thanks!

@SAKSHIC commented Oct 7, 2018

Setting a custom record delimiter to read a file via the textinputformat.record.delimiter parameter works. How do we set a custom record delimiter for writing a file?

@HyukjinKwon (Member)

@danielvdende, the newlines are now automatically detected, so this should not be an issue anymore.

@fusionet24

To echo @SAKSHIC: how do we set custom record delimiters when writing files, please?

@HyukjinKwon (Member)

It will be possible in Spark 3.0, as of #23080 (thanks, @MaxGekk).
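
A hedged usage sketch of that option (df and the paths are hypothetical; per the Spark 3.0 docs, CSV's lineSep is limited to a single character):

```scala
// Sketch: write CSV with a custom single-character record delimiter via the
// lineSep option (#23080, Spark 3.0+), then read it back with the same one.
df.write
  .option("lineSep", "\u0001")
  .csv("/path/to/out")

val roundTrip = spark.read
  .option("lineSep", "\u0001")
  .csv("/path/to/out")
```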
