[SPARK-21098] Set lineseparator csv multiline and csv write to \n #18304
Conversation
@@ -90,6 +90,7 @@ class CSVOptions(
    val quote = getChar("quote", '\"')
    val escape = getChar("escape", '\\')
    val comment = getChar("comment", '\u0000')
+   val lineSeparator = parameters.getOrElse("lineSeparator", "\n")
Could we read data back with this custom line separator? For the writing path, the hard-coded `\n` was removed before because it had no effect, but it looks like the behaviour now depends on the Univocity library - #16428 (comment)
Hi, thanks for pointing this out, I wasn't aware of it. Looking at it, though, how would you propose fixing this? We would need a `writerSettings` instance to fetch the Univocity default, I think. Also, looking at the other parameters, I see hardcoded values there as well (for instance `val quote = getChar("quote", '\"')`). Personally, I like the idea of being able to see the default directly in Spark, to prevent changes in the Univocity lib from impacting Spark usage (and causing confusion).
Yea, that's why I haven't proposed this change so far (for both the hardcoded `\n` and a configurable line separator).

For the hardcoded `\n`, IMHO, reviving it is probably fine for now, for consistency with the JSON/TEXT datasources, because after the change I described above Univocity uses the OS-dependent newline by default - https://github.com/uniVocity/univocity-parsers/blob/e34a149077147de7cafcafbbaf8a4310e6297584/src/main/java/com/univocity/parsers/common/Format.java#L78 - which is machine-dependent. If we are worried about changing this, we could use `System.getProperty("line.separator")` or just not set it via `Option`. This probably does not block this PR.

For the configurable line separator, we are not able to read it back when the `wholeFile` (or `multiLine` now) option is disabled (which is the default and should be the primary path), because we depend on `LineRecordReader`, which only deals with `\n` and `\r\n`. We really should be able to read back what we write.
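To illustrate the read-back problem, here is a minimal sketch in plain Python (not Spark or Hadoop code; the function name is made up for illustration) of why records written with a custom line separator cannot be recovered by a reader that, like Hadoop's `LineRecordReader`, only splits on `\n` and `\r\n`:

```python
def split_like_line_record_reader(data: str) -> list[str]:
    # Approximates a reader that only recognizes "\n" and "\r\n" as
    # record boundaries, like Hadoop's LineRecordReader.
    return data.replace("\r\n", "\n").split("\n")

# Two records written out with a hypothetical custom separator "|".
written = "a,b|c,d"

# Reading back with a \n/\r\n-only splitter yields ONE record, not two:
# the custom separator is invisible to the reader.
assert split_like_line_record_reader(written) == ["a,b|c,d"]

# The same data written with "\n" round-trips correctly.
assert split_like_line_record_reader("a,b\nc,d") == ["a,b", "c,d"]
```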
To cut this short, to my knowledge the current status is as follows (please correct me if I am wrong):

- CSV read -> covers `\n` and `\r\n`
- CSV read with `wholeFile` (`multiLine`) -> OS-dependent newline (by Univocity)
- CSV write -> OS-dependent newline (by Univocity)
- JSON read -> covers `\n` and `\r\n`
- JSON read with `wholeFile` (`multiLine`) -> N/A (it currently reads the whole file as a single record)
- JSON write -> `\n`
- TEXT read -> covers `\n` and `\r\n`
- TEXT write -> `\n`

For the hardcoded newline, fixing it to `\n` is probably better. I wouldn't mind if this PR is narrowed to fixing this for 'CSV read with wholeFile (multiLine)' and 'CSV write'.

For the configurable line separator, 'CSV read' would not be configurable this way. We should be able to read back what we write out.
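The "OS-dependent vs fixed `\n`" distinction above can be sketched with Python's `csv` module standing in for Univocity (this is an illustration of the concept, not Spark code): pinning the line terminator makes the output byte-identical on every machine, while using the OS default does not.

```python
import csv
import io
import os

rows = [["a", "b"], ["c", "d"]]

# Pinned to "\n": the same bytes on every OS
# (analogous to hardcoding \n in the CSV writer).
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
assert buf.getvalue() == "a,b\nc,d\n"

# OS-dependent: os.linesep is "\r\n" on Windows and "\n" elsewhere,
# so the output differs by machine (analogous to Univocity's default).
buf2 = io.StringIO()
csv.writer(buf2, lineterminator=os.linesep).writerows(rows)
assert buf2.getvalue() == f"a,b{os.linesep}c,d{os.linesep}"
```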
Ah, I think I understand now, thanks.
Maybe I'm overlooking it, but I don't see the `wholeFile` option (nor `multiLine`) being used in `CSVOptions` at all at the moment?
I'll take a further look at fixing this up, considering the dependency on `LineRecordReader` (which only supports the two line separators you mentioned).
Ah, it was `wholeFile` but it was just renamed to `multiLine` in 2051428 four hours ago.
Yes, it would be wonderful if we could support this in reading too. If that is difficult, it is fine (to me) to narrow this PR to the hardcoded one.
Yeah, I did some digging, but that will probably be a lot trickier, because it would involve making changes in the Hadoop repo as well. For now, I've changed this PR to only contain the hardcoding to `\n`, as you suggested. I may take a look at the Hadoop code that would need changing as well, but that would require a lot more work than the scope of this story suggests.
(Just FYI, about one and a half years ago I proposed this - https://github.com/apache/spark/pull/11016/files#diff-6a14f6bb643b1474139027d72a17f41aR141. It might be helpful if you are willing to take a look further.)
Updated the PR (the JIRA and the commit were already done).
I'll definitely take a look at that proposal. It seems like a very valuable addition to Spark (I've heard a few people around me want the ability to set the line separator themselves).
This commit sets the lineseparator for reading a multiline csv file or writing a csv file. We cannot make this configurable for reading as it depends on LineReader from Hadoop, which has a hardcoded \n as line ending.
BTW, let's update the JIRA/PR too when you have some time.
@HyukjinKwon any further update on this?
Would you mind if I ask to wait for the resolution of
Could you please verify if this has been fixed or not?
I am writing data to a file like below, but when I open that file in Notepad it appears on a single line, while the same file opens fine in Notepad++ and I can see all the data on new lines. I tried the below options (one by one) before saving as well, but those did not work either. Could you please help me understand an alternative way to fix it?
Can one of the admins verify this patch?
Is this still valid, @cse68197 and @HyukjinKwon?
Gentle ping. If the issue is resolved, please close this PR, @cse68197.
@danielvdende, let's leave this closed for now. Will ping you in the PR I will open later.
Sounds good to me 👍 @HyukjinKwon
Thank you so much for the new decision, @HyukjinKwon and @danielvdende! Then, please close this PR for now, @danielvdende.
Ping, @danielvdende.
Thanks!
Setting a custom record delimiter for reading the file via the parameter (`textinputformat.record.delimiter`) works. How do we set a custom record delimiter for writing the file?
@danielvdende the newlines are now automatically detected. This should not be an issue anymore.
To echo @SAKSHIC, how do we set custom record delimiters when writing files, please?
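The record-delimiter behaviour asked about above can be sketched in plain Python (this is only an illustration of the concept behind `textinputformat.record.delimiter`, not Spark or Hadoop code; both function names are hypothetical): reading splits the input on an arbitrary delimiter string, and a write-side counterpart would simply join records with the same delimiter so the two round-trip.

```python
def read_records(data: str, delimiter: str) -> list[str]:
    # Mirrors a delimiter-aware record reader: split on the custom token
    # instead of on "\n"/"\r\n".
    return data.split(delimiter)

def write_records(records: list[str], delimiter: str) -> str:
    # The write-side counterpart: join records with the same delimiter
    # so that read_records() can recover them.
    return delimiter.join(records)

data = write_records(["a,b", "c,d"], delimiter="||")
assert data == "a,b||c,d"
assert read_records(data, delimiter="||") == ["a,b", "c,d"]
```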
What changes were proposed in this pull request?
The Univocity parser library uses the system line-ending character as its default line ending. Rather than remain dependent on that library's setting, we set the default to `\n`. We cannot make this configurable for reading, as reading depends on `LineReader` from Hadoop, which has a hardcoded `\n` as the line ending.
How was this patch tested?
Existing tests all pass.