-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23765][SQL] Supports custom line separator for json datasource #20877
Conversation
@@ -251,5 +254,8 @@ private[sql] class JacksonGenerator( | |||
mapType = dataType.asInstanceOf[MapType])) | |||
} | |||
|
|||
def writeLineEnding(): Unit = gen.writeRaw('\n') | |||
def writeLineEnding(): Unit = { | |||
// Note that JSON uses writer with UTF-8 charset. This string will be written out as UTF-8. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant here:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
Lines 88 to 93 in de36f65
def createOutputStreamWriter( | |
context: JobContext, | |
file: Path, | |
charset: Charset = StandardCharsets.UTF_8): OutputStreamWriter = { | |
new OutputStreamWriter(createOutputStream(context, file), charset) | |
} |
@@ -208,9 +208,11 @@ class TextSuite extends QueryTest with SharedSQLContext { | |||
} | |||
} | |||
|
|||
Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString).foreach { lineSep => | |||
// scalastyle:off nonascii | |||
Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString, "아").foreach { lineSep => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly unrelated but I just added. I am fine with reverting this out if it bugs anyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, "아" means just "ah" without any meaning ..
cc @cloud-fan, @MaxGekk and @hvanhovell, could you review this when you are available? |
Test build #88507 has finished for PR 20877 at commit
|
Test build #88509 has finished for PR 20877 at commit
|
retest this please |
Test build #88515 has finished for PR 20877 at commit
|
retest this please |
What about to make the option more flexible like in the PR: MaxGekk#1 ? It would be nice to handle JSON Streaming for example: https://en.wikipedia.org/wiki/JSON_streaming |
I am neutral. Does that fix actual usecases? I can help review anyway. Would you like to make a followup separately? |
Test build #88518 has finished for PR 20877 at commit
|
@HyukjinKwon We have a few clients who are interested in processing of JSON streaming like data. Here is the PR which combines your changes and mine: #20885 |
I see the following use cases:
|
Yup, yup. I don't object for now. Shall we merge this one first and talk more about it in your PR? |
I have only one concern: if we merge this PR, we close the possibility for changing format of |
I think we are fine to change the behaviour of |
retest this please |
@cloud-fan, @MaxGekk, and @hvanhovell, would you mind taking a look please when you have some time? I think this is pretty similar with #20727 except one difference that it uses text datasource's |
Test build #88567 has finished for PR 20877 at commit
|
retest this please |
Test build #88568 has finished for PR 20877 at commit
|
retest this please |
Test build #88570 has finished for PR 20877 at commit
|
LGTM |
@@ -268,6 +268,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo | |||
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li> | |||
* <li>`multiLine` (default `false`): parse one record, which may span multiple lines, | |||
* per file</li> | |||
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a test case for testing the default covers \r
, \r\n
and \n
?
@MaxGekk Will you submit a PR for addressing the comment #20877 (comment) in the next few weeks? If so, we can hold this PR. |
He submitted this - #20885 and I believe we need more feedback and another review iteration. |
@gatorsmile The PR has been already submitted: #20885 . Frankly speaking I would prefer another name for the option like we discussed before: MaxGekk#1 but similar PR for text datasource had been merged already: #20727 . And I think it is more important to have the same option across all datasource for now than arguing about option name . That's why I renamed recordDelimiter to lineSep in #20885 / cc @rxin |
Correct me if I am wrong. None of renaming or adding more flexible functionality to the line separator blocks this PR, right? Even if we go renaming, we should do it for text datasource too which I believe is better to do it separately, and the flexible functionality in the line separator looks needing more feedback and discussion. |
We can also change both if they haven’t been released yet.
…On Sun, Mar 25, 2018 at 10:37 AM Maxim Gekk ***@***.***> wrote:
@gatorsmile <https://github.com/gatorsmile> The PR has been already
submitted: #20885 <#20885> . Frankly
speaking I would prefer another name for the option like we discussed
before: MaxGekk#1 <MaxGekk#1> but similar
PR for text datasource had been merged already: #20727
<#20727> . And I think it is more
important to have the same option across all datasource. That's why I
renamed *recordDelimiter* to *lineSep* in #20885
<#20885> / cc @rxin
<https://github.com/rxin>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20877 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AATvPKz5R1mF_QZcR0qPO-OBRoGZ3vIEks5th9XQgaJpZM4S2jpk>
.
|
Test build #88572 has finished for PR 20877 at commit
|
Yeah. |
Since both PRs are ready for review, let us review both and see which one is better |
There was a discussion about the naming here - #20727 (comment). I am against Both PR deal with different problems. This PR deals with line separator only and that PR deals with line separator + flexible option. |
If this one is merged, I believe it should be easier to review #20885 too. |
@@ -85,6 +86,16 @@ private[sql] class JSONOptions( | |||
|
|||
val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) | |||
|
|||
val lineSeparator: Option[String] = parameters.get("lineSep").map { sep => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be private?
thanks, merging to master! |
@HyukjinKwon do you wanna send a PR to add |
yea, I will. I'll be busy for a while but I will make it in the next week for sure. |
thanks all for reviewing this. |
## What changes were proposed in this pull request? This PR proposes to add lineSep option for a configurable line separator in text datasource. It supports this option by using `LineRecordReader`'s functionality with passing it to the constructor. The approach is similar with apache#20727; however, one main difference is, it uses text datasource's `lineSep` option to parse line by line in JSON's schema inference. ## How was this patch tested? Manually tested and unit tests were added. Author: hyukjinkwon <gurwls223@apache.org> Author: hyukjinkwon <gurwls223@gmail.com> Closes apache#20877 from HyukjinKwon/linesep-json.
@MaxGekk, are you busy? Do you have some time to go for CSV's lineSep? I think I wouldn't have some time within a couple of weeks. If you have some time, I would appreciate if you could go ahead. Otherwise, I will try this one after a couple of weeks. The problem in CSV's lineSep is about multiline support. As you might already know, CSV's multiline mode is different with JSON in a way it parses line by line from the stream whereas JSON treats it as a whole record in general - so we should set the lineSep to Univocity parser as well. The problem is, There are some changes made in #18581 . Might able to extract CSV related change and make some addition and deletion. If it's difficult to support |
@HyukjinKwon I will be on a vacation for 3 weeks but highly likely I will be in a place where there is no internet and even mobile networks at all. Yeh there are such places in Russia ;-) . So, even if I prepare a PR, I will be not able to response to any comments. |
Ah, happy vacation! |
@justinuang are you interested in taking over #20877 (comment) ? |
Sorry, I won't be able to take it over! |
What changes were proposed in this pull request?
This PR proposes to add lineSep option for a configurable line separator in text datasource.
It supports this option by using
LineRecordReader
's functionality with passing it to the constructor.The approach is similar with #20727; however, one main difference is, it uses text datasource's
lineSep
option to parse line by line in JSON's schema inference.How was this patch tested?
Manually tested and unit tests were added.