Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23765][SQL] Supports custom line separator for json datasource #20877

Closed
wants to merge 3 commits into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to add lineSep option for a configurable line separator in text datasource.
It supports this option by using LineRecordReader's functionality with passing it to the constructor.

The approach is similar with #20727; however, one main difference is, it uses text datasource's lineSep option to parse line by line in JSON's schema inference.

How was this patch tested?

Manually tested and unit tests were added.

@@ -251,5 +254,8 @@ private[sql] class JacksonGenerator(
mapType = dataType.asInstanceOf[MapType]))
}

def writeLineEnding(): Unit = gen.writeRaw('\n')
def writeLineEnding(): Unit = {
// Note that JSON uses writer with UTF-8 charset. This string will be written out as UTF-8.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant here:

def createOutputStreamWriter(
context: JobContext,
file: Path,
charset: Charset = StandardCharsets.UTF_8): OutputStreamWriter = {
new OutputStreamWriter(createOutputStream(context, file), charset)
}

@@ -208,9 +208,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
}
}

Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString).foreach { lineSep =>
// scalastyle:off nonascii
Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString, "아").foreach { lineSep =>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly unrelated but I just added. I am fine with reverting this out if it bugs anyone.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, "아" means just "ah" without any meaning ..

@HyukjinKwon
Copy link
Member Author

cc @cloud-fan, @MaxGekk and @hvanhovell, could you review this when you are available?

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88507 has finished for PR 20877 at commit c674f4a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88509 has finished for PR 20877 at commit 6cbf1ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@HyukjinKwon HyukjinKwon changed the title [SPARK-23765][SQL] Supports line separator for json datasource [SPARK-23765][SQL] Supports custom line separator for json datasource Mar 22, 2018
@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88515 has finished for PR 20877 at commit 6cbf1ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@MaxGekk
Copy link
Member

MaxGekk commented Mar 22, 2018

What about to make the option more flexible like in the PR: MaxGekk#1 ? It would be nice to handle JSON Streaming for example: https://en.wikipedia.org/wiki/JSON_streaming

@HyukjinKwon
Copy link
Member Author

I am neutral. Does that fix actual usecases? I can help review anyway. Would you like to make a followup separately?

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88518 has finished for PR 20877 at commit 6cbf1ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member

MaxGekk commented Mar 22, 2018

@HyukjinKwon We have a few clients who are interested in processing of JSON streaming like data. Here is the PR which combines your changes and mine: #20885

@MaxGekk
Copy link
Member

MaxGekk commented Mar 23, 2018

Does that fix actual usecases?

I see the following use cases:

  1. Jsons coming usually from embedded systems have not-standard separators (invisible in some cases). It is very convenient to open a file in hex editor and copy bytes between }{ to the lineSep option. This is the use case for the format with 'x' selector like: x0d 54 45

  2. In Json Streaming, records could be separated in pretty different ways. We should leave room for improvement I believe. See 'r' (for regexp) and '/' reserved selectors

  3. Some UTF-8 chars could cause errors from style (format) checkers. It is easier to represent such chars in hexadecimal format instead of disabling the checkers.

  4. In near future, json datasource will support input json in different charsets. If the source code in UTF-8 but input json in different charset, it is slightly hard to put such chars as values for the lineSep option. The x<hexs> format is more convenient here again.

@HyukjinKwon
Copy link
Member Author

Yup, yup. I don't object for now. Shall we merge this one first and talk more about it in your PR?
I believe this PR itself proposes a complete option and I saw many the requests for this feature here and there like mailing list.

@MaxGekk
Copy link
Member

MaxGekk commented Mar 23, 2018

I have only one concern: if we merge this PR, we close the possibility for changing format of lineSep and future extensions. Your changes allow any sequence of chars. It is not clear for me, how we can restrict it and assign different meanings to it in the future.

@HyukjinKwon
Copy link
Member Author

I think we are fine to change the behaviour of lineSep before the release ..

@HyukjinKwon
Copy link
Member Author

retest this please

@HyukjinKwon
Copy link
Member Author

@cloud-fan, @MaxGekk, and @hvanhovell, would you mind taking a look please when you have some time? I think this is pretty similar with #20727 except one difference that it uses text datasource's lineSep option to parse line by line in JSON's schema inference.

@SparkQA
Copy link

SparkQA commented Mar 25, 2018

Test build #88567 has finished for PR 20877 at commit 6cbf1ac.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 25, 2018

Test build #88568 has finished for PR 20877 at commit 6cbf1ac.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Mar 25, 2018

Test build #88570 has finished for PR 20877 at commit 6cbf1ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member

MaxGekk commented Mar 25, 2018

LGTM

@@ -268,6 +268,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
* <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
* per file</li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test case for testing the default covers \r, \r\n and \n?

@gatorsmile
Copy link
Member

@MaxGekk Will you submit a PR for addressing the comment #20877 (comment) in the next few weeks? If so, we can hold this PR.

@HyukjinKwon
Copy link
Member Author

He submitted this - #20885 and I believe we need more feedback and another review iteration.

@MaxGekk
Copy link
Member

MaxGekk commented Mar 25, 2018

@gatorsmile The PR has been already submitted: #20885 . Frankly speaking I would prefer another name for the option like we discussed before: MaxGekk#1 but similar PR for text datasource had been merged already: #20727 . And I think it is more important to have the same option across all datasource for now than arguing about option name . That's why I renamed recordDelimiter to lineSep in #20885 / cc @rxin

@HyukjinKwon
Copy link
Member Author

Correct me if I am wrong. None of renaming or adding more flexible functionality to the line separator blocks this PR, right?

Even if we go renaming, we should do it for text datasource too which I believe is better to do it separately, and the flexible functionality in the line separator looks needing more feedback and discussion.

@rxin
Copy link
Contributor

rxin commented Mar 25, 2018 via email

@SparkQA
Copy link

SparkQA commented Mar 25, 2018

Test build #88572 has finished for PR 20877 at commit f5e7d34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Yeah. recordDelimiter is better based on the semantics.

@gatorsmile
Copy link
Member

Since both PRs are ready for review, let us review both and see which one is better

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Mar 26, 2018

There was a discussion about the naming here - #20727 (comment). I am against recordDelimiter.

Both PR deal with different problems. This PR deals with line separator only and that PR deals with line separator + flexible option.

@HyukjinKwon
Copy link
Member Author

If this one is merged, I believe it should be easier to review #20885 too.

@@ -85,6 +86,16 @@ private[sql] class JSONOptions(

val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be private?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan
Copy link
Contributor

@HyukjinKwon do you wanna send a PR to add lineSep for CSV and fix the charset problem? thanks!

@asfgit asfgit closed this in 34c4b9c Mar 28, 2018
@HyukjinKwon
Copy link
Member Author

yea, I will. I'll be busy for a while but I will make it in the next week for sure.

@HyukjinKwon
Copy link
Member Author

thanks all for reviewing this.

mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
## What changes were proposed in this pull request?

This PR proposes to add lineSep option for a configurable line separator in text datasource.
It supports this option by using `LineRecordReader`'s functionality with passing it to the constructor.

The approach is similar with apache#20727; however, one main difference is, it uses text datasource's `lineSep` option to parse line by line in JSON's schema inference.

## How was this patch tested?

Manually tested and unit tests were added.

Author: hyukjinkwon <gurwls223@apache.org>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20877 from HyukjinKwon/linesep-json.
@HyukjinKwon
Copy link
Member Author

@MaxGekk, are you busy? Do you have some time to go for CSV's lineSep? I think I wouldn't have some time within a couple of weeks. If you have some time, I would appreciate if you could go ahead. Otherwise, I will try this one after a couple of weeks.

The problem in CSV's lineSep is about multiline support. As you might already know, CSV's multiline mode is different with JSON in a way it parses line by line from the stream whereas JSON treats it as a whole record in general - so we should set the lineSep to Univocity parser as well.

The problem is, lineSep at Univocity parser has some limitation (#18581 (comment) and see also https://github.com/uniVocity/univocity-parsers/issues/170).

There are some changes made in #18581 . Might able to extract CSV related change and make some addition and deletion.

If it's difficult to support lineSep more than one characters by the limitation, I think we can restrict the lineSep only to one character in multiLine mode.

@MaxGekk
Copy link
Member

MaxGekk commented Oct 12, 2018

... are you busy? Do you have some time to go for CSV's lineSep?

@HyukjinKwon I will be on a vacation for 3 weeks but highly likely I will be in a place where there is no internet and even mobile networks at all. Yeh there are such places in Russia ;-) . So, even if I prepare a PR, I will be not able to response to any comments.

@HyukjinKwon
Copy link
Member Author

Ah, happy vacation!

@HyukjinKwon
Copy link
Member Author

@justinuang are you interested in taking over #20877 (comment) ?

@justinuang
Copy link

Sorry, I won't be able to take it over!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants