[SPARK-23765][SQL] Supports custom line separator for json datasource #20877

HyukjinKwon · 2018-03-22T07:10:47Z

What changes were proposed in this pull request?

This PR proposes to add a lineSep option for a configurable line separator in the JSON datasource.
It supports this option by passing the separator to LineRecordReader's constructor and relying on its existing functionality.

The approach is similar to #20727; however, one main difference is that it uses the text datasource's lineSep option to parse line by line in JSON's schema inference.
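To make the intended behaviour concrete, here is a minimal sketch that uses only the Scala standard library and is not the Spark implementation. `LineSepSketch` and `splitRecords` are hypothetical names; the sketch assumes the separator is treated as a literal string rather than a regular expression.

```scala
import java.util.regex.Pattern

object LineSepSketch {
  // Split raw text into JSON records on a literal (non-regex) separator.
  // Empty chunks, such as the one after a trailing separator, are dropped.
  def splitRecords(text: String, lineSep: String): List[String] = {
    require(lineSep.nonEmpty, "lineSep cannot be empty")
    text.split(Pattern.quote(lineSep)).toList.filter(_.nonEmpty)
  }
}
```

With this PR, the Spark-side equivalent would presumably be passing the option to the reader, for example `spark.read.option("lineSep", "|").json(path)`.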

How was this patch tested?

Manually tested and unit tests were added.

HyukjinKwon · 2018-03-22T07:12:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala

@@ -251,5 +254,8 @@ private[sql] class JacksonGenerator(
      mapType = dataType.asInstanceOf[MapType]))
  }

-  def writeLineEnding(): Unit = gen.writeRaw('\n')
+  def writeLineEnding(): Unit = {
+    // Note that JSON uses writer with UTF-8 charset. This string will be written out as UTF-8.


I meant here:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala

Lines 88 to 93 in de36f65

def createOutputStreamWriter(
    context: JobContext,
    file: Path,
    charset: Charset = StandardCharsets.UTF_8): OutputStreamWriter = {
  new OutputStreamWriter(createOutputStream(context, file), charset)
}
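To illustrate the point about the UTF-8 writer, here is a minimal standalone sketch; it is not the `JacksonGenerator` code. `LineEndingSketch` and `writeAll` are hypothetical names, and a `ByteArrayOutputStream` stands in for the real output file.

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets

object LineEndingSketch {
  // Write each record followed by the separator through a UTF-8 writer
  // and return the raw bytes that would end up in the file.
  def writeAll(records: Seq[String], lineSep: String): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)
    try {
      records.foreach { record =>
        writer.write(record)
        writer.write(lineSep)
      }
    } finally writer.close() // close() flushes the writer into `out`
    out.toByteArray
  }
}
```

Because the writer encodes in UTF-8, a separator such as "아" reaches the file as three bytes, while "|" is written as a single byte.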

HyukjinKwon · 2018-03-22T07:13:09Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala

@@ -208,9 +208,11 @@ class TextSuite extends QueryTest with SharedSQLContext {
    }
  }

-  Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString).foreach { lineSep =>
+  // scalastyle:off nonascii
+  Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString, "아").foreach { lineSep =>


Strictly unrelated, but I added this while I was here. I am fine with reverting it if it bugs anyone.

BTW, "아" just means "ah", with no particular meaning.

HyukjinKwon · 2018-03-22T07:19:36Z

cc @cloud-fan, @MaxGekk and @hvanhovell, could you review this when you are available?

SparkQA · 2018-03-22T08:00:16Z

Test build #88507 has finished for PR 20877 at commit c674f4a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-22T08:02:47Z

Test build #88509 has finished for PR 20877 at commit 6cbf1ac.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-22T08:31:57Z

retest this please

SparkQA · 2018-03-22T11:08:09Z

Test build #88515 has finished for PR 20877 at commit 6cbf1ac.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-22T11:24:52Z

retest this please

MaxGekk · 2018-03-22T11:26:30Z

What about making the option more flexible, as in the PR MaxGekk#1? It would be nice to handle JSON streaming, for example: https://en.wikipedia.org/wiki/JSON_streaming

HyukjinKwon · 2018-03-22T11:42:39Z

I am neutral. Does that fix actual usecases? I can help review anyway. Would you like to make a followup separately?

SparkQA · 2018-03-22T14:58:51Z

Test build #88518 has finished for PR 20877 at commit 6cbf1ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-03-22T23:10:48Z

@HyukjinKwon We have a few clients who are interested in processing JSON-streaming-like data. Here is the PR which combines your changes and mine: #20885

MaxGekk · 2018-03-23T10:55:26Z

Does that fix actual usecases?

I see the following use cases:

JSON coming from embedded systems usually has non-standard separators (invisible in some cases). It is very convenient to open a file in a hex editor and copy the bytes between }{ into the lineSep option. This is the use case for the format with the 'x' selector, like: x0d 54 45
In JSON streaming, records can be separated in quite different ways. We should leave room for improvement, I believe. See the reserved 'r' (for regexp) and '/' selectors.
Some UTF-8 chars could cause errors from style (format) checkers. It is easier to represent such chars in hexadecimal format than to disable the checkers.
In the near future, the json datasource will support input JSON in different charsets. If the source code is in UTF-8 but the input JSON is in a different charset, it is slightly hard to put such chars in as values for the lineSep option. The x<hexs> format is more convenient here again.
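As a rough illustration of the 'x' selector idea, here is a minimal sketch. The exact grammar is not specified in this thread, so it assumes a leading `x` followed by whitespace-separated hex byte values; `HexSepSketch` and `separatorBytes` are hypothetical names.

```scala
import java.nio.charset.StandardCharsets

object HexSepSketch {
  // Assumed format: "x0d 54 45" -> bytes 0x0d, 0x54, 0x45.
  // Any spec that does not start with 'x' is taken literally as UTF-8 text.
  def separatorBytes(spec: String): Array[Byte] =
    if (spec.startsWith("x") && spec.length > 1) {
      spec.drop(1).trim.split("\\s+").map(hex => Integer.parseInt(hex, 16).toByte)
    } else {
      spec.getBytes(StandardCharsets.UTF_8)
    }
}
```

Under this assumed grammar a literal separator that happens to start with `x` can no longer be expressed directly, which is the kind of ambiguity between literal separators and selectors that needs to be settled before the option is released.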

HyukjinKwon · 2018-03-23T12:13:52Z

Yup, yup. I don't object for now. Shall we merge this one first and talk more about it in your PR?
I believe this PR by itself proposes a complete option, and I have seen many requests for this feature here and there, for example on the mailing list.

MaxGekk · 2018-03-23T12:33:31Z

I have only one concern: if we merge this PR, we close off the possibility of changing the format of lineSep and of future extensions. Your changes allow any sequence of chars. It is not clear to me how we could restrict it and assign different meanings to it in the future.

HyukjinKwon · 2018-03-23T13:31:53Z

I think we are fine to change the behaviour of lineSep before the release ..

HyukjinKwon · 2018-03-25T06:18:05Z

retest this please

HyukjinKwon · 2018-03-25T06:19:16Z

@cloud-fan, @MaxGekk, and @hvanhovell, would you mind taking a look when you have some time? I think this is pretty similar to #20727, except for one difference: it uses the text datasource's lineSep option to parse line by line in JSON's schema inference.

SparkQA · 2018-03-25T07:05:01Z

Test build #88567 has finished for PR 20877 at commit 6cbf1ac.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-25T07:16:57Z

retest this please

SparkQA · 2018-03-25T08:03:12Z

Test build #88568 has finished for PR 20877 at commit 6cbf1ac.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-25T08:07:44Z

retest this please

SparkQA · 2018-03-25T11:39:20Z

Test build #88570 has finished for PR 20877 at commit 6cbf1ac.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2018-03-25T12:42:38Z

LGTM

gatorsmile · 2018-03-25T16:29:15Z

sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala

@@ -268,6 +268,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
   * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
   * <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
   * per file</li>
+   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator


Add a test case to check that the default covers \r, \r\n and \n?
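The documented default can be sketched as follows. This is a standard-library illustration only, not the `LineRecordReader` logic; `DefaultLineSepSketch` and `splitLines` are hypothetical names.

```scala
import java.util.regex.Pattern

object DefaultLineSepSketch {
  // With no explicit separator, treat \r\n, \r and \n all as line endings.
  // \r\n is listed first so that it is consumed as a single separator.
  def splitLines(text: String, lineSep: Option[String] = None): List[String] = {
    val pattern = lineSep.map(s => Pattern.quote(s)).getOrElse("\r\n|\r|\n")
    text.split(pattern).toList
  }
}
```

A test for the default would feed input that mixes all three endings and expect one record per line, while an explicit separator would leave embedded newlines untouched.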

gatorsmile · 2018-03-25T16:37:13Z

@MaxGekk Will you submit a PR to address the comment #20877 (comment) in the next few weeks? If so, we can hold this PR.

HyukjinKwon · 2018-03-25T17:29:58Z

He submitted this - #20885 and I believe we need more feedback and another review iteration.

MaxGekk · 2018-03-25T17:36:23Z

@gatorsmile The PR has already been submitted: #20885. Frankly speaking, I would prefer another name for the option, as we discussed before in MaxGekk#1, but a similar PR for the text datasource has already been merged: #20727. And I think it is more important for now to have the same option across all datasources than to argue about the option name. That's why I renamed recordDelimiter to lineSep in #20885 / cc @rxin

HyukjinKwon · 2018-03-25T17:49:25Z

Correct me if I am wrong. Neither renaming nor adding more flexible functionality to the line separator blocks this PR, right?

Even if we go with renaming, we should do it for the text datasource too, which I believe is better done separately, and the flexible functionality for the line separator looks like it needs more feedback and discussion.

rxin · 2018-03-25T17:50:03Z

We can also change both if they haven’t been released yet.


SparkQA · 2018-03-25T21:18:16Z

Test build #88572 has finished for PR 20877 at commit f5e7d34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-03-25T23:10:30Z

Yeah. recordDelimiter is better based on the semantics.

gatorsmile · 2018-03-25T23:12:03Z

Since both PRs are ready for review, let us review both and see which one is better

HyukjinKwon · 2018-03-26T01:03:45Z

There was a discussion about the naming here - #20727 (comment). I am against recordDelimiter.

The two PRs deal with different problems. This PR deals with the line separator only, and that PR deals with the line separator plus a flexible option.

HyukjinKwon · 2018-03-26T01:11:56Z

If this one is merged, I believe it should be easier to review #20885 too.

cloud-fan · 2018-03-28T11:43:18Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

@@ -85,6 +86,16 @@ private[sql] class JSONOptions(

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>


this can be private?
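For context on the option under review, here is a minimal sketch of how a lineSep entry could be read from an options map. The diff above is truncated, so the body of the `map` block is an assumption (rejecting an empty separator), and a plain `Map[String, String]` stands in for the real `parameters`; `JsonOptionsSketch` is a hypothetical name.

```scala
object JsonOptionsSketch {
  // Read the optional "lineSep" entry from the options.
  // Assumed rule: an empty separator is rejected up front.
  def lineSeparator(parameters: Map[String, String]): Option[String] =
    parameters.get("lineSep").map { sep =>
      require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
      sep
    }
}
```

Returning an `Option` lets callers fall back to the default line endings when the option is absent.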

cloud-fan · 2018-03-28T11:48:52Z

thanks, merging to master!

cloud-fan · 2018-03-28T11:49:38Z

@HyukjinKwon do you wanna send a PR to add lineSep for CSV and fix the charset problem? thanks!

HyukjinKwon · 2018-03-28T13:02:23Z

yea, I will. I'll be busy for a while, but I will get to it within the next week for sure.

HyukjinKwon · 2018-03-28T13:03:06Z

thanks all for reviewing this.

## What changes were proposed in this pull request?

This PR proposes to add a lineSep option for a configurable line separator in the JSON datasource. It supports this option by passing the separator to `LineRecordReader`'s constructor and relying on its existing functionality.

The approach is similar to apache#20727; however, one main difference is that it uses the text datasource's `lineSep` option to parse line by line in JSON's schema inference.

## How was this patch tested?

Manually tested and unit tests were added.

Author: hyukjinkwon <gurwls223@apache.org>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#20877 from HyukjinKwon/linesep-json.

HyukjinKwon · 2018-10-12T02:25:10Z

@MaxGekk, are you busy? Do you have some time to work on CSV's lineSep? I don't think I will have time in the next couple of weeks. If you have some time, I would appreciate it if you could go ahead. Otherwise, I will try this one after a couple of weeks.

The problem with CSV's lineSep is multiline support. As you might already know, CSV's multiline mode differs from JSON's in that it parses line by line from the stream, whereas JSON generally treats the input as a whole record. So we should set lineSep on the Univocity parser as well.

The problem is that lineSep in the Univocity parser has some limitations (#18581 (comment), and see also https://github.com/uniVocity/univocity-parsers/issues/170).

There are some changes made in #18581. It might be possible to extract the CSV-related changes and make some additions and deletions.

If that limitation makes it difficult to support a lineSep longer than one character, I think we can restrict lineSep to a single character in multiLine mode.
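The restriction suggested in the last paragraph could be sketched like this. It is a hypothetical validation helper, not existing Spark or Univocity code; `CsvLineSepSketch`, `validate` and the error messages are all assumptions.

```scala
object CsvLineSepSketch {
  // Accept any non-empty separator in single-line mode, but only a
  // single character when multiLine is enabled (the proposed restriction).
  def validate(lineSep: String, multiLine: Boolean): Either[String, String] =
    if (lineSep.isEmpty) Left("lineSep cannot be empty")
    else if (multiLine && lineSep.length > 1)
      Left("lineSep must be a single character in multiLine mode")
    else Right(lineSep)
}
```

Returning an `Either` keeps the check side-effect free; a real option parser might instead fail fast with `require`.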

MaxGekk · 2018-10-12T08:49:14Z

... are you busy? Do you have some time to go for CSV's lineSep?

@HyukjinKwon I will be on vacation for 3 weeks, and very likely in a place where there is no internet or even mobile network at all. Yes, there are such places in Russia ;-). So, even if I prepare a PR, I will not be able to respond to any comments.

HyukjinKwon · 2018-10-12T09:50:00Z

Ah, happy vacation!

HyukjinKwon · 2018-10-14T08:10:43Z

@justinuang are you interested in taking over #20877 (comment) ?

justinuang · 2018-11-27T20:32:56Z

Sorry, I won't be able to take it over!

HyukjinKwon and others added 2 commits March 22, 2018 16:04

Supports line separator for json datasource

c674f4a

Consistent variable name

6cbf1ac

HyukjinKwon changed the title ~~[SPARK-23765][SQL] Supports line separator for json datasource~~ [SPARK-23765][SQL] Supports custom line separator for json datasource Mar 22, 2018

gatorsmile reviewed Mar 25, 2018

Address a comment

f5e7d34

HyukjinKwon mentioned this pull request Mar 27, 2018

[SPARK-23723] New charset option for json datasource #20849

Closed

cloud-fan reviewed Mar 28, 2018

asfgit closed this in 34c4b9c Mar 28, 2018

HyukjinKwon mentioned this pull request Oct 14, 2018

[SPARK-25493][SQL] Use auto-detection for CRLF in CSV datasource multiline mode #22503

Closed

HyukjinKwon deleted the linesep-json branch October 16, 2018 12:45

HyukjinKwon mentioned this pull request Nov 5, 2018

[SPARK-21289][SQL][ML] Supports custom line separator for all text-based datasources #18581

Closed
