[SPARK-23094][SPARK-23723][SPARK-23724][SQL] Support custom encoding for json files #20937
Conversation
… only in the test
Test build #89691 has finished for PR 20937 at commit
python/pyspark/sql/readwriter.py (Outdated)
@@ -237,6 +237,9 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
    :param allowUnquotedControlChars: allows JSON Strings to contain unquoted control
                                      characters (ASCII characters with value less than 32,
                                      including tab and line feed characters) or not.
    :param encoding: standard encoding (charset) name, for example UTF-8, UTF-16LE and UTF-32BE.
                     If None is set, the encoding of input JSON will be detected automatically
                     when the multiLine option is set to ``true``.
Does it mean users have to set the encoding if `multiLine` is false?
No, it doesn't. If it did, it would break backward compatibility. In the comment we just want to highlight that encoding auto-detection (meaning correct auto-detection in all cases) is officially supported in `multiLine` mode only.
In per-line mode, the auto-detection mechanism (when `encoding` is not set) can fail in some cases, for example if the actual encoding of the JSON file is `UTF-16` with BOM, but in some cases it works (the file's encoding is `UTF-8` and the actual line separator is `\n`, for example). That's why @HyukjinKwon suggested mentioning only the working case.
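For illustration, a per-line read of a non-UTF-8 file could look like the sketch below (the path is a placeholder; both options are set explicitly because auto-detection is only guaranteed in `multiLine` mode):

```scala
// Hypothetical example: per-line (multiLine = false) JSON stored in UTF-16LE.
// Both encoding and lineSep are set explicitly, since auto-detection is only
// officially supported when multiLine is enabled.
val df = spark.read
  .option("encoding", "UTF-16LE")
  .option("lineSep", "\n")
  .json("/tmp/people_utf16le.json")
```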
@@ -773,6 +776,8 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
    formats follow the formats at ``java.text.SimpleDateFormat``.
    This applies to timestamp type. If None is set, it uses the
    default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
    :param encoding: specifies encoding (charset) of saved json files. If None is set,
                     the default UTF-8 charset will be used.
Shall we mention that, if `encoding` is set and is not UTF-8, `lineSep` also needs to be set when `multiLine` is false?
Yes, we can mention this in the comment, but the user will get the error https://github.com/MaxGekk/spark-1/blob/482b79969b9e0cc475e63b415051b32423facef4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L116-L117 if `lineSep` is not set.
Sorry, I didn't realize initially that the comment related to writing. For writing, if `lineSep` is not set by the user, it will be set to `\n` in any case: https://github.com/MaxGekk/spark-1/blob/482b79969b9e0cc475e63b415051b32423facef4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala#L124
Actually, the current implementation is stricter than it needs to be. It requires setting `lineSep` explicitly on write if `multiLine` is `false` and `encoding` is different from `UTF-8`.
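Under that restriction, a write in a non-UTF-8 charset would look roughly like the sketch below (the output path is a placeholder):

```scala
// Sketch: writing JSON in a non-UTF-8 charset. With the current implementation,
// lineSep has to be set explicitly in this case, even though it would default
// to \n for UTF-8 output.
df.write
  .option("encoding", "UTF-16LE")
  .option("lineSep", "\n")
  .json("/tmp/output_utf16le")
```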
s"""The ${enc} encoding must not be included in the blacklist when multiLine is disabled: | ||
| ${blacklist.mkString(", ")}""".stripMargin) | ||
|
||
val forcingLineSep = !(multiLine == false && enc != "UTF-8" && lineSeparator.isEmpty) |
enc != "UTF-8"
, we should not compare string directly, but turn them into Charset
"""JSON parser cannot handle a character in its input. | ||
|Specifying encoding as an input option explicitly might help to resolve the issue. | ||
|""".stripMargin + e.getMessage | ||
throw new CharConversionException(msg) |
This will lose the original stack trace; we should do something like:
val newException = new CharConversionException(msg)
newException.setStackTrace(e.getStackTrace)
throw newException
BTW we should also follow the existing rule and wrap the exception with `BadRecordException`. See the code above.
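Combining both suggestions might look roughly like the sketch below; the `BadRecordException(record, partialResult, cause)` shape and the `recordLiteral` helper are assumed to match the surrounding `parse()` code:

```scala
// Sketch only: keep the original exception (and its stack trace) as the cause,
// and wrap everything in BadRecordException like the existing catch clauses do.
case e: CharConversionException =>
  val msg =
    """JSON parser cannot handle a character in its input.
      |Specifying encoding as an input option explicitly might help to resolve the issue.
      |""".stripMargin + e.getMessage
  val wrappedCharException = new CharConversionException(msg)
  wrappedCharException.initCause(e)
  throw BadRecordException(() => recordLiteral(record), () => None, wrappedCharException)
```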
LGTM except a few minor comments
…nto json-encoding-line-sep
Test build #89741 has finished for PR 20937 at commit
retest this please
Test build #89751 has finished for PR 20937 at commit
Seems fine, but please allow me to take another look, which I will do within this weekend.
Test build #89938 has finished for PR 20937 at commit
LGTM otherwise!
@@ -43,7 +47,38 @@ private[sql] object CreateJacksonParser extends Serializable {
    jsonFactory.createParser(record.getBytes, 0, record.getLength)
  }

  def inputStream(jsonFactory: JsonFactory, record: InputStream): JsonParser = {
    jsonFactory.createParser(record)

  def getStreamDecoder(enc: String, in: Array[Byte], length: Int): StreamDecoder = {
nit: private?
  }

  def inputStream(enc: String, jsonFactory: JsonFactory, is: InputStream): JsonParser = {
    jsonFactory.createParser(new InputStreamReader(is, enc))
I think #20937 (comment) is a good investigation. It should be good to leave a small note that we should avoid this way if possible.
I added a comment above
s"""The ${enc} encoding must not be included in the blacklist when multiLine is disabled: | ||
| ${blacklist.mkString(", ")}""".stripMargin) | ||
|
||
val forcingLineSep = !(multiLine == false && |
`forcingLineSep` -> things like ... `isLineSepRequired`?
@@ -372,6 +372,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
   * <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
   * per file</li>
   * <li>`encoding` (by default it is not set): allows to forcibly set one of standard basic
Not a big deal but shall we match the description to Python side?
I updated the Python comment to make it the same as here.
    val benchmark = new Benchmark("JSON schema inferring", rowsNum)

    withTempPath { path =>
      // scalastyle:off
// scalastyle:off println
...
// scalastyle:on println
    }
  }

  test("SPARK-23094: invalid json with leading nulls - from dataset") {
let's set PERMISSIVE explicitly and add this fact to this test title too.
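For instance, the test read could make the mode explicit (a sketch, assuming `spark.implicits._` is in scope for `toDS`):

```scala
// Sketch: set the PERMISSIVE parse mode explicitly instead of relying on the default.
val df = spark.read
  .option("mode", "PERMISSIVE")
  .json(Seq(badJson).toDS())
```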
    }
  }

  test("SPARK-23094: invalid json with leading nulls - from file (multiLine=false)") {
ditto
  private val badJson = "\u0000\u0000\u0000A\u0001AAA"

  test("SPARK-23094: invalid json with leading nulls - from file (multiLine=true)") {
ditto
    assert(exception.getMessage == encoding)
  }

  test("SPARK-23723: read written json in UTF-16LE") {
Test title like ... read back or roundtrip in read and write?
    ).repartition(2)
    ds.write
      .options(options)
      .format("json").mode("overwrite")
ditto for overwrite
Test build #89962 has finished for PR 20937 at commit
(12, "===", "US-ASCII", false), | ||
(13, "$^+", "utf-32le", true) | ||
).foreach { | ||
case (testNum, sep, encoding, inferSchema) => checkReadJson(sep, encoding, inferSchema, testNum) |
foreach { case (testNum, sep, encoding, inferSchema) =>
...
}
This is actually a style - https://github.com/databricks/scala-style-guide#pattern-matching
not a big deal
  }

  def checkEncoding(expectedEncoding: String, pathToJsonFiles: String,
      expectedContent: String): Unit = {
I think it should be
def checkEncoding(
expectedEncoding: String,
pathToJsonFiles: String,
expectedContent: String): Unit = {
per https://github.com/databricks/scala-style-guide#spacing-and-indentation
or
def checkEncoding(
expectedEncoding: String, pathToJsonFiles: String, expectedContent: String): Unit = {
if it fits per databricks/scala-style-guide#58 (comment)
Not a big deal
import org.apache.spark.util.{Benchmark, Utils}

/**
 * The benchmarks aims to measure performance of JSON parsing when encoding is set and isn't.
I usually avoid abbreviation in the doc tho.
  }

  private def createParser(enc: String, jsonFactory: JsonFactory,
      stream: PortableDataStream): JsonParser = {
ditto for style
Merged to master !!!
It doesn't necessarily need a followup for the styles, but it would be good to remember those when we review related PRs next time. Thanks for bearing with me here.
What changes were proposed in this pull request?
I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option:
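(A minimal sketch of such a read; the path and `schema` are placeholders, not the PR's original snippet.)

```scala
// Hypothetical usage: read a JSON file encoded in UTF-16LE in multiLine mode.
val df = spark.read
  .schema(schema)
  .option("multiLine", true)
  .option("encoding", "UTF-16LE")
  .json("/path/to/file.json")
```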
If the option is not specified, the charset auto-detection mechanism is used by default.
The option can also be used for saving datasets to JSON files. Currently Spark is able to save datasets into JSON files in the `UTF-8` charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html. A user can specify the charset of output JSON via the charset option, like `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.

The solution has the following restrictions for per-line mode (`multiline = false`):

- If the charset is different from UTF-8, the `lineSep` option must be specified. The option is required because Hadoop LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725
- Encodings with BOM are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by [SPARK-23723][SPARK-23724][SQL] Flexible format for the lineSep option of CSV datasource MaxGekk/spark#2

How was this patch tested?
I added the following tests:

- reading a JSON file in the `UTF-16LE` encoding with BOM in `multiline` mode
- reading back JSON written by Spark (`UTF-32BE` with BOM)
- reading back JSON written by Spark (`UTF-16LE`)
- saving in `UTF-32BE` and reading the result by the standard library (not by Spark)
- checking the default `UTF-8` encoding