Update upstream #346

Merged

merged 1 commit into GulajavaMinistudio:master on Apr 29, 2018

Conversation

GulajavaMinistudio
Owner

…for json files

## What changes were proposed in this pull request?

I propose a new option for the JSON datasource that allows specifying the encoding (charset) of input and output files. Here is an example of using the option:

```
spark.read.schema(schema)
  .option("multiline", "true")
  .option("encoding", "UTF-16LE")
  .json(fileName)
```

If the option is not specified, the charset auto-detection mechanism is used by default.

The option can also be used when saving datasets to JSON files. Currently Spark can save datasets as JSON files in the `UTF-8` charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html . A user can specify the charset of the output JSON via the charset option, e.g. `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.
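
For the write path, usage might look like the following minimal sketch (`df` and `outputPath` are placeholder names for an existing DataFrame and an output directory, not part of this PR):

```
// Minimal sketch: write the dataset as JSON in UTF-16BE instead of the
// default UTF-8. `df` and `outputPath` are placeholder names.
df.write
  .option("charset", "UTF-16BE")
  .json(outputPath)
```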

The solution has the following restrictions for per-line mode (`multiline = false`):

- If the charset is different from UTF-8, the `lineSep` option must be specified (see the sketch after this list). The option is required because Hadoop's `LineReader` cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725

- Encodings with a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by MaxGekk#2
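
As noted in the first restriction above, a per-line read with a non-UTF-8 charset might look like this sketch (the schema, separator, and file name are placeholders):

```
// Minimal sketch of per-line (multiline = false) reading in UTF-16LE.
// The lineSep option is set explicitly because Hadoop's LineReader cannot
// detect the line separator for non-UTF-8 charsets (SPARK-23725).
val df = spark.read
  .schema(schema)
  .option("encoding", "UTF-16LE")
  .option("lineSep", "\n")
  .json(fileName)
```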

## How was this patch tested?

I added the following tests:
- reading a JSON file in `UTF-16LE` encoding with a BOM in `multiline` mode
- reading a JSON file using charset auto-detection (`UTF-32BE` with BOM)
- reading a JSON file using a user-specified charset (`UTF-16LE`)
- saving in `UTF-32BE` and reading the result back with the standard library (not with Spark), as sketched below
- checking that the default charset is `UTF-8`
- handling a wrong (unsupported) charset
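
As a rough illustration of the save/read-back check above (not the actual test code; the dataset and path are made up for the sketch, and a `spark` session is assumed to be in scope, e.g. in spark-shell):

```
// Rough sketch: write JSON in UTF-32BE, then decode the raw bytes with the
// standard Java library using the same charset to verify the encoding.
import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._
import spark.implicits._

val path = "/tmp/json-utf32be"   // hypothetical output path
Seq(("Alice", 1), ("Bob", 2)).toDF("name", "id")
  .coalesce(1)
  .write
  .option("charset", "UTF-32BE")
  .json(path)

// Locate the single part file and decode its bytes as UTF-32BE.
val partFile = Files.list(Paths.get(path)).iterator().asScala
  .find(_.getFileName.toString.startsWith("part-"))
  .get
val content = new String(Files.readAllBytes(partFile), Charset.forName("UTF-32BE"))
println(content)                 // one JSON object per line
```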

Author: Maxim Gekk <maxim.gekk@databricks.com>
Author: Maxim Gekk <max.gekk@gmail.com>

Closes #20937 from MaxGekk/json-encoding-line-sep.

GulajavaMinistudio merged commit c097b47 into GulajavaMinistudio:master on Apr 29, 2018