…for json files
What changes were proposed in this pull request?
I propose a new option for the JSON datasource which allows specifying the encoding (charset) of input and output files. Here is an example of using the option:
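A minimal sketch of reading with the new option (assuming a SparkSession named `spark`, a user-defined `schema`, and a placeholder file path):

```scala
val ds = spark.read
  .schema(schema)                // user-defined schema for the JSON records
  .option("multiline", "true")   // whole-file (multiline) mode
  .option("charset", "UTF-16LE") // the proposed option
  .json("/path/to/file.json")    // placeholder path
```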
If the option is not specified, the charset auto-detection mechanism is used by default.
The option can also be used for saving datasets to JSON files. Currently Spark is able to save datasets into JSON files in the `UTF-8` charset only. The changes allow saving data in any supported charset. Here is the approximate list of charsets supported by Oracle Java SE: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html. A user can specify the charset of output JSON via the charset option, e.g. `.option("charset", "UTF-16BE")`. By default the output charset is still `UTF-8` to keep backward compatibility.
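For example, writing in `UTF-16BE` could look like this (a sketch; `df` and the output path are placeholders, and `lineSep` is set explicitly because of the per-line mode restriction described below):

```scala
df.write
  .option("charset", "UTF-16BE") // charset of the output JSON files
  .option("lineSep", "\n")       // explicit separator, required for non-UTF-8 charsets in per-line mode
  .json("/path/to/output")       // placeholder path
```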
The solution has the following restrictions for per-line mode (`multiline = false`):

- If the charset is different from `UTF-8`, the `lineSep` option must be specified. The option is required because Hadoop LineReader cannot detect the line separator correctly. Here is the ticket for solving the issue: https://issues.apache.org/jira/browse/SPARK-23725
- Encodings with BOM are not supported. For example, the `UTF-16` and `UTF-32` encodings are blacklisted. The problem can be solved by [SPARK-23723][SPARK-23724][SQL] Flexible format for the lineSep option of CSV datasource MaxGekk/spark#2

How was this patch tested?
I added the following tests:

- reading a JSON file in `UTF-16LE` encoding with BOM in `multiline` mode
- reading a JSON file by charset auto-detection (`UTF-32BE` with BOM)
- reading a JSON file using the user's charset (`UTF-16LE`)
- saving a dataset in `UTF-32BE` and reading the result by the standard library (not by Spark)
- checking that the default output charset is still `UTF-8`
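A sketch of the save-and-verify round trip mentioned above (assumes a spark-shell session with `spark.implicits._` in scope; paths and data are placeholders):

```scala
import java.nio.charset.Charset
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

val out = "/tmp/json-utf32be" // placeholder output directory

// Save a tiny dataset as JSON in UTF-32BE with an explicit line separator.
Seq((1, "a"), (2, "b")).toDF("id", "s")
  .write.mode("overwrite")
  .option("charset", "UTF-32BE")
  .option("lineSep", "\n")
  .json(out)

// Read one part file back with the standard Java library (not Spark) and decode it.
val part = Files.list(Paths.get(out)).iterator().asScala
  .find(_.getFileName.toString.startsWith("part-")).get
val text = new String(Files.readAllBytes(part), Charset.forName("UTF-32BE"))
assert(text.contains(""""id":1"""))
```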
Author: Maxim Gekk maxim.gekk@databricks.com
Author: Maxim Gekk max.gekk@gmail.com
Closes apache#20937 from MaxGekk/json-encoding-line-sep.