Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions #17693

Closed
wants to merge 3 commits into from

Conversation

ewasserman
Copy link

What changes were proposed in this pull request?

change to using Jackson's com.fasterxml.jackson.core.JsonFactory

public JsonParser createParser(String content)

How was this patch tested?

existing unit tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Apr 20, 2017

@ewasserman Could you fix the title to point out the JIRA SPARK-16548? It seems SPARK-20314 duplicates that. Also, I guess we need a test case.

In terms of behaviour, I see @marmbrus and @srowen in that JIRA. Please let me cc.

@@ -149,7 +149,7 @@ case class GetJsonObject(json: Expression, path: Expression)

if (parsed.isDefined) {
try {
Utils.tryWithResource(jsonFactory.createParser(jsonStr.getBytes)) { parser =>
Utils.tryWithResource(jsonFactory.createParser(jsonStr.toString)) { parser =>
Copy link
Member

@HyukjinKwon HyukjinKwon Apr 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should catch the exception below rather than introducing unnecessary string instance apparently. (Sorry, I left a comment and removed back. I thought it was from_json/to_json).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check whether the other exceptions could be thrown by the illegal outputs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the jsonStr.getBytes unchanged.

    /**
     * Method for constructing parser for parsing
     * the contents of given byte array.
     * 
     * @since 2.1
     */
    public JsonParser createParser(byte[] data) throws IOException, JsonParseException {
        IOContext ctxt = _createContext(data, true);
        if (_inputDecorator != null) {
            InputStream in = _inputDecorator.decorate(ctxt, data, 0, data.length);
            if (in != null) {
                return _createParser(in, ctxt);
            }
        }
        return _createParser(data, 0, data.length, ctxt);
    }

I think we should capture both IOException and JsonParseException.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought JsonParseException extends IOException.

@gatorsmile
Copy link
Member

Could you add a test case in JsonExpressionsSuite?

@liancheng
Copy link
Contributor

Not suggesting doing it in this PR but maybe adding a SQL option to let the users choose the error handling strategy of all the JSON functions probably makes more sense here? The Spark JSON data source allows users to choose a parsing mode among:

  • PERMISSIVE: replacing malformed records with nulls,
  • DROPMALFORMED: drop malformed records, and
  • FAILFAST: report an error and abort once a malformed record is found.

@gatorsmile
Copy link
Member

@liancheng Good suggestion! Just like what we did for from_json/to_json.

@HyukjinKwon
Copy link
Member

I like the idea but I am not sure of DROPMALFORMED mode though. If we use an expression with the mode enabled, whole record (not only the column but all columns) will be dropped in some json expressions, probably not a generator expressions (did I understand correctly?).

I think we don't explicitly support parse modes in both from_json/to_json -

new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))

It sets FAILFAST but resembles PERMISSIVE mode up to my knowledge.

@@ -393,7 +393,7 @@ case class JsonTuple(children: Seq[Expression])
}

try {
Utils.tryWithResource(jsonFactory.createParser(json.getBytes)) {
Utils.tryWithResource(jsonFactory.createParser(json.toString)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change has performance penalty, we should just catch more exception below.

@ewasserman ewasserman changed the title [SPARK-20314][SQL] Inconsistent error handling in JSON parsing SQL functions [SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions Apr 21, 2017
@ewasserman
Copy link
Author

Reverted from use of toString on the org.apache.spark.unsafe.types.UTF8String by running the byte array through a java.io.Reader. This still fixes the bug and is also more efficient on the JSON parser side so it is a net performance win as well.

@@ -149,7 +149,8 @@ case class GetJsonObject(json: Expression, path: Expression)

if (parsed.isDefined) {
try {
Utils.tryWithResource(jsonFactory.createParser(jsonStr.getBytes)) { parser =>
Utils.tryWithResource(jsonFactory.createParser(new InputStreamReader(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add some comments to say that, this is to avoid a bug in encoding detection, and we explicitly specify the encoding(UTF8) here.

@cloud-fan
Copy link
Contributor

LGTM

@gatorsmile
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Apr 22, 2017

Test build #76068 has finished for PR 17693 at commit f706ce3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 26, 2017

Test build #76153 has finished for PR 17693 at commit 91bb487.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Apr 26, 2017
…nctions

## What changes were proposed in this pull request?

change to using Jackson's `com.fasterxml.jackson.core.JsonFactory`

    public JsonParser createParser(String content)

## How was this patch tested?

existing unit tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Eric Wasserman <ericw@sgn.com>

Closes #17693 from ewasserman/SPARK-20314.

(cherry picked from commit 57e1da3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.2

@asfgit asfgit closed this in 57e1da3 Apr 26, 2017
asfgit pushed a commit that referenced this pull request May 2, 2017
…nToStructs

## What changes were proposed in this pull request?

A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.

## How was this patch tested?

Regression test

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #17826 from brkyvz/SPARK-20549.

(cherry picked from commit 86174ea)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request May 2, 2017
…nToStructs

## What changes were proposed in this pull request?

A fix for the same problem was made in apache#17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`.

## How was this patch tested?

Regression test

Author: Burak Yavuz <brkyvz@gmail.com>

Closes apache#17826 from brkyvz/SPARK-20549.
rdblue pushed a commit to rdblue/spark that referenced this pull request Apr 3, 2019
…nctions

## What changes were proposed in this pull request?

change to using Jackson's `com.fasterxml.jackson.core.JsonFactory`

    public JsonParser createParser(String content)

## How was this patch tested?

existing unit tests

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Eric Wasserman <ericw@sgn.com>

Closes apache#17693 from ewasserman/SPARK-20314.

(cherry picked from commit 57e1da3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c8803c0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants