Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should be global #18418

Closed
wants to merge 3 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Jun 26, 2017

What changes were proposed in this pull request?

The issue happens in ExternalMapToCatalyst. For example, the following codes create ExternalMapToCatalyst to convert Scala Map to catalyst map format.

val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
val ds = spark.createDataset(data)

The valueConverter in ExternalMapToCatalyst looks like:

if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)

There is a CreateNamedStruct expression (named_struct) to create a row of InnerData.name and InnerData.value that are referred by ExternalMapToCatalyst_value52.

Because ExternalMapToCatalyst_value52 are local variable, when CreateNamedStruct splits expressions to individual functions, the local variable can't be accessed anymore.

How was this patch tested?

Jenkins tests.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78599 has finished for PR 18418 at commit c417e22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 26, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78601 has finished for PR 18418 at commit bd0221a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78609 has finished for PR 18418 at commit bd0221a.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 26, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78613 has finished for PR 18418 at commit bd0221a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 26, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78625 has finished for PR 18418 at commit bd0221a.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 26, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78635 has finished for PR 18418 at commit bd0221a.

  • This patch fails due to an unknown error code, -10.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -736,7 +736,27 @@ class CodegenContext {
// Cannot split these expressions because they are not created from a row object.
return expressions.mkString("\n")
}
splitExpressions(expressions, "apply", ("InternalRow", row) :: Nil)
splitExpressions(row, expressions, Seq.empty)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the check in lines 735-738? Now, we do the same check at lines 754-757, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks.

@viirya
Copy link
Member Author

viirya commented Jun 27, 2017

cc @cloud-fan

// of the functions.
val (valExprCodes, valExprParams) = valExprs.map { expr =>
val exprCode = expr.genCode(ctx)
val lambdaVars = expr.collect {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks hacky, and why you only do it for CreateNamedStruct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, CreateArray is another expression using splitExpressions. However, if the external type is an array, MapObjects will be used to construct the internal array. CreateMap is the same, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC there are a lot of places we use splitExpressions, shall we check all of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very sure about other places. I move the collecting of lambda variable to splitExpressions and let the possible places to do this check.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach doesn't work well after rethinking about this and more experiments.

A simpler approach is letting those lambda variables global as MapObjectsandCollectObjectsToMap` do.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78666 has finished for PR 18418 at commit f4cb190.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78678 has finished for PR 18418 at commit d6161d5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

can you put more detailed information about what is the bug? I don't understand why it only happens for map inside case class.

@viirya viirya force-pushed the SPARK-19104 branch 2 times, most recently from eb061da to 65d8873 Compare June 27, 2017 09:27
Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think this is the right fix according to how we deal with this kind of problem before. LGTM

@cloud-fan
Copy link
Contributor

BTW please update PR tile to mention that it's a bug fix for ExternalMapToCatalyst

@viirya viirya changed the title [SPARK-19104][SQL] Lambda variables should work when parent expression splits generated codes [SPARK-19104][SQL] Lambda variables in ExternalMapToCatalyst should be global Jun 27, 2017
@viirya
Copy link
Member Author

viirya commented Jun 27, 2017

@cloud-fan Thanks. The PR title and description are both updated to reflect the change.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78690 has finished for PR 18418 at commit 65d8873.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78688 has finished for PR 18418 at commit eb061da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -354,4 +357,9 @@ class DatasetPrimitiveSuite extends QueryTest with SharedSQLContext {
checkDataset(Seq(PackageClass(1)).toDS(), PackageClass(1))
}

test("SPARK-19104: lambda variables should work when parent expression splits generated codes") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should also update the test name...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

miss it. updated.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78698 has finished for PR 18418 at commit cffb09b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 27, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 27, 2017

Test build #78701 has finished for PR 18418 at commit cffb09b.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

the test failure is unrelated, merging to master/2.2!

@asfgit asfgit closed this in fd8c931 Jun 27, 2017
asfgit pushed a commit that referenced this pull request Jun 27, 2017
…e global

The issue happens in `ExternalMapToCatalyst`. For example, the following codes create `ExternalMapToCatalyst` to convert Scala Map to catalyst map format.

    val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
    val ds = spark.createDataset(data)

The `valueConverter` in `ExternalMapToCatalyst` looks like:

    if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)

There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.

Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore.

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18418 from viirya/SPARK-19104.

(cherry picked from commit fd8c931)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
…e global

## What changes were proposed in this pull request?

The issue happens in `ExternalMapToCatalyst`. For example, the following codes create `ExternalMapToCatalyst` to convert Scala Map to catalyst map format.

    val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
    val ds = spark.createDataset(data)

The `valueConverter` in `ExternalMapToCatalyst` looks like:

    if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)

There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.

Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#18418 from viirya/SPARK-19104.
asfgit pushed a commit that referenced this pull request Jul 18, 2017
…alyst should be global

## What changes were proposed in this pull request?

This PR is backport of #18418 to Spark 2.1. [SPARK-21391](https://issues.apache.org/jira/browse/SPARK-21391) reported this problem in Spark 2.1.

The issue happens in `ExternalMapToCatalyst`. For example, the following codes create ExternalMap`ExternalMapToCatalyst`ToCatalyst to convert Scala Map to catalyst map format.

```
val data = Seq.tabulate(10)(i => NestedData(1, Map("key" -> InnerData("name", i + 100))))
val ds = spark.createDataset(data)
```
The `valueConverter` in `ExternalMapToCatalyst` looks like:

```
if (isnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true))) null else named_struct(name, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).name, true), value, assertnotnull(lambdavariable(ExternalMapToCatalyst_value52, ExternalMapToCatalyst_value_isNull52, ObjectType(class org.apache.spark.sql.InnerData), true)).value)
```
There is a `CreateNamedStruct` expression (`named_struct`) to create a row of `InnerData.name` and `InnerData.value` that are referred by `ExternalMapToCatalyst_value52`.

Because `ExternalMapToCatalyst_value52` are local variable, when `CreateNamedStruct` splits expressions to individual functions, the local variable can't be accessed anymore.

## How was this patch tested?

Added a new test suite into `DatasetPrimitiveSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18627 from kiszk/SPARK-21391.
@viirya viirya deleted the SPARK-19104 branch December 27, 2023 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants