
[SPARK-17701][SQL] Refactor RowDataSourceScanExec so its sameResult call does not compare strings #18600

Closed
wants to merge 2 commits

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

Currently, RowDataSourceScanExec and FileSourceScanExec rely on a "metadata" string map to implement equality comparison, since the RDDs they depend on cannot be directly compared. This has resulted in a number of correctness bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818.

To make these comparisons less brittle, we should refactor these classes to compare constructor parameters directly instead of relying on the metadata map.

This PR refactors RowDataSourceScanExec; FileSourceScanExec will be fixed in a follow-up PR.
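Below is a minimal sketch of the idea, using hypothetical stand-in classes (`StringKeyedScan` and `ParamKeyedScan` are illustrations, not the actual Spark classes): string-map equality can spuriously match two scans whose rendered metadata happens to coincide, while case-class equality over the real constructor parameters compares the scan's actual identity.

```scala
// Hypothetical stand-ins, not the real Spark classes.

// Before: equality effectively hinges on a rendered string map. Two scans whose
// maps render identically compare equal even if they differ underneath, which
// is how spurious exchange reuse can arise.
case class StringKeyedScan(metadata: Map[String, String])

val s1 = StringKeyedScan(Map("ReadSchema" -> "struct<i:int>"))
val s2 = StringKeyedScan(Map("ReadSchema" -> "struct<i:int>")) // could wrap a different RDD
assert(s1 == s2) // matches on strings alone

// After: equality compares structured constructor parameters directly.
case class ParamKeyedScan(
    requiredColumnsIndex: Seq[Int],
    pushedFilters: Seq[String]) // stand-in for data source `Filter`s

val p1 = ParamKeyedScan(Seq(0, 2), Seq("GreaterThan(i,5)"))
val p2 = ParamKeyedScan(Seq(0, 2), Seq("GreaterThan(i,5)"))
assert(p1 == p2) // case-class equality over the real parameters
```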

How was this patch tested?

Existing tests.

case r: HadoopFsRelation if r.fileFormat.isInstanceOf[ParquetSource] =>
  !SparkSession.getActiveSession.get.sessionState.conf.getConf(
    SQLConf.PARQUET_VECTORIZED_READER_ENABLED)
case _: HadoopFsRelation => true
cloud-fan (Contributor, Author)

HadoopFsRelation never goes into RowDataSourceScanExec

@@ -395,25 +367,33 @@ case class DataSourceStrategy(conf: SQLConf) extends Strategy with Logging with
.asInstanceOf[Seq[Attribute]]
// Match original case of attributes.
.map(relation.attributeMap)
// Don't request columns that are only referenced by pushed filters.
.filterNot(handledSet.contains)
cloud-fan (Contributor, Author)

In this branch, filterSet is a subset of projectSet, so the filterNot is a no-op.
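A hedged, self-contained illustration of why (hypothetical column names; this mirrors the set logic, not the actual DataSourceStrategy code): handledSet only holds columns referenced solely by pushed filters, and when every filter column is also projected, that set is empty.

```scala
// Hypothetical column names, illustrating the subset argument.
val projectSet = Set("a", "b", "c")
val filterSet  = Set("a", "b")            // subset of projectSet in this branch
val handledSet = filterSet -- projectSet  // empty: no filter-only columns

val requestedColumns = projectSet.toSeq
// Dropping handled filter-only columns removes nothing.
assert(requestedColumns.filterNot(handledSet.contains) == requestedColumns)
```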

@cloud-fan (Contributor, Author)

cc @ericl @gatorsmile

@SparkQA

SparkQA commented Jul 11, 2017

Test build #79523 has finished for PR 18600 at commit 5008eb6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please

@SparkQA

SparkQA commented Jul 11, 2017

Test build #79527 has finished for PR 18600 at commit 5008eb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

extends DataSourceScanExec {

  def output: Seq[Attribute] = requiredColumnsIndex.map(fullOutput)
@dongjoon-hyun (Member), Jul 11, 2017

`def output: Seq[Attribute]` -> `override def output: Seq[Attribute]`?

  scanBuilder(requestedColumns, candidatePredicates, pushedFilters),
- relation.relation, UnknownPartitioning(0), metadata,
- relation.catalogTable.map(_.identifier))
+ relation.relation, relation.catalogTable.map(_.identifier))
@dongjoon-hyun (Member), Jul 11, 2017

nit: can we make this into two lines during the refactoring?

  relation.relation,
  relation.catalogTable.map(_.identifier))

  scanBuilder(requestedColumns, candidatePredicates, pushedFilters),
- relation.relation, UnknownPartitioning(0), metadata,
- relation.catalogTable.map(_.identifier))
+ relation.relation, relation.catalogTable.map(_.identifier))
Member

ditto.

- output: Seq[Attribute],
+ fullOutput: Seq[Attribute],
+ requiredColumnsIndex: Seq[Int],
+ filters: Set[Filter],
Member

Start it in this PR?

  rdd: RDD[InternalRow],
  @transient relation: BaseRelation,
- override val outputPartitioning: Partitioning,
- override val metadata: Map[String, String],
Member

uh... This is not being used after our previous refactoring.

Member

metadata is still needed. It is being used here.

// Combines all Catalyst filter `Expression`s that are either not convertible to data source
// `Filter`s or cannot be handled by `relation`.
val filterCondition = unhandledPredicates.reduceLeftOption(expressions.And)

// These metadata values make scan plans uniquely identifiable for equality checking.
// TODO(SPARK-17701) using strings for equality checking is brittle
val metadata: Map[String, String] = {
Member

We need to keep it.

cloud-fan (Contributor, Author)

The target of this cleanup PR is to remove the metadata...

@gatorsmile (Member)

LGTM except the above three comments.

@@ -72,11 +72,6 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializ
   }

-  /**
-   * @return Metadata that describes more details of this SparkPlan.
-   */
-  def metadata: Map[String, String] = Map.empty
cloud-fan (Contributor, Author)

We introduced metadata to work around the equality issue of data source scans. Now that it's fixed, we can remove it.
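A hedged sketch of the idea behind the removal (simplified; not Spark's actual implementation): once a scan's identity lives in real constructor parameters, result equality can fall out of structural equality over canonicalized plans instead of a rendered string map. `PlanLike`, `Scan`, and the `sameResult` helper below are hypothetical.

```scala
// Hypothetical mini-model of the equality scheme.
trait PlanLike {
  // In Spark, canonicalization normalizes things like expression IDs;
  // here the identity transform stands in for it.
  def canonicalized: PlanLike = this
}

case class Scan(requiredColumnsIndex: Seq[Int], handledFilters: Set[String])
  extends PlanLike

// Mirrors the shape of the check: compare canonicalized plans structurally.
def sameResult(a: PlanLike, b: PlanLike): Boolean =
  a.canonicalized == b.canonicalized

assert(sameResult(Scan(Seq(0, 1), Set("f")), Scan(Seq(0, 1), Set("f"))))
```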

@@ -31,7 +31,6 @@ class SparkPlanInfo(
     val nodeName: String,
     val simpleString: String,
     val children: Seq[SparkPlanInfo],
-    val metadata: Map[String, String],
cloud-fan (Contributor, Author)

This is a developer API, and I don't think users can do anything useful with metadata because it was just a hack. It should be safe to remove it.

Member

It seems that @LantaoJin brought this back for event logging in #22353.

@SparkQA

SparkQA commented Jul 12, 2017

Test build #79552 has finished for PR 18600 at commit 2dc4ce1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

@gatorsmile (Member)

Thanks! Merging to master.

@asfgit asfgit closed this in 780586a Jul 12, 2017
asfgit pushed a commit that referenced this pull request Sep 15, 2017
## What changes were proposed in this pull request?

In #18600 we removed the `metadata` field from `SparkPlanInfo`. This causes a problem when we replay event logs that are generated by older Spark versions.

## How was this patch tested?

A regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19237 from cloud-fan/event.
asfgit pushed a commit that referenced this pull request Sep 13, 2018
…tion like file path to event log

## What changes were proposed in this pull request?

The `metadata` field was removed from `SparkPlanInfo` in #18600 . Correspondingly, a lot of metadata was also removed from the `SparkListenerSQLExecutionStart` event in the Spark event log. If we want to analyze an event log to get all input paths, we can no longer get them; the `simpleString` of the `SparkPlanInfo` JSON only displays 100 characters, which doesn't help.

Before 2.3, the `SparkListenerSQLExecutionStart` fragment in the event log looked like the example below (it contains the `metadata` field with the intact information):
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4..., "metadata": {"Location": "InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4/test5/snapshot/dt=20180904]","ReadSchema":"struct<snpsht_start_dt:date,snpsht_end_dt:date,am_ntlogin_name:string,am_first_name:string,am_last_name:string,isg_name:string,CRE_DATE:date,CRE_USER:string,UPD_DATE:timestamp,UPD_USER:string>"}

After #18600, the `metadata` field was removed:
>{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart", Location: InMemoryFileIndex[hdfs://cluster1/sys/edw/test1/test2/test3/test4...,

So I added this field back to the `SparkPlanInfo` class, so the metadata is logged to the event log again. Intact information in the event log is very useful for offline job analysis.

## How was this patch tested?
A unit test.

Closes #22353 from LantaoJin/SPARK-25357.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6dc5921)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
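A hedged sketch of the restored shape (simplified and renamed to avoid claiming the exact API; the real class also carries metrics): the metadata map travels with the plan info again, so event-log entries such as Location and ReadSchema survive intact.

```scala
// Simplified, hypothetical mirror of the restored class; not the real Spark API.
class SparkPlanInfoSketch(
    val nodeName: String,
    val simpleString: String,
    val children: Seq[SparkPlanInfoSketch],
    val metadata: Map[String, String]) // restored so event logs keep full details

val info = new SparkPlanInfoSketch(
  nodeName = "FileScan",
  simpleString = "FileScan parquet ...", // rendering capped at 100 characters
  children = Nil,
  metadata = Map("Location" -> "InMemoryFileIndex[hdfs://cluster1/...]"))
```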
asfgit pushed a commit that referenced this pull request on Sep 13, 2018 (the same change as above, cherry-picked from commit 6dc5921).

fjh100456 pushed a commit to fjh100456/spark that referenced this pull request on Sep 13, 2018 (the same change as above).