Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25425][SQL] Extra options should override session options in DataSource V2 #22413

Closed
wants to merge 9 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Sep 13, 2018

What changes were proposed in this pull request?

In the PR, I propose overriding session options by extra options in DataSource V2. Extra options are more specific and set via .option(), and should overwrite more generic session options. Entries from seconds map overwrites entries with the same key from the first map, for example:

scala> Map("option" -> false) ++ Map("option" -> true)
res0: scala.collection.immutable.Map[String,Boolean] = Map(option -> true)

How was this patch tested?

Added a test for checking which option is propagated to a data source in load().

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 13, 2018

@cloud-fan Please, take a look.

@dongjoon-hyun
Copy link
Member

@MaxGekk . This looks worth for JIRA issue. Could you file a JIRA issue for this?

@@ -202,7 +202,7 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
DataSourceOptions.PATHS_KEY -> objectMapper.writeValueAsString(paths.toArray)
}
Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
ds, extraOptions.toMap ++ sessionOptions + pathsOption,
ds, sessionOptions ++ extraOptions.toMap + pathsOption,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, cc @rdblue since this is introduced at aadf953#diff-f70bda59304588cc3abfa3a9840653f4R197 .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun The commit didn't change semantic actually. Before it was:

(extraOptions ++
        DataSourceV2Utils.extractSessionConfigs(
          ds = ds.asInstanceOf[DataSourceV2],
          conf = sparkSession.sessionState.conf))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. It has more history. Thanks, @MaxGekk . Could you trace down when it started? We need to mark the affected version correctly in order to know the backport candidates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please add a test case for this. If this has a long history than SPARK-23203 (fixed at 2.4.0), we need to verify this during backporting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes were added in #19861

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Then, it was 2.3.0.

@SparkQA
Copy link

SparkQA commented Sep 13, 2018

Test build #96043 has finished for PR 22413 at commit 13729b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title Session options shouldn't override extra options [SPARK-25425][SQL] Extra options overwrite session options Sep 13, 2018

val relation = DataSourceV2Relation.create(source, options.toMap)
val relation = DataSourceV2Relation.create(source, options)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both read/write-side tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a test for read path since I was able to grab options propagated to DataSource but I have no idea so far for write path, only mocking probably. Does it make sense to do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For write path, I think we can use the traditional way instead of introducing mocking. Let me try.

@SparkQA
Copy link

SparkQA commented Sep 13, 2018

Test build #96051 has finished for PR 22413 at commit a443054.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Sep 14, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Sep 14, 2018

Test build #96057 has finished for PR 22413 at commit a443054.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but let me leave it to @cloud-fan

@cloud-fan
Copy link
Contributor

good catch! I believe this was a mistake. LGTM

@SparkQA
Copy link

SparkQA commented Sep 14, 2018

Test build #96078 has finished for PR 22413 at commit 48d4cef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataSourceWithSessionConfV2 extends SimpleDataSourceV2 with SessionConfigSupport

@@ -385,6 +400,9 @@ class SimpleDataSourceV2 extends DataSourceV2 with BatchReadSupportProvider {
}
}

class DataSourceWithSessionConfV2 extends SimpleDataSourceV2 with SessionConfigSupport {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataSourceWithSessionConfV2 -> DataSourceV2WithSessionConfig?

@dongjoon-hyun
Copy link
Member

@MaxGekk . Could you specify DSv2 more clearly in the PR title and description because it happens DSv2 code path only?

@MaxGekk MaxGekk changed the title [SPARK-25425][SQL] Extra options overwrite session options [SPARK-25425][SQL] Extra options override session options in DataSource V2 Sep 15, 2018
@SparkQA
Copy link

SparkQA commented Sep 15, 2018

Test build #96091 has finished for PR 22413 at commit 83789c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataSourceV2WithSessionConfigs extends SimpleDataSourceV2 with SessionConfigSupport

@@ -385,6 +400,9 @@ class SimpleDataSourceV2 extends DataSourceV2 with BatchReadSupportProvider {
}
}

class DataSourceV2WithSessionConfigs extends SimpleDataSourceV2 with SessionConfigSupport {
def keyPrefix(): String = "test"
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, it seems that we can use the existing one.

@dongjoon-hyun
Copy link
Member

@MaxGekk . I made a write-path test case PR to you. Could you review that, MaxGekk#8 ?

dongjoon-hyun and others added 2 commits September 15, 2018 12:54
Add a test case for write path. Thank you very much, @dongjoon-hyun
@MaxGekk MaxGekk changed the title [SPARK-25425][SQL] Extra options override session options in DataSource V2 [SPARK-25425][SQL] Extra options should override session options in DataSource V2 Sep 15, 2018
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM pending Jenkins tests.

@SparkQA
Copy link

SparkQA commented Sep 15, 2018

Test build #96096 has finished for PR 22413 at commit eba46d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2018

Test build #96098 has finished for PR 22413 at commit 325b9c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 16, 2018

Thank you, @MaxGekk . Merged to master.

There is some conflict. Please make backporting PRs to 2.4 and 2.3.

@asfgit asfgit closed this in e06da95 Sep 16, 2018
@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2018

+1, thanks for fixing this, @dongjoon-hyun!

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 19, 2018

Oh, it was @MaxGekk . I helped some test cases. Anyway, thank you for your +1, @rdblue .
BTW, @MaxGekk . Could you send a backport PR to branch-2.4?

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 19, 2018

There is some conflict. Please make backporting PRs to 2.4 and 2.3.
BTW, @MaxGekk . Could you send a backport PR to branch-2.4?

Sure, I will do that.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Sep 19, 2018

Thank you so much, @MaxGekk . It would be great if we can have that in Spark 2.4.0 RC2.

@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2018

Thanks @MaxGekk, sorry for the original omission!

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Sep 19, 2018
…ataSource V2

In the PR, I propose overriding session options by extra options in DataSource V2. Extra options are more specific and set via `.option()`, and should overwrite more generic session options. Entries from seconds map overwrites entries with the same key from the first map, for example:
```Scala
scala> Map("option" -> false) ++ Map("option" -> true)
res0: scala.collection.immutable.Map[String,Boolean] = Map(option -> true)
```

Added a test for checking which option is propagated to a data source in `load()`.

Closes apache#22413 from MaxGekk/session-options.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 19, 2018

@dongjoon-hyun Here is the PR for 2.4: #22474 . I will check does it have conflicts in 2.3 and if so, I will backport it to 2.3

@MaxGekk
Copy link
Member Author

MaxGekk commented Sep 19, 2018

@dongjoon-hyun It cannot be merged to 2.3 easily because DataSourceV2Relation doesn't have the options field, and the test for read is not compilable. I will try to fix it tomorrow.

@dongjoon-hyun
Copy link
Member

I see. Thank you, @MaxGekk !

MaxGekk added a commit to MaxGekk/spark that referenced this pull request Sep 20, 2018
…ataSource V2

In the PR, I propose overriding session options by extra options in DataSource V2. Extra options are more specific and set via `.option()`, and should overwrite more generic session options. Entries from seconds map overwrites entries with the same key from the first map, for example:
```Scala
scala> Map("option" -> false) ++ Map("option" -> true)
res0: scala.collection.immutable.Map[String,Boolean] = Map(option -> true)
```

Added a test for checking which option is propagated to a data source in `load()`.

Closes apache#22413 from MaxGekk/session-options.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@MaxGekk MaxGekk deleted the session-options branch August 17, 2019 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants