-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions #24025
Conversation
* Returns the boolean value to which the specified key is mapped, | ||
* or defaultValue if there is no mapping for the key. The key match is case-insensitive | ||
*/ | ||
public boolean getBoolean(String key, boolean defaultValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 4 methods are from DataSourceOptions
, which are pretty general and useful.
* A simple test suite to verify `DataSourceOptions`. | ||
*/ | ||
class DataSourceOptionsSuite extends SparkFunSuite { | ||
class CaseInsensitiveStringMapSuite extends SparkFunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's awkward to write test in Java. I rewrite it in Scala and merge it with the original DataSourceOptionsSuite
I think this PR changes too many files... |
|
||
abstract class FileTable( | ||
sparkSession: SparkSession, | ||
options: DataSourceOptions, | ||
options: CaseInsensitiveStringMap, | ||
paths: Seq[String], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @cloud-fan .
Should we change FileTable
signature to accept paths
additionally for merging DataSourceOptions
and CaseInsensitiveStringMap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not a big deal. I did this because we need paths in the OrcDataSourceV2
as well, so we can calculate the paths only once in the OrcDataSourceV2
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks!
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala
Show resolved
Hide resolved
} | ||
val checkFilesExistsOption = DataSourceOptions.CHECK_FILES_EXIST_KEY -> "true" | ||
// TODO: remove this option. | ||
val checkFilesExistsOption = "check_files_exist" -> "true" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you file a JIRA and make this as an IDed TODO please?
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileTable.scala
Show resolved
Hide resolved
@@ -306,8 +307,8 @@ class StreamingDataSourceV2Suite extends StreamTest { | |||
testPositiveCaseWithQuery(readSource, writeSource, trigger) { _ => | |||
eventually(timeout(streamingTimeout)) { | |||
// Write options should not be set. | |||
assert(LastWriteOptions.options.getBoolean(readOptionName, false) == false) | |||
assert(LastReadOptions.options.getBoolean(readOptionName, false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR adds CaseInsensitiveStringMap.getBoolean
, we don't need to change line 310.
@@ -317,8 +318,8 @@ class StreamingDataSourceV2Suite extends StreamTest { | |||
testPositiveCaseWithQuery(readSource, writeSource, trigger) { _ => | |||
eventually(timeout(streamingTimeout)) { | |||
// Read options should not be set. | |||
assert(LastReadOptions.options.getBoolean(writeOptionName, false) == false) | |||
assert(LastWriteOptions.options.getBoolean(writeOptionName, false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for line 321.
Test build #103216 has finished for PR 24025 at commit
|
One goal is to remove these pre-defined option keys, as the options should just be a general string-to-string map. I don't think it's a good idea to keep both |
Test build #103254 has finished for PR 24025 at commit
|
retest this please |
Test build #103257 has finished for PR 24025 at commit
|
Test build #4601 has started for PR 24025 at commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally looks good to me as a cleanup
Retest this please. |
Test build #103272 has finished for PR 24025 at commit
|
Retest this please. |
throw new AnalysisException("Set a port to read from with option(\"port\", ...).") | ||
} | ||
Try { | ||
params.get("includeTimestamp").orElse("false").toBoolean | ||
params.getBoolean("includeTimestamp", false) | ||
} match { | ||
case Success(_) => | ||
case Failure(_) => | ||
throw new AnalysisException("includeTimestamp must be set to either \"true\" or \"false\"") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @cloud-fan .
It seems that we need to change this Try
logic. For invalid values like fasle
,
- Previously,
IllegalArgumentException
is thrown by ScalaStringLike.parseBoolean
- Now, Java
Boolean.parseBoolean
returnsfalse
without exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
Test build #103284 has finished for PR 24025 at commit
|
Test build #103293 has finished for PR 24025 at commit
|
Test build #103359 has finished for PR 24025 at commit
|
retest this please |
|
||
/** | ||
* Returns the integer value to which the specified key is mapped, | ||
* or defaultValue if there is no mapping for the key. The key match is case-insensitive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add .
at the end of line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's too minor to trigger another QA round. I'll fix it in another PR if the current QA round passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I search all the java/scala/markdown files and there is no DataSourceOptions now.
Test build #103366 has finished for PR 24025 at commit
|
Oh, it's weird. So far, there is no successful Jenkins run in this PR. |
Retest this please. |
Test build #103376 has finished for PR 24025 at commit
|
retest this please |
Hi, @cloud-fan . Could you check the test failure at
|
Test build #103408 has finished for PR 24025 at commit
|
Test build #103404 has finished for PR 24025 at commit
|
retest this please |
@@ -155,7 +155,7 @@ class RateStreamMicroBatchStream( | |||
|
|||
override def toString: String = s"RateStreamV2[rowsPerSecond=$rowsPerSecond, " + | |||
s"rampUpTimeSeconds=$rampUpTimeSeconds, " + | |||
s"numPartitions=${options.get(NUM_PARTITIONS).orElse("default")}" | |||
s"numPartitions=${Option(options.get(NUM_PARTITIONS)).getOrElse("default")}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: options.getOrDefault(NUM_PARTITIONS, "default")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too minor to update...Hopefully this time all tests are passed.
Test build #103423 has finished for PR 24025 at commit
|
Option(map.get("paths")).map { pathStr => | ||
objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq | ||
}.orElse(Option(map.get("path")).map(Seq(_))).getOrElse { | ||
throw new IllegalArgumentException("'path' must be given when reading files.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 'path' or 'paths' must be ...
@@ -44,7 +44,7 @@ trait FileDataSourceV2 extends TableProvider with DataSourceRegister { | |||
Option(map.get("paths")).map { pathStr => | |||
objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq | |||
}.orElse(Option(map.get("path")).map(Seq(_))).getOrElse { | |||
throw new IllegalArgumentException("'path' must be given when reading files.") | |||
Nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protected def getPaths(map: CaseInsensitiveStringMap): Seq[String] = {
Option(map.get("paths")).map { pathStr =>
val objectMapper = new ObjectMapper()
objectMapper.readValue(pathStr, classOf[Array[String]]).toSeq
}.getOrElse {
Option(map.get("path")).toSeq
}
}
Test build #103433 has finished for PR 24025 at commit
|
Test build #103435 has finished for PR 24025 at commit
|
thanks, merging to master! |
Thanks for working on this, @cloud-fan! |
It's a little awkward to have 2 different classes(`CaseInsensitiveStringMap` and `DataSourceOptions`) to present the options in data source and catalog API. This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise. existing tests Closes apache#24025 from cloud-fan/option. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
It's a little awkward to have 2 different classes(
CaseInsensitiveStringMap
andDataSourceOptions
) to present the options in data source and catalog API.This PR merges these 2 classes, while keeping the name
CaseInsensitiveStringMap
, which is more precise.How was this patch tested?
existing tests