Roundtrip null values of any type #147
@@ -110,6 +110,14 @@ class DefaultSource
} else {
  throw new Exception("Ignore white space flag can be true or false")
}
val treatEmptyValuesAsNulls = parameters.getOrElse("treatEmptyValuesAsNulls", "false")
val treatEmptyValuesAsNullsFlag = if(treatEmptyValuesAsNulls == "false") {
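The hunk is cut off here by the page extraction. Based on the whitespace-flag handling just above it, the continuation presumably validates the string the same way; a sketch, not necessarily the PR's exact code:

```scala
// Hypothetical continuation, mirroring the ignore-whitespace flag parsing
// above; the exact error message in the PR may differ.
val treatEmptyValuesAsNullsFlag = if(treatEmptyValuesAsNulls == "false") {
  false
} else if(treatEmptyValuesAsNulls == "true") {
  true
} else {
  throw new Exception("Treat empty values as null flag can be true or false")
}
```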
Style nit: if (treatEmptyValuesAsNulls ...
Leave a space between if and the parenthesis, here and everywhere else in the PR.
ah, yes. I prefer the same, but I was keeping it consistent with the rest of the file's style. Would you prefer I add the space here, add it everywhere else in the file as well, or leave it unchanged?
yes, we just merged a PR that unifies the style across the code and adds Scala style checks.
@@ -35,8 +35,9 @@ object TypeCast {
 * @param datum string value
 * @param castType SparkSQL type
 */
private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean = true): Any = {
  if (datum == "" && nullable && !castType.isInstanceOf[StringType]){
private[csv] def castTo(datum: String, castType: DataType, nullable: Boolean = true,
style:

castTo(
    datum: String,
    castType: DataType,
    nullable: Boolean = true,
    treatEmptyValuesAsNulls: Boolean = false)
Here, I think the binary incompatibility problem is that we don't have something like Spark's MiMa exclude generator to handle the private[csv] excludes.
@andy327 left a few comments. Thanks for submitting this. Would you please rebase it?
I brought the branch up-to-date with commits leading up to yesterday. Tests pass, but the Travis build fails for other reasons (Binary compatibility check failed).
Here's the MiMa error:
[error] * synthetic method tsvFile$default$6()java.lang.String in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: java.lang.String, is now: Boolean
[error] filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile$default$6")
[error] * synthetic method csvFile$default$11()java.lang.String in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: java.lang.String, is now: Boolean
[error] filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.csvFile$default$11")
[error] * method tsvFile(java.lang.String,Boolean,java.lang.String,Boolean,Boolean,java.lang.String,Boolean)org.apache.spark.sql.DataFrame in class com.databricks.spark.csv.package#CsvContext does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile")
[error] * method csvFile(java.lang.String,Boolean,Char,Char,java.lang.Character,java.lang.Character,java.lang.String,java.lang.String,Boolean,Boolean,java.lang.String,Boolean)org.apache.spark.sql.DataFrame in class com.databricks.spark.csv.package#CsvContext does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.package#CsvContext.csvFile")
[error] * synthetic method tsvFile$default$7()Boolean in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: Boolean, is now: java.lang.String
[error] filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.tsvFile$default$7")
[error] * synthetic method csvFile$default$12()Boolean in class com.databricks.spark.csv.package#CsvContext has now a different result type; was: Boolean, is now: java.lang.String
[error] filter with: ProblemFilters.exclude[IncompatibleResultTypeProblem]("com.databricks.spark.csv.package#CsvContext.csvFile$default$12")
[error] * method castTo(java.lang.String,org.apache.spark.sql.types.DataType,Boolean)java.lang.Object in object com.databricks.spark.csv.util.TypeCast does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingMethodProblem]("com.databricks.spark.csv.util.TypeCast.castTo") Will comment inline RE: these errors. |
@@ -38,6 +38,7 @@ package object csv {
    parserLib: String = "COMMONS",
    ignoreLeadingWhiteSpace: Boolean = false,
    ignoreTrailingWhiteSpace: Boolean = false,
    treatEmptyValuesAsNulls: Boolean = false,
The MiMa check failed because the addition of a new argument to this method changes its Java method signature and also risks source incompatibility.
@falaki, for this reason I think that, in retrospect, it would have been better to go with a builder pattern here instead of a big Scala method with default arguments.
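For illustration, a rough sketch of the builder shape being suggested (the names are hypothetical; spark-csv's CsvParser follows this pattern). Adding a new withX setter later is binary compatible, whereas adding a default argument changes the compiled method signature:

```scala
// Hypothetical builder sketch: each setter returns this, so new options
// can be appended over time without breaking existing callers.
class CsvReaderBuilder {
  private var useHeader: Boolean = false
  private var treatEmptyValuesAsNulls: Boolean = false

  def withUseHeader(flag: Boolean): CsvReaderBuilder = {
    useHeader = flag
    this
  }

  def withTreatEmptyValuesAsNulls(flag: Boolean): CsvReaderBuilder = {
    treatEmptyValuesAsNulls = flag
    this
  }
}
```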
This is an auxiliary method that is just for convenience. Users are encouraged to use the builder pattern. For that reason, and to keep MiMa compatibility, @andy327 please remove the new option from csvFile. In the future we may completely remove this method.
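For example, with the CsvParser builder this PR extends (assuming the setter ends up named withTreatEmptyValuesAsNulls, matching the option, and that a sqlContext is in scope):

```scala
import com.databricks.spark.csv.CsvParser

// Builder-based read path; unaffected by the csvFile default-argument issue.
val df = new CsvParser()
  .withUseHeader(true)
  .withTreatEmptyValuesAsNulls(true)
  .csvFile(sqlContext, "cars.csv")
```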
…s to keep MiMa compatibility.
LGTM modulo style and MiMa check fixes.
Unfortunately, there is no way to apply the features in this PR without modifying the signature of castTo in TypeCast.scala. One option would be to add a castTo helper method that keeps the original three-parameter signature. Otherwise this PR would require a change to the MiMa checks. Let me know which you would prefer. Also, are there any other style fixes left?
I think it's fine to modify that signature in TypeCast.scala given that it's an internal API. The problem here is that the MiMa checks, as currently implemented, aren't properly respecting private[csv] visibility.
Thanks. Merging this now.
This pull request adds functionality to spark-csv with the goal of being able to write null values to file and read them back out again as null. Two changes were made to enable this.

First, the com.databricks.spark.csv package previously hardcoded the null string to "null" when saving to a CSV file. This was changed to read the null token out of the passed-in parameters map, from the value for "nullToken", which makes it possible to write null values as empty strings. The default remains "null" to preserve the previous behavior of the library.

Second, the castTo method from com.databricks.spark.csv.util.TypeCast had an impossible-to-reach case statement when the castType was an instance of StringType. As a result, it was not possible to read string values from file as null. This pull request adds a setting, treatEmptyValuesAsNulls, that allows empty string values in fields marked as nullable to be read as null values, as expected. Again, the previous behavior is the default, so this pull request only changes behavior when treatEmptyValuesAsNulls is explicitly set to true. The appropriate changes to CsvParser and CsvRelation were made to include this new setting.

Additionally, a unit test has been added to CsvSuite to test the ability to round-trip null values (both string and non-string) by writing nulls and reading them back out again as nulls.
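A minimal usage sketch of the round trip described above (the DataFrame df, the path, and the sqlContext are placeholders; the option names follow this PR):

```scala
import com.databricks.spark.csv._

// Write nulls as empty strings by overriding the default "null" token.
df.saveAsCsvFile("/tmp/out.csv", Map("header" -> "true", "nullToken" -> ""))

// Read the empty fields back as real nulls in nullable columns.
val roundTripped = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .load("/tmp/out.csv")
```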