[SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter #23391
Conversation
Test build #100479 has finished for PR 23391 at commit
Test build #100480 has finished for PR 23391 at commit
…ormat # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala
Test build #100494 has finished for PR 23391 at commit
Test build #100501 has finished for PR 23391 at commit
Test build #100499 has finished for PR 23391 at commit
@cloud-fan @srowen @dongjoon-hyun @HyukjinKwon May I ask you to review this PR?
Test build #100510 has finished for PR 23391 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -88,11 +88,18 @@ class LegacyFallbackDateFormatter(
}

object DateFormatter {
  val defaultPattern: String = "yyyy-MM-dd"
  val defaultLocale: Locale = Locale.US

  def apply(format: String, locale: Locale): DateFormatter = {
    if (SQLConf.get.legacyTimeParserEnabled) {
When would you need this? You could argue that since we are moving to Spark 3.0 we don't need to care as much about legacy.
In this PR, date and timestamp patterns are fixed, and we shouldn't see any behavior changes. But DateFormatter/TimestampFormatter are used from the CSV/JSON datasources and from functions where users can set any pattern. Unfortunately, the patterns supported by SimpleDateFormat and DateTimeFormat are not exactly the same. There are also other differences in their behavior: https://github.com/apache/spark/pull/23358/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9R42
What I have learned from other PRs: if I introduce a behavior change, I should leave users an opportunity to come back to the previous behavior. Later, the old behavior can be deprecated.
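One concrete difference between the two JVM parsers mentioned above: the legacy `java.text.SimpleDateFormat` is lenient by default and silently rolls invalid field values over, while `java.time.format.DateTimeFormatter` rejects them at parse time. A minimal sketch (illustrative only, not Spark code):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LenienceDemo {
    public static void main(String[] args) throws ParseException {
        // Legacy parser: lenient by default, so month 31 is rolled
        // over instead of rejected (the result lands in July 2017).
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(legacy.parse("2015-31-12"));

        // java.time rejects the same input up front.
        DateTimeFormatter modern = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        try {
            LocalDate.parse("2015-31-12", modern);
        } catch (DateTimeParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is exactly the kind of silent behavior change a legacy flag is meant to guard: inputs that previously "parsed" start failing after the switch.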
I am just saying that we are going to break stuff anyway. If the legacy behavior is somewhat unreasonable, then we should consider not supporting it.
Getting rid of the flag is slightly out of this PR's scope, I believe. I would prefer to open a ticket and leave that to somebody much braver.
The ticket to remove the flag: https://issues.apache.org/jira/browse/SPARK-26503
For clarification, I don't think we should treat the previous behaviour as unreasonable ... I am okay with considering removing that legacy configuration, given that we're going ahead with 3.0, it causes some maintenance overhead, and it blocks some features.
Also, for clarification, it is kind of a breaking change. Think about CSV code that depended on columns being inferred as timestamps, which suddenly become strings after an upgrade. This behaviour was even documented in 2.x (by referring to SimpleDateFormat).
@@ -111,6 +110,9 @@ class QueryExecution(
  protected def stringOrError[A](f: => A): String =
    try f.toString catch { case e: AnalysisException => e.toString }

  private val dateFormatter = DateFormatter()
We should probably get rid of the hiveResultString method for 3.0. It does not make much sense to keep it in there.
Should I create a separate JIRA for that?
Yes please. We should just move that into a test class.
Opened JIRA ticket: https://issues.apache.org/jira/browse/SPARK-26502
def apply(format: String): DateFormatter = apply(format, defaultLocale)

def apply(): DateFormatter = apply(defaultPattern)
Both formatters seem to use thread-safe implementations. You could consider just returning cached instances here.
At the moment, both formatters are created per partition at most, not per row. Do you think it makes sense to cache them?
Ok, let's leave it for now.
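For context on the caching suggestion above: `java.time.format.DateTimeFormatter` is immutable and thread-safe, so a single instance per (pattern, locale) can be shared freely. A minimal cache sketch (names are hypothetical, not Spark's actual implementation):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache: one shared DateTimeFormatter per (pattern, locale).
// Safe because DateTimeFormatter instances are immutable and thread-safe,
// unlike the legacy SimpleDateFormat, which must not be shared.
public class FormatterCache {
    private static final Map<String, DateTimeFormatter> CACHE = new ConcurrentHashMap<>();

    public static DateTimeFormatter get(String pattern, Locale locale) {
        return CACHE.computeIfAbsent(pattern + "|" + locale,
            k -> DateTimeFormatter.ofPattern(pattern, locale));
    }

    public static void main(String[] args) {
        DateTimeFormatter f1 = get("yyyy-MM-dd", Locale.US);
        DateTimeFormatter f2 = get("yyyy-MM-dd", Locale.US);
        System.out.println(f1 == f2);  // prints "true": same cached instance
        System.out.println(LocalDate.of(2018, 12, 25).format(f1));  // prints "2018-12-25"
    }
}
```

Since the formatters here are created once per partition rather than per row, the win from such a cache is modest, which matches the "leave it for now" conclusion.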
Test build #100516 has finished for PR 23391 at commit
…ormat # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
Test build #100687 has finished for PR 23391 at commit
…ormat # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
Test build #100880 has finished for PR 23391 at commit
jenkins, retest this, please
Test build #100885 has finished for PR 23391 at commit
@srowen @HyukjinKwon @cloud-fan Could you look at the PR one more time, please?
I confess that you know this change much better than I do, but from reviewing previous PRs and my understanding of the issues, this looks good.
## What changes were proposed in this pull request?
Per discussion in #23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior. This is a rebase of #23411
## How was this patch tested?
Existing tests.
Closes #23495 from srowen/SPARK-26503.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Test build #101098 has finished for PR 23391 at commit
Let's also update the PR description.
sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala
LGTM
Test build #101168 has finished for PR 23391 at commit
jenkins, retest this, please
Looks fine to me too.
Test build #101177 has finished for PR 23391 at commit
thanks, merging to master!
…matter
## What changes were proposed in this pull request?
In the PR, I propose to switch on `TimestampFormatter`/`DateFormatter` in casting dates/timestamps to strings. The changes should make the date/timestamp casting consistent to JSON/CSV datasources and time-related functions like `to_date`, `to_unix_timestamp`/`from_unixtime`. Local formatters are moved out from `DateTimeUtils` to where they are actually used. It allows to avoid re-creation of new formatter instance per-each call. Another reason is to have separate parser for `PartitioningUtils` because default parsing pattern cannot be used (expected optional section `[.S]`).
## How was this patch tested?
It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.
Closes apache#23391 from MaxGekk/thread-local-date-format.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
case (t: Timestamp, TimestampType) =>
  val timeZone = DateTimeUtils.getTimeZone(SQLConf.get.sessionLocalTimeZone)
@MaxGekk @cloud-fan By moving the retrieval of the time zone from here to a lazy val in the object, it will be initialized only once, by the first session that uses it. Another session with a different sessionLocalTimeZone set will get results in the wrong time zone.
@juliuszsompolski Thank you for the bug report. I will fix the issue. I think it is OK to create the formatters in place because they can be pulled from caches.
good point!
Here is the PR #28024
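The bug reported above can be reproduced outside Spark. A minimal Java sketch (all names illustrative, not Spark's actual code): a lazily initialized static captures the time zone of whichever "session" touches it first, so later sessions silently reuse the wrong zone; the fix is to re-read the session setting on every call.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class LazyZoneBug {
    // Stands in for SQLConf.get.sessionLocalTimeZone.
    static String sessionLocalTimeZone = "UTC";

    // Buggy: the zone is frozen at first use.
    static ZoneId cachedZone;
    static ZoneId buggyZone() {
        if (cachedZone == null) cachedZone = ZoneId.of(sessionLocalTimeZone);
        return cachedZone;
    }

    // Fixed: re-read the session config on every call.
    static ZoneId freshZone() {
        return ZoneId.of(sessionLocalTimeZone);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2019-01-01T00:00:00Z");
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        System.out.println(fmt.format(t.atZone(buggyZone())));  // 2019-01-01 00:00:00 (UTC)

        sessionLocalTimeZone = "America/Los_Angeles";           // "another session"
        System.out.println(fmt.format(t.atZone(buggyZone())));  // still UTC: wrong
        System.out.println(fmt.format(t.atZone(freshZone())));  // 2018-12-31 16:00:00
    }
}
```

Creating the formatter per call is cheap enough here precisely because, as noted in the reply, the instances can be pulled from caches.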
What changes were proposed in this pull request?
In the PR, I propose to switch on `TimestampFormatter`/`DateFormatter` in casting dates/timestamps to strings. The changes should make the date/timestamp casting consistent with the JSON/CSV datasources and time-related functions like `to_date`, `to_unix_timestamp`/`from_unixtime`.
Local formatters are moved out from `DateTimeUtils` to where they are actually used. This allows avoiding re-creation of a new formatter instance per each call. Another reason is to have a separate parser for `PartitioningUtils`, because the default parsing pattern cannot be used (an optional section `[.S]` is expected).
How was this patch tested?
It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.
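The optional section mentioned in the description is a `java.time` pattern feature: a bracketed fragment like `[.SSS]` matches input both with and without that part, which a single legacy `SimpleDateFormat` pattern cannot express. A small sketch (illustrative, using `[.SSS]` rather than the elided `[.S]` from the description):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class OptionalSection {
    public static void main(String[] args) {
        // "[.SSS]" makes the fractional-seconds part optional:
        // both inputs below parse with the same formatter.
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSS]");
        System.out.println(LocalDateTime.parse("2019-01-01 12:00:00", f));
        System.out.println(LocalDateTime.parse("2019-01-01 12:00:00.123", f));
    }
}
```

This is why partition values need their own parser: their timestamps may or may not carry a fractional part, and one pattern with an optional section covers both.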