[SPARK-26456][SQL] Cast date/timestamp to string by Date/TimestampFormatter #23391
Conversation
Test build #100479 has finished for PR 23391 at commit
Test build #100480 has finished for PR 23391 at commit
…ormat # Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala
Test build #100494 has finished for PR 23391 at commit
Test build #100501 has finished for PR 23391 at commit
Test build #100499 has finished for PR 23391 at commit
@cloud-fan @srowen @dongjoon-hyun @HyukjinKwon May I ask you to review this PR?
Test build #100510 has finished for PR 23391 at commit
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
@@ -88,11 +88,18 @@ class LegacyFallbackDateFormatter(
}

object DateFormatter {
  val defaultPattern: String = "yyyy-MM-dd"
  val defaultLocale: Locale = Locale.US

  def apply(format: String, locale: Locale): DateFormatter = {
    if (SQLConf.get.legacyTimeParserEnabled) {
When would you need this? You could argue that since we are moving to Spark 3.0 we don't need to care as much about legacy.
In this PR, date and timestamp patterns are fixed, and we shouldn't see any behavior changes. But DateFormatter/TimestampFormatter are used from the CSV/JSON datasources and from functions where users can set any pattern. Unfortunately, the patterns supported by SimpleDateFormat and DateTimeFormat are not exactly the same. There are also other differences in their behavior: https://github.com/apache/spark/pull/23358/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9R42
What I have learned from other PRs: if I introduce a behavior change, I should leave users an opportunity to come back to the previous behavior. Later, the old behavior can be deprecated.
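One concrete difference between the two JVM parsers mentioned above: the legacy `java.text.SimpleDateFormat` is lenient by default and silently rolls invalid field values over, while `java.time.format.DateTimeFormatter` rejects them at parse time. A minimal sketch (illustrative only, not Spark code):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LenienceDemo {
    public static void main(String[] args) throws ParseException {
        // Legacy parser: lenient by default, so month 31 is rolled
        // over instead of rejected (the result lands in July 2017).
        SimpleDateFormat legacy = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(legacy.parse("2015-31-12"));

        // java.time rejects the same input up front.
        DateTimeFormatter modern = DateTimeFormatter.ofPattern("yyyy-MM-dd");
        try {
            LocalDate.parse("2015-31-12", modern);
        } catch (DateTimeParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is exactly the kind of silent behavior change a legacy flag is meant to guard: inputs that previously "parsed" start failing after the switch.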
I am just saying that we are going to break stuff anyway. If the legacy behavior is somewhat unreasonable, then we should consider not supporting it.
Getting rid of the flag is slightly out of this PR's scope, I believe. I would prefer to open a ticket and leave that to somebody much braver.
The ticket to remove the flag: https://issues.apache.org/jira/browse/SPARK-26503
For clarification, I don't think we should treat the previous behaviour as unreasonable ... I am okay with considering removing that legacy configuration, given that we're going ahead with 3.0, it causes some maintenance overhead, and it blocks some features.
Also, for clarification, it is kind of a breaking change. Think about CSV code that depended on columns being inferred as timestamps, which suddenly become strings after an upgrade. This behaviour was even documented in 2.x (by referring to SimpleDateFormat).
@@ -111,6 +110,9 @@ class QueryExecution(
  protected def stringOrError[A](f: => A): String =
    try f.toString catch { case e: AnalysisException => e.toString }

  private val dateFormatter = DateFormatter()
We should probably get rid of the hiveResultString method for 3.0. It does not make much sense to keep it in there.
Should I create a separate JIRA for that?
Yes please. We should just move that into a test class.
Opened JIRA ticket: https://issues.apache.org/jira/browse/SPARK-26502
def apply(format: String): DateFormatter = apply(format, defaultLocale)

def apply(): DateFormatter = apply(defaultPattern)
Both formatters seem to use thread-safe implementations. You could consider just returning cached instances here.
At the moment, both formatters are created per partition at most, not per row. Do you think it makes sense to cache them?
Ok, let's leave it for now.
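For context on the caching suggestion above: `java.time.format.DateTimeFormatter` is immutable and thread-safe, so a single instance per (pattern, locale) can be shared freely. A minimal cache sketch (names are hypothetical, not Spark's actual implementation):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache: one shared DateTimeFormatter per (pattern, locale).
// Safe because DateTimeFormatter instances are immutable and thread-safe,
// unlike the legacy SimpleDateFormat, which must not be shared.
public class FormatterCache {
    private static final Map<String, DateTimeFormatter> CACHE = new ConcurrentHashMap<>();

    public static DateTimeFormatter get(String pattern, Locale locale) {
        return CACHE.computeIfAbsent(pattern + "|" + locale,
            k -> DateTimeFormatter.ofPattern(pattern, locale));
    }

    public static void main(String[] args) {
        DateTimeFormatter f1 = get("yyyy-MM-dd", Locale.US);
        DateTimeFormatter f2 = get("yyyy-MM-dd", Locale.US);
        System.out.println(f1 == f2);  // prints "true": same cached instance
        System.out.println(LocalDate.of(2018, 12, 25).format(f1));  // prints "2018-12-25"
    }
}
```

Since the formatters here are created once per partition rather than per row, the win from such a cache is modest, which matches the "leave it for now" conclusion.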
Test build #100516 has finished for PR 23391 at commit
…ormat # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala
Test build #100687 has finished for PR 23391 at commit
…ormat # Conflicts: # sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala
Test build #100880 has finished for PR 23391 at commit
jenkins, retest this, please
Test build #100885 has finished for PR 23391 at commit
@srowen @HyukjinKwon @cloud-fan Could you look at the PR one more time, please?
I confess that you know this change much better than I do, but from reviewing previous PRs and my understanding of the issues, this looks good.
## What changes were proposed in this pull request?
Per discussion in #23391 (comment) this proposes to just remove the old pre-Spark-3 time parsing behavior. This is a rebase of #23411
## How was this patch tested?
Existing tests.
Closes #23495 from srowen/SPARK-26503.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
Test build #101098 has finished for PR 23391 at commit
Let's also update the PR description.
sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala
LGTM
Test build #101168 has finished for PR 23391 at commit
jenkins, retest this, please
Looks fine to me too.
Test build #101177 has finished for PR 23391 at commit
thanks, merging to master!
…matter
## What changes were proposed in this pull request?
In the PR, I propose to switch on `TimestampFormatter`/`DateFormatter` in casting dates/timestamps to strings. The changes should make the date/timestamp casting consistent to JSON/CSV datasources and time-related functions like `to_date`, `to_unix_timestamp`/`from_unixtime`. Local formatters are moved out from `DateTimeUtils` to where they are actually used. It allows to avoid re-creation of new formatter instance per-each call. Another reason is to have separate parser for `PartitioningUtils` because default parsing pattern cannot be used (expected optional section `[.S]`).
## How was this patch tested?
It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.
Closes apache#23391 from MaxGekk/thread-local-date-format.
Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
case (t: Timestamp, TimestampType) =>
  val timeZone = DateTimeUtils.getTimeZone(SQLConf.get.sessionLocalTimeZone)
@MaxGekk @cloud-fan By moving the retrieval of the time zone from here to a lazy val in the object, it will be initialized only once, by the first session that uses it. Another session with a different sessionLocalTimeZone set will get results in the wrong time zone.
@juliuszsompolski Thank you for the bug report. I will fix the issue. I think it is OK to create the formatters in place because they can be pulled from caches.
good point!
Here is the PR #28024
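The bug reported above can be reproduced outside Spark. A minimal Java sketch (all names illustrative, not Spark's actual code): a lazily initialized static captures the time zone of whichever "session" touches it first, so later sessions silently reuse the wrong zone; the fix is to re-read the session setting on every call.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class LazyZoneBug {
    // Stands in for SQLConf.get.sessionLocalTimeZone.
    static String sessionLocalTimeZone = "UTC";

    // Buggy: the zone is frozen at first use.
    static ZoneId cachedZone;
    static ZoneId buggyZone() {
        if (cachedZone == null) cachedZone = ZoneId.of(sessionLocalTimeZone);
        return cachedZone;
    }

    // Fixed: re-read the session config on every call.
    static ZoneId freshZone() {
        return ZoneId.of(sessionLocalTimeZone);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2019-01-01T00:00:00Z");
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        System.out.println(fmt.format(t.atZone(buggyZone())));  // 2019-01-01 00:00:00 (UTC)

        sessionLocalTimeZone = "America/Los_Angeles";           // "another session"
        System.out.println(fmt.format(t.atZone(buggyZone())));  // still UTC: wrong
        System.out.println(fmt.format(t.atZone(freshZone())));  // 2018-12-31 16:00:00
    }
}
```

Creating the formatter per call is cheap enough here precisely because, as noted in the reply, the instances can be pulled from caches.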
What changes were proposed in this pull request?
In the PR, I propose to switch on `TimestampFormatter`/`DateFormatter` in casting dates/timestamps to strings. The changes should make the date/timestamp casting consistent with the JSON/CSV datasources and time-related functions like `to_date`, `to_unix_timestamp`/`from_unixtime`.
Local formatters are moved out from `DateTimeUtils` to where they are actually used. This allows avoiding re-creation of a new formatter instance per each call. Another reason is to have a separate parser for `PartitioningUtils`, because the default parsing pattern cannot be used (an optional section `[.S]` is expected).
How was this patch tested?
It was tested by `DateTimeUtilsSuite`, `CastSuite` and `JDBC*Suite`.
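The optional section mentioned in the description is a `java.time` pattern feature: a bracketed fragment like `[.SSS]` matches input both with and without that part, which a single legacy `SimpleDateFormat` pattern cannot express. A small sketch (illustrative, using `[.SSS]` rather than the elided `[.S]` from the description):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class OptionalSection {
    public static void main(String[] args) {
        // "[.SSS]" makes the fractional-seconds part optional:
        // both inputs below parse with the same formatter.
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss[.SSS]");
        System.out.println(LocalDateTime.parse("2019-01-01 12:00:00", f));
        System.out.println(LocalDateTime.parse("2019-01-01 12:00:00.123", f));
    }
}
```

This is why partition values need their own parser: their timestamps may or may not carry a fractional part, and one pattern with an optional section covers both.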