Allow partial date parsing when simple datetime formatter is used (facebookincubator#11386)

Summary:
The Spark legacy datetime formatter can parse a date from only the leading part of the input text,
because the underlying JDK `DateFormat.parse` may stop before the end of the string (see the [JDK 8 source](https://github.com/openjdk/jdk8/blob/master/jdk/src/share/classes/java/text/DateFormat.java#L351)). This PR enables the same partial date parsing when the `LENIENT_SIMPLE`
or `STRICT_SIMPLE` datetime formatter is used.
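
To illustrate the behavior change, here is a minimal, self-contained sketch; it is not the Velox implementation. `FormatterKind`, `YearMonth` and `parseYearMonth` are hypothetical names, and the hard-coded `yyyy-MM` pattern is an assumption made to keep the example short. It models only the final check: the two simple formatter kinds accept a matched prefix, while the Joda-style path still requires the whole input to be consumed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical stand-in for the formatter types named in this PR.
enum class FormatterKind { kJoda, kLenientSimple, kStrictSimple };

struct YearMonth {
  int year = 0;
  int month = 0;
};

// Parses input against a fixed "yyyy-MM" pattern (an assumption of this sketch).
// Returns std::nullopt on a parse failure.
std::optional<YearMonth> parseYearMonth(
    std::string_view input,
    FormatterKind kind) {
  std::size_t cur = 0;

  // Reads exactly `count` decimal digits starting at `cur`.
  auto readDigits = [&](std::size_t count, int& out) {
    if (cur + count > input.size()) {
      return false;
    }
    int value = 0;
    for (std::size_t i = 0; i < count; ++i) {
      const char c = input[cur + i];
      if (!std::isdigit(static_cast<unsigned char>(c))) {
        return false;
      }
      value = value * 10 + (c - '0');
    }
    cur += count;
    out = value;
    return true;
  };

  YearMonth result;
  if (!readDigits(4, result.year)) {
    return std::nullopt;
  }
  if (cur >= input.size() || input[cur] != '-') {
    return std::nullopt;
  }
  ++cur;
  if (!readDigits(2, result.month) || result.month < 1 || result.month > 12) {
    return std::nullopt;
  }

  // Only non-simple formatters require that all input was consumed.
  const bool isSimple = kind == FormatterKind::kLenientSimple ||
      kind == FormatterKind::kStrictSimple;
  if (!isSimple && cur < input.size()) {
    return std::nullopt;
  }
  return result;
}
```

With this sketch, `parseYearMonth("2015-07-22 10:00:00", FormatterKind::kLenientSimple)` yields a value because the trailing `-22 10:00:00` is ignored, whereas the same call with `FormatterKind::kJoda` yields `std::nullopt`.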

Related issues: facebookincubator#10354, [gluten#6227](apache/incubator-gluten#6227)

Pull Request resolved: facebookincubator#11386

Reviewed By: pedroerp

Differential Revision: D65948039

Pulled By: Yuhta

fbshipit-source-id: 0d17084f723ebeaded7278178982b5a10d9f9fed
NEUpanning authored and athmaja-n committed Jan 10, 2025
1 parent 1947e4d commit d7a7124
Showing 3 changed files with 47 additions and 18 deletions.
51 changes: 35 additions & 16 deletions velox/docs/functions/spark/datetime.rst
@@ -82,12 +82,9 @@ These functions support TIMESTAMP and DATE input types.
 Adjusts ``unixTime`` (elapsed seconds since UNIX epoch) to configured session timezone, then
 converts it to a formatted time string according to ``format``. Only supports BIGINT type for
-``unixTime``. Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
-date formatter in lenient mode that is align with Spark legacy date parser behavior or
-`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
+``unixTime``.
 `Valid patterns for date format
-<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. When `Simple` date formatter is used,
-null is returned for invalid ``format``; otherwise, exception is thrown. This function will convert input to
+<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. This function will convert input to
 milliseconds, and integer overflow is allowed in the conversion, which aligns with Spark. See the below third
 example where INT64_MAX is used, -1000 milliseconds are produced by INT64_MAX * 1000 due to integer overflow. ::

@@ -112,17 +109,11 @@ These functions support TIMESTAMP and DATE input types.
 Returns timestamp by parsing ``string`` according to the specified ``dateFormat``.
 The format follows Spark's
 `Datetime patterns
-<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.
-Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
-date formatter in lenient mode that is align with Spark legacy date parser behavior or
-`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
-Returns NULL for parsing error or NULL input. When `Simple` date formatter is used, null is returned for invalid
-``dateFormat``; otherwise, exception is thrown. ::
+<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. ::

 SELECT get_timestamp('1970-01-01', 'yyyy-MM-dd); -- timestamp `1970-01-01`
 SELECT get_timestamp('1970-01-01', 'yyyy-MM'); -- NULL (parsing error)
 SELECT get_timestamp('1970-01-01', null); -- NULL
-SELECT get_timestamp('2020-06-10', 'A'); -- (throws exception)

 .. spark:function:: hour(timestamp) -> integer
@@ -291,10 +282,7 @@ These functions support TIMESTAMP and DATE input types.

 .. spark:function:: unix_timestamp() -> integer
-Returns the current UNIX timestamp in seconds. Using
-`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_ date formatter in lenient mode
-that is align with Spark legacy date parser behavior or `Joda <https://www.joda.org/joda-time/>`_ date formatter
-depends on the ``spark.legacy_date_formatter`` configuration.
+Returns the current UNIX timestamp in seconds.

 .. spark:function:: unix_timestamp(string) -> integer
 :noindex:
@@ -337,3 +325,34 @@ These functions support TIMESTAMP and DATE input types.
 part of the 53rd week of year 2004, so the result is 2004. Only supports DATE type.

 SELECT year_of_week('2005-01-02'); -- 2004
+
+Simple vs. Joda Date Formatter
+------------------------------
+
+To align with Spark, Velox supports both `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
+and `Joda <https://www.joda.org/joda-time/>`_ date formmaters to parse/format timestamp/date strings
+used in functions :spark:func:`from_unixtime`, :spark:func:`unix_timestamp`, :spark:func:`make_date`
+and :spark:func:`to_unix_timestamp`.
+If the configuration setting :doc:`spark.legacy_date_formatter <../../configs>` is true,
+`Simple` date formmater in lenient mode is used; otherwise, `Joda` is used. It is important
+to note that there are some different behaviors between these two formatters.
+
+For :spark:func:`unix_timestamp` and :spark:func:`get_timestamp`, the `Simple` date formatter permits partial date parsing
+which means that format can match only a part of input string. For example, if input string is
+2015-07-22 10:00:00, it can be parsed using format is yyyy-MM-dd because the parser does not require entire
+input to be consumed. In contrast, the `Joda` date formatter performs strict checks to ensure that the
+format completely matches the string. If there is any mismatch, exception is thrown. ::
+
+SELECT get_timestamp('2015-07-22 10:00:00', 'yyyy-MM-dd'); -- timestamp `2015-07-22` (for Simple date formatter)
+SELECT get_timestamp('2015-07-22 10:00:00', 'yyyy-MM-dd'); -- (throws exception) (for Joda date formatter)
+SELECT unix_timestamp('2016-04-08 00:00:00', 'yyyy-MM-dd'); -- 1460041200 (for Simple date formatter)
+SELECT unix_timestamp('2016-04-08 00:00:00', 'yyyy-MM-dd'); -- (throws exception) (for Joda date formatter)
+
+For :spark:func:`from_unixtime` and :spark:func:`get_timestamp`, when `Simple` date formatter is used, null is
+returned for invalid format; otherwise, exception is thrown. ::
+
+SELECT from_unixtime(100, '!@#$%^&*'); -- NULL (parsing error) (for Simple date formatter)
+SELECT from_unixtime(100, '!@#$%^&*'); -- throws exception) (for Joda date formatter)
+SELECT get_timestamp('1970-01-01', '!@#$%^&*'); -- NULL (parsing error) (for Simple date formatter)
+SELECT get_timestamp('1970-01-01', '!@#$%^&*'); -- throws exception) (for Joda date formatter)

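The NULL-versus-exception split that the new documentation section describes for an invalid format can be sketched as follows. This is an illustration under stated assumptions, not Velox code: `DateFormatter`, `isSupportedPattern` and `buildFormatter` are hypothetical names, and the validity check accepts only a handful of pattern letters and separators rather than the full Spark pattern set.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for the two formatter families in the docs.
enum class DateFormatter { kJoda, kSimple };

// Simplified validity check (an assumption of this sketch): a format is
// accepted only if it is non-empty and every character is a known pattern
// letter or a plain separator.
bool isSupportedPattern(std::string_view format) {
  constexpr std::string_view kLetters = "yMdHms";
  constexpr std::string_view kSeparators = "-: /";
  if (format.empty()) {
    return false;
  }
  for (const char c : format) {
    if (kLetters.find(c) == std::string_view::npos &&
        kSeparators.find(c) == std::string_view::npos) {
      return false;
    }
  }
  return true;
}

// Returns the accepted format as a placeholder for a built formatter.
// Invalid format: std::nullopt for Simple (surfaced as SQL NULL),
// an exception for Joda.
std::optional<std::string> buildFormatter(
    std::string_view format,
    DateFormatter kind) {
  if (isSupportedPattern(format)) {
    return std::string(format);
  }
  if (kind == DateFormatter::kSimple) {
    return std::nullopt;
  }
  throw std::invalid_argument("Invalid format: " + std::string(format));
}
```

Under these assumptions, `buildFormatter("!@#$%^&*", DateFormatter::kSimple)` returns an empty optional, while the same format with `DateFormatter::kJoda` throws `std::invalid_argument`.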
5 changes: 3 additions & 2 deletions velox/functions/lib/DateTimeFormatter.cpp
@@ -1588,8 +1588,9 @@ Expected<DateTimeResult> DateTimeFormatter::parse(
     }
   }

-  // Ensure all input was consumed.
-  if (cur < end) {
+  // Ensure all input was consumed if type_ is not simple datetime formatter.
+  if (type_ != DateTimeFormatterType::LENIENT_SIMPLE &&
+      type_ != DateTimeFormatterType::STRICT_SIMPLE && cur < end) {
     return parseFail(input, cur, end);
   }

9 changes: 9 additions & 0 deletions velox/functions/lib/tests/DateTimeFormatterTest.cpp
@@ -2441,4 +2441,13 @@ TEST_F(SimpleDateTimeFormatterTest, formatWeekOfMonth) {
   }
 }

+TEST_F(SimpleDateTimeFormatterTest, parseUsingPartialInput) {
+  EXPECT_EQ(
+      fromTimestampString("2024-08-01"),
+      parseSimple("2024 08 01 5", "yyyy MM", true).timestamp);
+  EXPECT_EQ(
+      fromTimestampString("2024-08-01"),
+      parseSimple("2024 08 01 5", "yyyy MM", false).timestamp);
+}
+
 } // namespace facebook::velox::functions
