[SPARK-26887][SQL][PYTHON] Create datetime.date directly instead of creating datetime64[ns] as intermediate data. #23795
Conversation
Test build #102383 has finished for PR 23795 at commit
LGTM
pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
return pdf
# Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
if LooseVersion(pa.__version__) < LooseVersion("0.11.0"):
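For readers following along, here is a minimal sketch of what this version gate enables. It is not Spark's actual code: the helper name and structure are hypothetical, and only the `LooseVersion` check and pyarrow's `to_pandas(date_as_object=...)` keyword come from the diff.

```python
from distutils.version import LooseVersion

import pyarrow as pa


def arrow_table_to_pandas_with_dates(table):
    # Hypothetical helper, not Spark's internal implementation.
    if LooseVersion(pa.__version__) >= LooseVersion("0.11.0"):
        # pyarrow >= 0.11.0 can hand back datetime.date objects directly,
        # skipping the datetime64[ns] intermediate entirely.
        return table.to_pandas(date_as_object=True)
    # On older pyarrow, keep the previous behavior: convert afterwards.
    pdf = table.to_pandas()
    for field in table.schema:
        if pa.types.is_date(field.type):
            pdf[field.name] = pdf[field.name].dt.date
    return pdf
```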
Looks good @ueshin.
@ueshin, @BryanCutler , BTW, which version of PyArrow do you think we should bump up to in Spark 3.0.0? I was thinking about matching it to 0.12.0, or 0.11.0. I think it's too much overhead for us to test all the pyarrow versions.
It would be nice to bump to 0.12.0 because I think that would allow us to clean up the code the most, but since an error is raised if the user doesn't have that version, it might be too restrictive. Let's definitely make a JIRA to discuss more.
Minor (optional) suggestions about comments to make it clearer for the future, thanks for working on this :)
python/pyspark/sql/types.py
Outdated
""" Convert Arrow Column to pandas Series. | ||
|
||
If the given column is a date type column, creates a series of datetime.date directly instead | ||
of creating datetime64[ns] as intermediate data. |
minor: I think these details belong as a comment internally rather than in the doc string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to say that for dates this will return datetime.date, but yeah maybe move the part about datetime64 as intermediate to an internal comment. _arrow_table_to_pandas has a comment that the reason for this is to match pyspark w/o arrow, but maybe it would be good to add here as well.
python/pyspark/sql/types.py
Outdated
# As of Arrow 0.12.0, date_as_object is True by default, see ARROW-3910
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
    return series.dt.date
# Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
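To illustrate the fallback that snippet relies on (a standalone sketch, independent of Spark's code): `Series.dt.date` converts a `datetime64[ns]` series into plain `datetime.date` objects, matching what the non-Arrow path returns.

```python
import pandas as pd

# What older pyarrow hands back for a date column: a datetime64[ns] series.
s = pd.Series(pd.to_datetime(["2019-02-14", "2019-02-15"]))
print(s.dtype)               # datetime64[ns]

# The fallback conversion from the diff above: plain datetime.date objects.
dates = s.dt.date
print(type(dates.iloc[0]))   # <class 'datetime.date'>
```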
Include a comment about the overflow here so we know why we are avoiding np.datetime64.
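For reference, the overflow being discussed comes from the limited range of nanosecond-resolution timestamps. A small demonstration using only pandas (nothing Spark-specific is assumed):

```python
import datetime

import pandas as pd

# datetime64[ns] can only represent dates between roughly 1677-09-21 and 2262-04-11.
print(pd.Timestamp.min)              # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)              # 2262-04-11 23:47:16.854775807

# A perfectly valid date just past that range cannot be represented as
# datetime64[ns], which is how 2262-04-12 ends up shown as 1677-09-21
# in the repro in the PR description.
d = datetime.date(2262, 4, 12)
print(d > pd.Timestamp.max.date())   # True
```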
LGTM, thanks @ueshin !
Looks good to me too anyway.
LGTM
I'm merging this. The last commits were just about fixing comments, and the PEP8 check has already passed.
Merged to master.
Test build #102441 has finished for PR 23795 at commit
Test build #102442 has finished for PR 23795 at commit
…of creating datetime64 as intermediate data.

## What changes were proposed in this pull request?

Currently `DataFrame.toPandas()` with arrow enabled or `ArrowStreamPandasSerializer` for pandas UDF with pyarrow<0.12 creates `datetime64[ns]` type series as intermediate data and then converts to `datetime.date` series, but the intermediate `datetime64[ns]` might cause an overflow even if the date is valid.

```
>>> import datetime
>>>
>>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
>>>
>>> df = spark.createDataFrame(t, 'date')
>>> df.show()
+----------+
|     value|
+----------+
|2262-04-12|
|2263-04-12|
+----------+
>>>
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>>
>>> df.toPandas()
        value
0  1677-09-21
1  1678-09-21
```

We should avoid creating such intermediate data and create `datetime.date` series directly instead.

## How was this patch tested?

Modified some tests to include dates whose overflow is caused by the intermediate conversion.

Run tests with pyarrow 0.8, 0.10, 0.11, 0.12 in my local environment.

Closes apache#23795 from ueshin/issues/SPARK-26887/date_as_object.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

Currently `DataFrame.toPandas()` with arrow enabled or `ArrowStreamPandasSerializer` for pandas UDF with pyarrow<0.12 creates `datetime64[ns]` type series as intermediate data and then converts to `datetime.date` series, but the intermediate `datetime64[ns]` might cause an overflow even if the date is valid.

We should avoid creating such intermediate data and create `datetime.date` series directly instead.

How was this patch tested?

Modified some tests to include dates whose overflow is caused by the intermediate conversion.

Run tests with pyarrow 0.8, 0.10, 0.11, 0.12 in my local environment.
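A round-trip check along these lines would exercise the fix (a sketch only, not the exact test added by the PR; it assumes a running SparkSession bound to `spark` and pyarrow >= 0.11 installed):

```python
import datetime

# Dates just beyond the datetime64[ns] range, which previously came back corrupted.
dates = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]

# Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(dates, 'date')
pdf = df.toPandas()

# With the fix, the values survive unchanged and stay plain datetime.date objects.
assert list(pdf['value']) == dates
assert all(isinstance(d, datetime.date) for d in pdf['value'])
```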