
[SPARK-26887][SQL][PYTHON] Create datetime.date directly instead of creating datetime64[ns] as intermediate data. #23795

Closed

Conversation


@ueshin ueshin commented Feb 15, 2019

What changes were proposed in this pull request?

Currently, `DataFrame.toPandas()` with Arrow enabled, or `ArrowStreamPandasSerializer` for pandas UDFs with pyarrow<0.12, creates a `datetime64[ns]` series as intermediate data and then converts it to a `datetime.date` series, but the intermediate `datetime64[ns]` can overflow even when the date itself is valid:

```
>>> import datetime
>>>
>>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
>>>
>>> df = spark.createDataFrame(t, 'date')
>>> df.show()
+----------+
|     value|
+----------+
|2262-04-12|
|2263-04-12|
+----------+

>>>
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>>
>>> df.toPandas()
        value
0  1677-09-21
1  1678-09-21
```

We should avoid creating such intermediate data and create `datetime.date` series directly instead.
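With pyarrow 0.11.0 and later, `Table.to_pandas(date_as_object=True)` returns `datetime.date` objects directly. A minimal standalone sketch of the difference, using only pyarrow (no Spark; the column name `value` mirrors the example above):

```
import datetime
import pyarrow as pa

# Both dates lie past the datetime64[ns] upper bound (2262-04-11).
arr = pa.array([datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)],
               type=pa.date32())
table = pa.Table.from_arrays([arr], names=['value'])

# date_as_object=True keeps the values as datetime.date objects, so no
# datetime64[ns] intermediate is created and nothing can overflow.
pdf = table.to_pandas(date_as_object=True)
print(pdf['value'].tolist())
# [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
```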

How was this patch tested?

Modified some tests to include dates that overflow due to the intermediate conversion.
Ran the tests with pyarrow 0.8, 0.10, 0.11, and 0.12 in my local environment.
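For illustration, a regression test in the spirit of those modifications might look like the sketch below; the class and method names are hypothetical, and it assumes a SparkSession fixture `self.spark` as in the pyspark.sql test suites:

```
import datetime

from pyspark.sql.tests import ReusedSQLTestCase  # hypothetical base class usage


class ArrowDateOverflowTest(ReusedSQLTestCase):

    def test_to_pandas_date_past_ns_range(self):
        # Both dates lie past the datetime64[ns] upper bound (2262-04-11);
        # before this fix the Arrow path returned 1677/1678 dates instead.
        dates = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
        df = self.spark.createDataFrame(dates, 'date')
        self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
        pdf = df.toPandas()
        self.assertEqual(list(pdf['value']), dates)
```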



SparkQA commented Feb 15, 2019

Test build #102383 has finished for PR 23795 at commit 8ac7925.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@sadikovi sadikovi left a comment


LGTM

```
pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
return pdf
# Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
if LooseVersion(pa.__version__) < LooseVersion("0.11.0"):
```
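For context, this gate exists because pyarrow only added `date_as_object` in 0.11.0; a condensed sketch of the two resulting paths (illustrative, not the PR's exact helper):

```
from distutils.version import LooseVersion

import pyarrow as pa


def _table_to_pandas(table):  # hypothetical name
    if LooseVersion(pa.__version__) < LooseVersion("0.11.0"):
        # Old pyarrow: to_pandas() yields datetime64[ns] for date columns,
        # so each date column must be converted back to datetime.date
        # afterwards (the per-field _check_series_convert_date step above).
        return table.to_pandas()
    # 0.11.0+: ask Arrow for datetime.date objects directly and skip the
    # datetime64[ns] intermediate entirely.
    return table.to_pandas(date_as_object=True)
```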

Looks good @ueshin.

@ueshin, @BryanCutler, BTW, which version of PyArrow do you think we should bump up to in Spark 3.0.0? I was thinking about matching it to 0.12.0 or 0.11.0. I think it's an overhead that we have to test all the pyarrow versions.


It would be nice to bump to 0.12.0 because I think that would allow us to clean up the code the most, but since an error is raised if the user doesn't have that version, it might be too restrictive. Let's definitely make a JIRA to discuss more.

@holdenk holdenk left a comment


Minor (optional) suggestions about comments to make things clearer for the future; thanks for working on this :)

""" Convert Arrow Column to pandas Series.

If the given column is a date type column, creates a series of datetime.date directly instead
of creating datetime64[ns] as intermediate data.

minor: I think these details belong as a comment internally rather than in the doc string.


It would be nice to say that for dates this will return datetime.date, but yeah, maybe move the part about datetime64[ns] as intermediate data to an internal comment. _arrow_table_to_pandas has a comment that the reason for this is to match PySpark without Arrow; maybe it would be good to add that here as well.

```
# As of Arrow 0.12.0, date_as_object is True by default, see ARROW-3910
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
    return series.dt.date
# Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
```

Include a comment about the overflow here so we know why we are avoiding np.datetime64.
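For instance, the hunk above could carry something like this (illustrative wording only; per the merge note below, the final comments were settled in the last commits):

```
# As of Arrow 0.12.0, date_as_object is True by default, see ARROW-3910.
# We avoid np.datetime64 for dates because its nanosecond precision cannot
# represent dates past 2262-04-11, so valid dates could overflow.
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
    return series.dt.date
```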


@BryanCutler BryanCutler left a comment


LGTM, thanks @ueshin!

""" Convert Arrow Column to pandas Series.

If the given column is a date type column, creates a series of datetime.date directly instead
of creating datetime64[ns] as intermediate data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to say that for dates this will return datetime.date, but yeah maybe move the part about datetime[64] as intermediate to an internal comment. _arrow_table_to_pandas has a comment that the reason for this is to match pyspark w/o arrow, but maybe it would be good to add here as well.


@HyukjinKwon HyukjinKwon left a comment


Looks good to me too anyway.


@HyukjinKwon HyukjinKwon left a comment


LGTM

@HyukjinKwon

I'm merging this. The last commits were just comment fixes, and the PEP8 check has already passed.

@HyukjinKwon

Merged to master.


SparkQA commented Feb 18, 2019

Test build #102441 has finished for PR 23795 at commit b93b500.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 18, 2019

Test build #102442 has finished for PR 23795 at commit cca2634.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

…of creating datetime64 as intermediate data.

Closes apache#23795 from ueshin/issues/SPARK-26887/date_as_object.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
simon-slowik pushed a commit to simon-slowik/spark that referenced this pull request Jun 26, 2019

…of creating datetime64 as intermediate data.

Closes apache#23795 from ueshin/issues/SPARK-26887/date_as_object.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>