[SPARK-26887][SQL][PYTHON] Create datetime.date directly instead of creating datetime64[ns] as intermediate data. #23795
Conversation
Test build #102383 has finished for PR 23795 at commit
LGTM
pdf[field.name] = _check_series_convert_date(pdf[field.name], field.dataType)
return pdf
# Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
if LooseVersion(pa.__version__) < LooseVersion("0.11.0"):
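For readers following along, here is a minimal sketch of what this version gate enables. It is not Spark's actual code: the helper name and structure are hypothetical, and only the `LooseVersion` check and pyarrow's `to_pandas(date_as_object=...)` keyword come from the diff.

```python
from distutils.version import LooseVersion

import pyarrow as pa


def arrow_table_to_pandas_with_dates(table):
    # Hypothetical helper, not Spark's internal implementation.
    if LooseVersion(pa.__version__) >= LooseVersion("0.11.0"):
        # pyarrow >= 0.11.0 can hand back datetime.date objects directly,
        # skipping the datetime64[ns] intermediate entirely.
        return table.to_pandas(date_as_object=True)
    # On older pyarrow, keep the previous behavior: convert afterwards.
    pdf = table.to_pandas()
    for field in table.schema:
        if pa.types.is_date(field.type):
            pdf[field.name] = pdf[field.name].dt.date
    return pdf
```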
Looks good @ueshin.
@ueshin, @BryanCutler , BTW, which version of PyArrow do you think we should bump up to in Spark 3.0.0? I was thinking about matching it to 0.12.0, or 0.11.0. I think it's too much overhead for us to test all the pyarrow versions.
It would be nice to bump to 0.12.0 because I think that would allow us to clean up the code the most, but since an error is raised if the user doesn't have that version, it might be too restrictive. Let's definitely make a JIRA to discuss more.
Minor (optional) suggestions about comments to make it clearer for the future, thanks for working on this :)
python/pyspark/sql/types.py
Outdated
""" Convert Arrow Column to pandas Series. | ||
|
||
If the given column is a date type column, creates a series of datetime.date directly instead | ||
of creating datetime64[ns] as intermediate data. |
minor: I think these details belong as a comment internally rather than in the doc string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to say that for dates this will return datetime.date, but yeah maybe move the part about datetime64 as intermediate to an internal comment. _arrow_table_to_pandas has a comment that the reason for this is to match pyspark w/o arrow, but maybe it would be good to add here as well.
python/pyspark/sql/types.py
Outdated
# As of Arrow 0.12.0, date_as_object is True by default, see ARROW-3910
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.0") and type(data_type) == DateType:
    return series.dt.date
# Since Arrow 0.11.0, support date_as_object to return datetime.date instead of np.datetime64.
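To illustrate the fallback that snippet relies on (a standalone sketch, independent of Spark's code): `Series.dt.date` converts a `datetime64[ns]` series into plain `datetime.date` objects, matching what the non-Arrow path returns.

```python
import pandas as pd

# What older pyarrow hands back for a date column: a datetime64[ns] series.
s = pd.Series(pd.to_datetime(["2019-02-14", "2019-02-15"]))
print(s.dtype)               # datetime64[ns]

# The fallback conversion from the diff above: plain datetime.date objects.
dates = s.dt.date
print(type(dates.iloc[0]))   # <class 'datetime.date'>
```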
Include a comment about the overflow here so we know why we are avoiding np.datetime64.
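For reference, the overflow being discussed comes from the limited range of nanosecond-resolution timestamps. A small demonstration using only pandas (nothing Spark-specific is assumed):

```python
import datetime

import pandas as pd

# datetime64[ns] can only represent dates between roughly 1677-09-21 and 2262-04-11.
print(pd.Timestamp.min)              # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)              # 2262-04-11 23:47:16.854775807

# A perfectly valid date just past that range cannot be represented as
# datetime64[ns], which is how 2262-04-12 ends up shown as 1677-09-21
# in the repro in the PR description.
d = datetime.date(2262, 4, 12)
print(d > pd.Timestamp.max.date())   # True
```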
LGTM, thanks @ueshin !
Looks good to me too anyway.
LGTM
I'm merging this. The last commits were just about fixing comments, and the PEP8 check has already passed.
Merged to master.
Test build #102441 has finished for PR 23795 at commit
Test build #102442 has finished for PR 23795 at commit
…of creating datetime64 as intermediate data.

## What changes were proposed in this pull request?

Currently `DataFrame.toPandas()` with arrow enabled or `ArrowStreamPandasSerializer` for pandas UDF with pyarrow<0.12 creates `datetime64[ns]` type series as intermediate data and then converts to `datetime.date` series, but the intermediate `datetime64[ns]` might cause an overflow even if the date is valid.

```
>>> import datetime
>>>
>>> t = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]
>>>
>>> df = spark.createDataFrame(t, 'date')
>>> df.show()
+----------+
|     value|
+----------+
|2262-04-12|
|2263-04-12|
+----------+
>>>
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>>
>>> df.toPandas()
        value
0  1677-09-21
1  1678-09-21
```

We should avoid creating such intermediate data and create `datetime.date` series directly instead.

## How was this patch tested?

Modified some tests to include dates whose overflow is caused by the intermediate conversion.

Run tests with pyarrow 0.8, 0.10, 0.11, 0.12 in my local environment.

Closes apache#23795 from ueshin/issues/SPARK-26887/date_as_object.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?

Currently `DataFrame.toPandas()` with arrow enabled or `ArrowStreamPandasSerializer` for pandas UDF with pyarrow<0.12 creates `datetime64[ns]` type series as intermediate data and then converts to `datetime.date` series, but the intermediate `datetime64[ns]` might cause an overflow even if the date is valid.

We should avoid creating such intermediate data and create `datetime.date` series directly instead.

How was this patch tested?

Modified some tests to include dates whose overflow is caused by the intermediate conversion.

Run tests with pyarrow 0.8, 0.10, 0.11, 0.12 in my local environment.
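A round-trip check along these lines would exercise the fix (a sketch only, not the exact test added by the PR; it assumes a running SparkSession bound to `spark` and pyarrow >= 0.11 installed):

```python
import datetime

# Dates just beyond the datetime64[ns] range, which previously came back corrupted.
dates = [datetime.date(2262, 4, 12), datetime.date(2263, 4, 12)]

# Assumes an existing SparkSession named `spark`.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
df = spark.createDataFrame(dates, 'date')
pdf = df.toPandas()

# With the fix, the values survive unchanged and stay plain datetime.date objects.
assert list(pdf['value']) == dates
assert all(isinstance(d, datetime.date) for d in pdf['value'])
```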