Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-44717][PYTHON][PS] Respect TimestampNTZ in resampling #42392

Closed
wants to merge 2 commits into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to respect TimestampNTZ type in resampling at pandas API on Spark.

Why are the changes needed?

It still operates as if the timestamps are TIMESTAMP_LTZ even when spark.sql.timestampType is set to TIMESTAMP_NTZ, which is unexpected.

Does this PR introduce any user-facing change?

This fixes a bug so end users can use exactly same behaviour with pandas with TimestampNTZType - pandas does not respect the local timezone with DST. While we might need to follow this even for TimestampType, this PR does not address the case as it might be controversial.

How was this patch tested?

Unittest was added.

@HyukjinKwon
Copy link
Member Author

cc @zhengruifeng and @attilapiros FYI

@HyukjinKwon
Copy link
Member Author

cc @gengliangwang too

Comment on lines 265 to 266
def test_series_resample(self):
self.check_series_resample()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are still depending on the TZ environment setting indirectly. So just by running the test on a different TZ this test (and also the test_dataframe_resample) would simply fail.

We should either guarantee the correct TZ in the beginning of the test or validate the assumption and produce a meaningful error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon this is still an issue. The old tests still are failing when the TZ is not UTC i.e in America/New_York.

See https://github.com/attilapiros/spark/actions/runs/5884585704/job/15960073184?pr=5

Anyway for production we have the "spark.sql.timestampType=TIMESTAMP_NTZ" settings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ... this PR enables spark.sql.timestampType=TIMESTAMP_NTZ as a workaround for now ..
To fix this, we need a bigger scope of change .. and can be arguable in a way.

class ResampleWithTimezoneTests(
ResampleWithTimezoneMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
):
@unittest.skip("SPARK-44731: Support 'spark.sql.timestampType' in Python Spark Connect client")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ueshin FYI

@HyukjinKwon
Copy link
Member Author

Merged to master and branch-3.5.

HyukjinKwon added a commit that referenced this pull request Aug 9, 2023
### What changes were proposed in this pull request?

This PR proposes to respect `TimestampNTZ` type in resampling at pandas API on Spark.

### Why are the changes needed?

It still operates as if the timestamps are `TIMESTAMP_LTZ` even when `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`, which is unexpected.

### Does this PR introduce _any_ user-facing change?

This fixes a bug so end users can use exactly same behaviour with pandas with `TimestampNTZType` - pandas does not respect the local timezone with DST. While we might need to follow this even for `TimestampType`, this PR does not address the case as it might be controversial.

### How was this patch tested?

Unittest was added.

Closes #42392 from HyukjinKwon/SPARK-44717.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e05959e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
zhengruifeng pushed a commit that referenced this pull request Aug 11, 2023
…in Python Spark Connect

### What changes were proposed in this pull request?

This PR proposes:
- Share the namespaces for `to_timestamp_ntz`, `to_timestamp_ltz` and `to_unix_timestamp` in Spark Connect. They were missed.
- Adds the support of `TimestampNTZ` for literal handling in Python Spark Connect (by respecting `spark.sql.timestampType`).

### Why are the changes needed?

For feature parity, and respect timestamp ntz in resampling in pandas API on Spark

### Does this PR introduce _any_ user-facing change?

Yes, this virtually fixes the same bug: #42392 in Spark Connect with Python.

### How was this patch tested?

Unittests reenabled.

Closes #42445 from HyukjinKwon/SPARK-44731.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
zhengruifeng pushed a commit that referenced this pull request Aug 11, 2023
…in Python Spark Connect

### What changes were proposed in this pull request?

This PR proposes:
- Share the namespaces for `to_timestamp_ntz`, `to_timestamp_ltz` and `to_unix_timestamp` in Spark Connect. They were missed.
- Adds the support of `TimestampNTZ` for literal handling in Python Spark Connect (by respecting `spark.sql.timestampType`).

### Why are the changes needed?

For feature parity, and respect timestamp ntz in resampling in pandas API on Spark

### Does this PR introduce _any_ user-facing change?

Yes, this virtually fixes the same bug: #42392 in Spark Connect with Python.

### How was this patch tested?

Unittests reenabled.

Closes #42445 from HyukjinKwon/SPARK-44731.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
(cherry picked from commit 73b0376)
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
hvanhovell pushed a commit to hvanhovell/spark that referenced this pull request Aug 13, 2023
…in Python Spark Connect

### What changes were proposed in this pull request?

This PR proposes:
- Share the namespaces for `to_timestamp_ntz`, `to_timestamp_ltz` and `to_unix_timestamp` in Spark Connect. They were missed.
- Adds the support of `TimestampNTZ` for literal handling in Python Spark Connect (by respecting `spark.sql.timestampType`).

### Why are the changes needed?

For feature parity, and respect timestamp ntz in resampling in pandas API on Spark

### Does this PR introduce _any_ user-facing change?

Yes, this virtually fixes the same bug: apache#42392 in Spark Connect with Python.

### How was this patch tested?

Unittests reenabled.

Closes apache#42445 from HyukjinKwon/SPARK-44731.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
### What changes were proposed in this pull request?

This PR proposes to respect `TimestampNTZ` type in resampling at pandas API on Spark.

### Why are the changes needed?

It still operates as if the timestamps are `TIMESTAMP_LTZ` even when `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`, which is unexpected.

### Does this PR introduce _any_ user-facing change?

This fixes a bug so end users can use exactly same behaviour with pandas with `TimestampNTZType` - pandas does not respect the local timezone with DST. While we might need to follow this even for `TimestampType`, this PR does not address the case as it might be controversial.

### How was this patch tested?

Unittest was added.

Closes apache#42392 from HyukjinKwon/SPARK-44717.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
valentinp17 pushed a commit to valentinp17/spark that referenced this pull request Aug 24, 2023
…in Python Spark Connect

### What changes were proposed in this pull request?

This PR proposes:
- Share the namespaces for `to_timestamp_ntz`, `to_timestamp_ltz` and `to_unix_timestamp` in Spark Connect. They were missed.
- Adds the support of `TimestampNTZ` for literal handling in Python Spark Connect (by respecting `spark.sql.timestampType`).

### Why are the changes needed?

For feature parity, and respect timestamp ntz in resampling in pandas API on Spark

### Does this PR introduce _any_ user-facing change?

Yes, this virtually fixes the same bug: apache#42392 in Spark Connect with Python.

### How was this patch tested?

Unittests reenabled.

Closes apache#42445 from HyukjinKwon/SPARK-44731.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
@HyukjinKwon HyukjinKwon deleted the SPARK-44717 branch January 15, 2024 00:48
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
### What changes were proposed in this pull request?

This PR proposes to respect `TimestampNTZ` type in resampling at pandas API on Spark.

### Why are the changes needed?

It still operates as if the timestamps are `TIMESTAMP_LTZ` even when `spark.sql.timestampType` is set to `TIMESTAMP_NTZ`, which is unexpected.

### Does this PR introduce _any_ user-facing change?

This fixes a bug so end users can use exactly same behaviour with pandas with `TimestampNTZType` - pandas does not respect the local timezone with DST. While we might need to follow this even for `TimestampType`, this PR does not address the case as it might be controversial.

### How was this patch tested?

Unittest was added.

Closes apache#42392 from HyukjinKwon/SPARK-44717.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ragnarok56 pushed a commit to ragnarok56/spark that referenced this pull request Mar 2, 2024
…in Python Spark Connect

### What changes were proposed in this pull request?

This PR proposes:
- Share the namespaces for `to_timestamp_ntz`, `to_timestamp_ltz` and `to_unix_timestamp` in Spark Connect. They were missed.
- Adds the support of `TimestampNTZ` for literal handling in Python Spark Connect (by respecting `spark.sql.timestampType`).

### Why are the changes needed?

For feature parity, and respect timestamp ntz in resampling in pandas API on Spark

### Does this PR introduce _any_ user-facing change?

Yes, this virtually fixes the same bug: apache#42392 in Spark Connect with Python.

### How was this patch tested?

Unittests reenabled.

Closes apache#42445 from HyukjinKwon/SPARK-44731.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants