Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh: add dt.timestamp #1220

Merged
merged 29 commits into from
Oct 21, 2024
Merged

Conversation

raisadz
Copy link
Contributor

@raisadz raisadz commented Oct 18, 2024

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below.

@github-actions github-actions bot added the enhancement New feature or request label Oct 18, 2024
Comment on lines 793 to 798
if time_unit == "ns":
result = s_cast
if time_unit == "us":
result = pc.divide(s_cast, 1_000)
if time_unit == "ms":
result = pc.divide(s_cast, 1_000_000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for all of these, it would be marginally more efficient to use elif, else we're doing similar comparisons

we're already validating time_unit in narwhals/expr.py, so this could just be

                if time_unit == "ns":
                    result = pc.multiply(s_cast, 1_000)
                elif time_unit == "us":
                    result = s_cast
                else:
                    result = pc.divide(s_cast, 1_000)

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @raisadz , serious effort here, looks really nice, and this is a really valuable feature!

I'd suggest adding in a hypothesis test, something like

from hypothesis import given
import hypothesis.strategies as st

@given(
    inputs=st.datetimes(min_value=datetime(1960,1,1), max_value=datetime(1980,1,1)),
    time_unit=st.sampled_from(['ms', 'us', 'ns']),
    # We keep 'ms' out for now due to an upstream bug: https://github.com/pola-rs/polars/issues/19309
    starting_time_unit=st.sampled_from(['us', 'ns'])
)
def test_me(inputs, time_unit, starting_time_unit) -> None:
    import polars as pl
    import pandas as pd
    import pyarrow as pa
    @nw.narwhalify
    def func(s):
        return s.dt.timestamp(time_unit)
    result_pl = func(pl.Series([inputs], dtype=pl.Datetime(starting_time_unit)))
    result_pd = func(pd.Series([inputs], dtype=f'datetime64[{starting_time_unit}]'))
    result_pdpa = func(pd.Series([inputs], dtype=f'timestamp[{starting_time_unit}][pyarrow]'))
    result_pa = func(pa.chunked_array([[inputs]], type=pa.timestamp(starting_time_unit)))
    assert result_pl[0] == result_pd[0]
    assert result_pl[0] == result_pdpa[0]
    assert result_pl[0] == result_pa[0].as_py()

(but with a better name 😅 )

result = s_cast
if time_unit == "ms":
result = pc.divide(s_cast, 1_000)
if unit == "ms":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with these ifs

if time_unit == "ns":
result = s_cast
if time_unit == "us":
result = pc.divide(s_cast, 1_000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we'll need floordiv_compat instead of divide so it's consistent with how we do it for the other backends

@FBruzzesi
Copy link
Member

FBruzzesi commented Oct 18, 2024

I am catching up with Marco's today's livestream - as he explains how the integer representation works, wouldn't it make sense to have something like the following pseudocode (details such as nulls, time units, time zones, date vs datetime, etc need to be taken care)

diff = series - datetime(1970, 1, 1, 0, 0, 0, tzinfo=...)
if time_unit == "ns":
    return diff.dt.total_nanoseconds()
elif ...
...

Maybe it ends up being similar effort, but feels less redundant than all the nested if cases. I just wanted to throw the idea out there, feel free to ignore it :)

Edit: pyarrow also has nanoseconds_between, microseconds_between and milliseconds_between two date/datetimes.

Comment on lines 141 to 143
@nw.narwhalify
def func(s: nw.Series) -> nw.Series:
return s.dt.timestamp(time_unit) # type: ignore[return-value]
Copy link
Member

@MarcoGorelli MarcoGorelli Oct 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type ignore is due to #1229

@MarcoGorelli
Copy link
Member

thanks @FBruzzesi for taking a look! total_nanoseconds is for Duration, not Datetime, so we can't use it here unfortunately 😅

@FBruzzesi
Copy link
Member

thanks @FBruzzesi for taking a look! total_nanoseconds is for Duration, not Datetime, so we can't use it here unfortunately 😅

I am most likely missing something - what I am suggesting is, instead of using the date(time) representation, compute the difference between the series/expr of interest and the scalar 1970-01-01 00:00:00 (unix_epoch). This would result in a Duration object (wouldn't it?) and then use the Duration methods. Sorry if I misunderstood the internal representation explanation

@MarcoGorelli
Copy link
Member

ah sorry you're right 🤦 maybe that's indeed simpler, will try!

@MarcoGorelli
Copy link
Member

i think a cast would always be more efficient, from a little timing test:

In [50]: %timeit (s - datetime(1970,1,1)).dt.total_seconds()
465 μs ± 64.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [51]: %timeit s.astype('int64')
52.9 μs ± 10.8 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

@MarcoGorelli MarcoGorelli marked this pull request as ready for review October 20, 2024 11:43
Copy link
Member

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing stuff 🔥 I left a few comments for me to learn 🙈

("s", "ms", [978307200000, None, 978480000000]),
],
)
def test_timestamp_datetimes_tz_aware(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one 👌

@@ -4001,3 +4001,63 @@ def convert_time_zone(self, time_zone: str) -> Series:
return self._narwhals_series._from_compliant_series(
self._narwhals_series._compliant_series.dt.convert_time_zone(time_zone)
)

def timestamp(self, time_unit: Literal["ns", "us", "ms"] = "us") -> Series:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up if merging with main branch:

Suggested change
def timestamp(self, time_unit: Literal["ns", "us", "ms"] = "us") -> Series:
def timestamp(self: Self, time_unit: Literal["ns", "us", "ms"] = "us") -> T:

(and same in Expr)

is_pyarrow_dtype = "pyarrow" in str(self._pandas_series._native_series.dtype)
mask_na = s.isna()
if dtype == self._pandas_series._dtypes.Date:
s_cast = s.astype("Int32[pyarrow]")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because date can only be pyarrow backed, is that right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup! will add a comment

Comment on lines +964 to +968
s_cast = s.view("Int64[pyarrow]") if is_pyarrow_dtype else s.view("int64")
else:
s_cast = (
s.astype("Int64[pyarrow]") if is_pyarrow_dtype else s.astype("int64")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume .view is more efficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the most ridiculous thing - pandas 1.5 emits a warning telling you to use view instead of astype, and pandas 2.x emits a warning telling you to use astype instead of view because view is deprecated 🤦 🤦 🤦

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @raisadz , and @FBruzzesi for reviewing!

@MarcoGorelli MarcoGorelli merged commit c692b81 into narwhals-dev:main Oct 21, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

enh: add dt.timestamp
3 participants