-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enh: add dt.timestamp #1220
enh: add dt.timestamp #1220
Conversation
…k pyarrow types to to_date test
narwhals/_arrow/series.py
Outdated
if time_unit == "ns": | ||
result = s_cast | ||
if time_unit == "us": | ||
result = pc.divide(s_cast, 1_000) | ||
if time_unit == "ms": | ||
result = pc.divide(s_cast, 1_000_000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for all of these, it would be marginally more efficient to use elif
, else we're doing similar comparisons
we're already validating time_unit
in narwhals/expr.py
, so this could just be
if time_unit == "ns":
result = pc.multiply(s_cast, 1_000)
elif time_unit == "us":
result = s_cast
else:
result = pc.divide(s_cast, 1_000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @raisadz , serious effort here, looks really nice, and this is a really valuable feature!
I'd suggest adding in a hypothesis test, something like
from hypothesis import given
import hypothesis.strategies as st
@given(
inputs=st.datetimes(min_value=datetime(1960,1,1), max_value=datetime(1980,1,1)),
time_unit=st.sampled_from(['ms', 'us', 'ns']),
# We keep 'ms' out for now due to an upstream bug: https://github.com/pola-rs/polars/issues/19309
starting_time_unit=st.sampled_from(['us', 'ns'])
)
def test_me(inputs, time_unit, starting_time_unit) -> None:
import polars as pl
import pandas as pd
import pyarrow as pa
@nw.narwhalify
def func(s):
return s.dt.timestamp(time_unit)
result_pl = func(pl.Series([inputs], dtype=pl.Datetime(starting_time_unit)))
result_pd = func(pd.Series([inputs], dtype=f'datetime64[{starting_time_unit}]'))
result_pdpa = func(pd.Series([inputs], dtype=f'timestamp[{starting_time_unit}][pyarrow]'))
result_pa = func(pa.chunked_array([[inputs]], type=pa.timestamp(starting_time_unit)))
assert result_pl[0] == result_pd[0]
assert result_pl[0] == result_pdpa[0]
assert result_pl[0] == result_pa[0].as_py()
(but with a better name 😅 )
narwhals/_arrow/series.py
Outdated
result = s_cast | ||
if time_unit == "ms": | ||
result = pc.divide(s_cast, 1_000) | ||
if unit == "ms": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with these ifs
narwhals/_arrow/series.py
Outdated
if time_unit == "ns": | ||
result = s_cast | ||
if time_unit == "us": | ||
result = pc.divide(s_cast, 1_000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we'll need floordiv_compat
instead of divide
so it's consistent with how we do it for the other backends
I am catching up with Marco's today's livestream - as he explains how the integer representation works, wouldn't it make sense to have something like the following pseudocode (details such as nulls, time units, time zones, date vs datetime, etc need to be taken care) diff = series - datetime(1970, 1, 1, 0, 0, 0, tzinfo=...)
if time_unit == "ns":
return diff.dt.total_nanoseconds()
elif ...
... Maybe it ends up being similar effort, but feels less redundant than all the nested Edit: pyarrow also has |
@nw.narwhalify | ||
def func(s: nw.Series) -> nw.Series: | ||
return s.dt.timestamp(time_unit) # type: ignore[return-value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the type ignore is due to #1229
thanks @FBruzzesi for taking a look! |
I am most likely missing something - what I am suggesting is, instead of using the date(time) representation, compute the difference between the series/expr of interest and the scalar 1970-01-01 00:00:00 (unix_epoch). This would result in a Duration object (wouldn't it?) and then use the Duration methods. Sorry if I misunderstood the internal representation explanation |
ah sorry you're right 🤦 maybe that's indeed simpler, will try! |
i think a cast would always be more efficient, from a little timing test:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing stuff 🔥 I left a few comments for me to learn 🙈
("s", "ms", [978307200000, None, 978480000000]), | ||
], | ||
) | ||
def test_timestamp_datetimes_tz_aware( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one 👌
narwhals/series.py
Outdated
@@ -4001,3 +4001,63 @@ def convert_time_zone(self, time_zone: str) -> Series: | |||
return self._narwhals_series._from_compliant_series( | |||
self._narwhals_series._compliant_series.dt.convert_time_zone(time_zone) | |||
) | |||
|
|||
def timestamp(self, time_unit: Literal["ns", "us", "ms"] = "us") -> Series: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heads up if merging with main branch:
def timestamp(self, time_unit: Literal["ns", "us", "ms"] = "us") -> Series: | |
def timestamp(self: Self, time_unit: Literal["ns", "us", "ms"] = "us") -> T: |
(and same in Expr)
is_pyarrow_dtype = "pyarrow" in str(self._pandas_series._native_series.dtype) | ||
mask_na = s.isna() | ||
if dtype == self._pandas_series._dtypes.Date: | ||
s_cast = s.astype("Int32[pyarrow]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's because date can only be pyarrow backed, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup! will add a comment
s_cast = s.view("Int64[pyarrow]") if is_pyarrow_dtype else s.view("int64") | ||
else: | ||
s_cast = ( | ||
s.astype("Int64[pyarrow]") if is_pyarrow_dtype else s.astype("int64") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume .view
is more efficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the most ridiculous thing - pandas 1.5 emits a warning telling you to use view
instead of astype
, and pandas 2.x emits a warning telling you to use astype
instead of view
because view
is deprecated 🤦 🤦 🤦
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @raisadz , and @FBruzzesi for reviewing!
What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below.