-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Narwhals implementation of from_dataframe
and performance benchmark
#2661
base: master
Are you sure you want to change the base?
Narwhals implementation of from_dataframe
and performance benchmark
#2661
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for giving this a go!
I've left a couple of comments
I suspect the .to_list()
calls may be responsible for the slow-down. I'll take a look
Hi @MarcoGorelli , Thanks for already looking at this and for your insights! |
Hi @MarcoGorelli, I investigated the issue, and it appears that the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2661 +/- ##
==========================================
- Coverage 94.17% 94.08% -0.09%
==========================================
Files 141 141
Lines 15579 15591 +12
==========================================
- Hits 14671 14669 -2
- Misses 908 922 +14 ☔ View full report in Codecov by Sentry. |
…om/authierj/darts into feature/add_timeseries_from_polars
To compare the performance of the methods Averaged over 10 runs, the processing times are as follows:
Therefore, As a consequence of this significant result, I will change the implementation of |
…an optional dependency.
The script used for testing the time performance of the methods is the following:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this great PR @authierj 🚀
This will be a great addition to Darts :)
I added a couple of suggestions, mainly on how to further simplify some things here and there, and improve the documentation
narwhals_test_time.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this file and rather have it in the comments. Also, it uses TimeSeries.from_narwhals_dataframe
which is not implemented. So updating it would help :)
nfoursid>=1.0.0 | ||
numpy>=1.19.0,<2.0.0 | ||
pandas>=1.0.5 | ||
polars>=1.0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madtoinou is working currently on adding optional dependencies to our checks. So until he added this, let's remove it from the required dependencies, and adapt the tests to skip (not remove) the polars checks at this time
polars>=1.0.0 |
@@ -569,7 +571,7 @@ def from_csv( | |||
@classmethod | |||
def from_dataframe( | |||
cls, | |||
df: pd.DataFrame, | |||
df: IntoDataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the df description?
df
The DataFrame, or anything which can be converted to a narwhals DataFrame (e.g. pandas.DataFrame,
polars.DataFrame, ...). See the `narwhals documentation
<https://narwhals-dev.github.io/narwhals/api-reference/narwhals/#narwhals.from_native>`_ for more
information.
@@ -1610,32 +1626,20 @@ def pd_dataframe(self, copy=True, suppress_warnings=False) -> pd.DataFrame: | |||
"_s".join((comp_name, str(sample_id))) | |||
for comp_name, sample_id in itertools.product(comp_name, samples) | |||
] | |||
data = self._xa.stack(data=(DIMS[1], DIMS[2])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we can also simplify further to something like below. The goals is to not have so many if/else (especially in return where we currently call "to pandas" 4 times
if not self.is_deterministic:
if not suppress_warnings:
logger.warning(
"You are transforming a stochastic TimeSeries (i.e., contains several samples). "
"The resulting DataFrame is a 2D object with all samples on the columns. "
"If this is not the expected behavior consider calling a function "
"adapted to stochastic TimeSeries like quantile_df()."
)
comp_name = list(self.components)
samples = range(self.n_samples)
columns = [
"_s".join((comp_name, str(sample_id)))
for comp_name, sample_id in itertools.product(comp_name, samples)
]
data = self._xa.stack(data=(DIMS[1], DIMS[2])).values
else:
columns = self._xa.get_index(DIMS[1])
data = self._xa[:, :, 0].values
index = self._time_index
if copy:
columns = columns.copy()
data = data.copy()
index = index.copy()
return pd.DataFrame(data=data, index=index, columns=columns)
) | ||
) | ||
time_index.name = time_col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bring this back?
# BUGFIX : force time-index to be timezone naive as xarray doesn't support it | ||
# pandas.DataFrame loses the tz information if it's not its index | ||
if time_col_vals.dtype.time_zone is not None: | ||
logger.warning( | ||
"The provided Datetime data was associated with a timezone, which is currently not supported " | ||
"by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling " | ||
f"`ts.time_index.tz_localize({time_col_vals.dtype.time_zone})` when exporting the results." | ||
"To plot the series with the right time steps, consider setting the matplotlib.pyplot " | ||
"`rcParams['timezone']` parameter to automatically convert the time axis back to the " | ||
"original timezone." | ||
) | ||
time_col_vals = time_col_vals.dt.replace_time_zone(None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is duplicated here (from below). We can remember the time zone here and adapt further below to simplify things (see my comment below (C1) on what you can do):
# BUGFIX : force time-index to be timezone naive as xarray doesn't support it | |
# pandas.DataFrame loses the tz information if it's not its index | |
if time_col_vals.dtype.time_zone is not None: | |
logger.warning( | |
"The provided Datetime data was associated with a timezone, which is currently not supported " | |
"by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling " | |
f"`ts.time_index.tz_localize({time_col_vals.dtype.time_zone})` when exporting the results." | |
"To plot the series with the right time steps, consider setting the matplotlib.pyplot " | |
"`rcParams['timezone']` parameter to automatically convert the time axis back to the " | |
"original timezone." | |
) | |
time_col_vals = time_col_vals.dt.replace_time_zone(None) | |
# remove and remember time zone here as polars converts to UTC | |
time_zone = time_col_vals.dtype.time_zone | |
if time_zone is not None: | |
time_col_vals = time_col_vals.dt.replace_time_zone(None) | |
time_index = pd.DatetimeIndex(time_col_vals) |
darts/tests/test_timeseries.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice rewriting of the tests :) 🚀
As mentioned in the requirements/core.txt
comment, we'll add the polars tests as part of our optional dependecies soon. For the meantime, could you remove "polars" from the parametrization, and remove the import polars?
As soon as the optional dependency checks are up, we can add the references back in another PR :)
Co-authored-by: Dennis Bader <dennis.bader@gmx.ch>
Co-authored-by: Dennis Bader <dennis.bader@gmx.ch>
…om/authierj/darts into feature/add_timeseries_from_polars
This is very cool and I'm sure will make many users' lives easier! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @authierj , I left a few suggestions and considerations in the from_*
functions! Hope they help :)
if isinstance(value_cols, str): | ||
value_cols = [value_cols] | ||
series_df = df[value_cols] | ||
series_df = df[value_cols] # quite slow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of an edge case, but pandas columns can also be integers, the safest way to deal with that would be:
if isinstance(value_cols, (str, int)):
value_cols = [value_cols]
series_df = df.select(nw.col(*value_cols))
# Temporarily use an integer Index to sort the values, and replace by a | ||
# RangeIndex in `TimeSeries.from_xarray()` | ||
time_index = pd.Index(time_col_vals) | ||
|
||
elif np.issubdtype(time_col_vals.dtype, object): | ||
elif time_col_vals.dtype == nw.String: | ||
# The integer conversion failed; try datetimes | ||
try: | ||
time_index = pd.DatetimeIndex(time_col_vals) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking as I don't know: how would this behave with polars/pyarrow backed series?
As long as time_col_vals
is a narwhals series, you can try to convert it to datetime via:
time_col_vals.str.to_datetime()
we will try to infer the date format
@@ -649,91 +651,113 @@ def from_dataframe( | |||
TimeSeries | |||
A univariate or multivariate deterministic TimeSeries constructed from the inputs. | |||
""" | |||
df = nw.from_native(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider flagging this as
df = nw.from_native(df) | |
df = nw.from_native(df, eager_only=True, pass_through=False) |
The first flag will make sure that the return type is a (eager) DataFrame (mostly for typing purposes) , the second will make sure to raise if the input is not a supported eager frame
raise_log( | ||
ValueError( | ||
"No time column or index found in the DataFrame. `time_col=None` " | ||
"is only supported for pandas DataFrame which is indexed with one of the " | ||
"supported index types: a DatetimeIndex, a RangeIndex, or an integer " | ||
"Index that can be converted into a RangeIndex.", | ||
), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you consider the value to be np.arange(len(df))
or is that too big of an assumption?
|
||
# get time index | ||
if time_col: | ||
if time_col not in df.columns: | ||
raise_log(AttributeError(f"time_col='{time_col}' is not present.")) | ||
|
||
time_index = pd.Index([]) | ||
time_col_vals = df[time_col] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly as above, if the time_col is an integer due to pandas allowing that, the safe route would be to use:
time_col_vals = df.get_column(time_col)
@@ -1001,7 +1028,8 @@ def from_series( | |||
TimeSeries | |||
A univariate and deterministic TimeSeries constructed from the inputs. | |||
""" | |||
df = pd.DataFrame(pd_series) | |||
nw_series = nw.from_native(pd_series, series_only=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly as above, if you want to already raise for non series objects, it's possible to do so by flagging pass_through=False
Agreed @hrzn :) To any dataframe support will be added in another PR. |
Checklist before merging this PR:
Fixes #2635.
Summary
A first draft of
from_dataframe
has been adapted to work with any dataframe. This is done using narwhals and the function is calledfrom_narwhals_dataframe
. In order to test the performance of the method, a filenarwhals_test_time.py
has been added to the pull request.With the latest commits,
from_narwhals_dataframe
is now as fast asfrom_dataframe
.Other Information