
Narwhals implementation of from_dataframe and performance benchmark #2661

Open · wants to merge 30 commits into base: master

Conversation

@authierj (Contributor) commented Jan 31, 2025

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2635.

Summary

A first draft of from_dataframe has been adapted to work with any dataframe. This is done using narwhals, and the function is called from_narwhals_dataframe. To test the method's performance, a file narwhals_test_time.py has been added to the pull request.
With the latest commits, from_narwhals_dataframe is now as fast as from_dataframe.
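
For illustration only, the core idea is to let narwhals wrap whatever dataframe the user passes and then work on a backend-agnostic object. A minimal sketch (the helper name is hypothetical; the PR's actual function is TimeSeries.from_narwhals_dataframe):

import narwhals as nw

def extract_values_and_columns(df):
    # wrap any supported dataframe (pandas, polars, ...) in a narwhals DataFrame
    df = nw.from_native(df, eager_only=True)
    # from here on, the code no longer depends on the input backend
    values = df.to_numpy()  # 2D ndarray of shape (n_rows, n_cols)
    columns = df.columns    # plain list of column names
    return values, columns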

Other Information

@MarcoGorelli left a comment:

thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

@authierj (Contributor, Author) commented:

Hi @MarcoGorelli,

Thanks for already looking at this and for your insights!

@authierj (Contributor, Author) commented Feb 3, 2025

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. However, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. The investigation continues!
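
To isolate the suspect call, a micro-benchmark along these lines can be used (an illustrative sketch only; in the PR the call operates on the selected value columns, so absolute numbers will differ):

import time

import numpy as np
import pandas as pd

series_df = pd.DataFrame(np.random.randn(100_000, 100))

start = time.perf_counter()
_ = series_df.to_numpy()[:, :, np.newaxis]  # the call under suspicion
print(f"to_numpy + newaxis: {time.perf_counter() - start:.4f} s")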

@FBruzzesi (Contributor) left a comment:

Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 91.17647% with 3 lines in your changes missing coverage. Please review.

Project coverage is 94.08%. Comparing base (4de0868) to head (f629089).

Files with missing lines | Patch % | Lines
darts/timeseries.py | 91.17% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.08%   -0.09%     
==========================================
  Files         141      141              
  Lines       15579    15591      +12     
==========================================
- Hits        14671    14669       -2     
- Misses        908      922      +14     


@authierj (Contributor, Author) commented:

To compare the performance of the methods from_dataframe() (which only accepts pandas DataFrames as input) and from_narwhals_dataframe() (which accepts all kinds of DataFrames), I implemented the script narwhals_test_time.py, see here. This script calls the two functions for a large number of different pandas DataFrames, with shuffled or unshuffled data, varying sizes, indices, and datetime formats.

Averaged over 10 runs, the processing times are as follows:

method | average processing time [s]
from_dataframe() | 10.9718
from_narwhals_dataframe() | 9.8564

Therefore, from_narwhals_dataframe() is 1.1154 seconds faster than from_dataframe(), representing a 10.17% decrease in processing time on average.

As a consequence of this significant result, I will change the implementation of from_dataframe() and also modify from_series() to use the narwhals approach.
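
The comparison above can be reproduced from the JSON files written by the timing script shared below (a sketch, assuming both methods were timed with the script's default --n_iter 100):

import json

with open("data/from_dataframe_avg_time_100_iter.json") as f:
    t_pd = json.load(f)
with open("data/from_narwhals_dataframe_avg_time_100_iter.json") as f:
    t_nw = json.load(f)

total_pd, total_nw = sum(t_pd.values()), sum(t_nw.values())
print(f"difference: {total_pd - total_nw:.4f} s "
      f"({100 * (1 - total_nw / total_pd):.2f}% faster)")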

@authierj authierj marked this pull request as ready for review February 14, 2025 12:22
@authierj (Contributor, Author) commented:

The script used for testing the time performance of the methods is the following:

import argparse
import json
import time
import warnings
from itertools import product

import numpy as np
import pandas as pd
from tqdm import tqdm

from darts.timeseries import TimeSeries

# Suppress all warnings
warnings.filterwarnings("ignore")


def test_from_dataframe(f_name: str):
    return getattr(TimeSeries, f_name)


def create_random_dataframes(
    num_rows: int = 10,
    num_columns: int = 3,
    index: bool = True,
    col_names_given: bool = True,
    start_date: str = "1900-01-01",
    freq: str = "D",
) -> list[list]:
    """
    Create three pandas DataFrames with random data and dates as the index or as a column.

    Parameters:
    - num_rows (int): The number of rows in the DataFrames.
    - num_columns (int): The number of columns in the DataFrames.
    - index (bool): If True, the date is the index of the DataFrame. If False, the date is a column named 'date'.
    - col_names_given (bool): If True, the column names are passed explicitly to the tested method; otherwise None is passed.
    - start_date (str): The start date for the date range.
    - freq (str): The frequency of the date range.

    Returns:
    - list: Three [DataFrame, col_names, time_col] triples, using a datetime, a numpy datetime64, and an integer time axis respectively.
    """
    # Set a random seed for reproducibility
    np.random.seed(42)

    # Generate datetime, numpy datetime64, and integer values for the time axis
    date_values = pd.date_range(start=start_date, periods=num_rows, freq=freq)
    integer_values = list(range(1, num_rows + 1))
    numpy_values = np.array(
        pd.date_range(start=start_date, periods=num_rows, freq=freq),
        dtype="datetime64[D]",
    )

    # Create random data for the DataFrames
    data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_columns)}

    # Create the DataFrames
    df_date = pd.DataFrame(data)
    df_numpy = pd.DataFrame(data)
    df_integer = pd.DataFrame(data)

    if col_names_given:
        col_names = df_date.columns.values
    else:
        col_names = None

    # Set the date as index or as a column based on the index parameter
    if index:
        df_date.index = date_values
        df_numpy.index = numpy_values
        df_integer.index = integer_values
    else:
        df_date["date"] = date_values
        df_numpy["date"] = numpy_values
        df_integer["date"] = integer_values

    if index:
        time_col = None
    else:
        time_col = "date"

    return [
        [df_date, col_names, time_col],
        [df_numpy, col_names, time_col],
        [df_integer, col_names, time_col],
    ]


def test_dataframes() -> list:
    test_config = product(
        [10, 100, 1000, 10000, 100000],
        [100],
        [True, False],
        [True, False],
    )

    dataframes_list = [
        create_random_dataframes(
            num_rows=num_rows,
            num_columns=num_columns,
            index=index,
            col_names_given=col_names_given,
        )
        for num_rows, num_columns, index, col_names_given in test_config
    ]

    return dataframes_list


def calculate_processing_time(
    f_name: str,
    num_iter: int,
    save_path="data/",
):
    df_list = test_dataframes()
    df_func = test_from_dataframe(f_name)

    # Initialize dictionaries to store processing times
    times = {}

    # Initialize the progress bar
    total_iterations = (
        len(df_list) * 2 * 3
    )  # 2 iterations per dataframe configuration, 3 df per config
    progress_bar = tqdm(total=total_iterations, desc="Processing DataFrames")

    for df_config in df_list:
        for df, col_names, time_col in df_config:
            num_rows = len(df)
            dict_entry = str(num_rows)

            for i in range(2):
                # on the second run we shuffle the data
                if i == 1:
                    df = df.sample(frac=1)
                    dict_entry += "_shuffled"

                begin = time.time()
                for _ in range(num_iter):
                    _ = df_func(df, value_cols=col_names, time_col=time_col, freq=None)
                end = time.time()
                timer = (end - begin) / num_iter

                if dict_entry not in times:
                    times[dict_entry] = timer
                else:
                    times[dict_entry] += timer

                # Update the progress bar
                progress_bar.update(1)

    file_name = f_name + "_avg_time_" + str(num_iter) + "_iter.json"

    # Store the average times in a JSON file (the save_path directory must exist)
    with open(save_path + file_name, "w") as f:
        json.dump(times, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="The function to test and the number of iterations can be passed as arguments."
    )
    parser.add_argument(
        "--f_name", type=str, default="from_dataframe", help="method to time"
    )
    parser.add_argument(
        "--n_iter", type=int, default=100, help="number of function call"
    )

    args = parser.parse_args()

    f_name = args.f_name
    n_iter = args.n_iter

    calculate_processing_time(f_name, n_iter)
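
The script can be invoked from the command line, e.g. (note that the data/ output directory must exist):

python narwhals_test_time.py --f_name from_dataframe --n_iter 100
python narwhals_test_time.py --f_name from_narwhals_dataframe --n_iter 100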

@dennisbader (Collaborator) left a comment:

Thanks for this great PR @authierj 🚀

This will be a great addition to Darts :)

I added a couple of suggestions, mainly on how to further simplify some things here and there, and improve the documentation

Collaborator review comment:

remove this file and rather have it in the comments. Also, it uses TimeSeries.from_narwhals_dataframe which is not implemented. So updating it would help :)

nfoursid>=1.0.0
numpy>=1.19.0,<2.0.0
pandas>=1.0.5
polars>=1.0.0
Collaborator review comment (on requirements/core.txt):

@madtoinou is currently working on adding optional dependencies to our checks. Until that lands, let's remove polars from the required dependencies, and adapt the tests to skip (not remove) the polars checks for now; one possible pattern is sketched below.

Suggested change
- polars>=1.0.0
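
For the skipping part, one possible pattern (a sketch; the test name and body are hypothetical, not the PR's actual tests):

import pytest

try:
    import polars as pl
except ImportError:  # polars is optional
    pl = None

@pytest.mark.skipif(pl is None, reason="polars is not installed")
def test_from_dataframe_with_polars():
    df = pl.DataFrame({"time": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    # a real test would build a TimeSeries from df and check the result
    assert df.shape == (3, 2)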

@@ -569,7 +571,7 @@ def from_csv(
     @classmethod
     def from_dataframe(
         cls,
-        df: pd.DataFrame,
+        df: IntoDataFrame,
Collaborator review comment:

unfortunately, this doesn't render nicely in the documentation 😢

[screenshot: the IntoDataFrame type hint as rendered in the documentation]

Collaborator review comment:

Can you update the df description?

        df
            The DataFrame, or anything which can be converted to a narwhals DataFrame (e.g. pandas.DataFrame,
            polars.DataFrame, ...). See the `narwhals documentation
            <https://narwhals-dev.github.io/narwhals/api-reference/narwhals/#narwhals.from_native>`_ for more
            information.

@@ -1610,32 +1626,20 @@ def pd_dataframe(self, copy=True, suppress_warnings=False) -> pd.DataFrame:
                 "_s".join((comp_name, str(sample_id)))
                 for comp_name, sample_id in itertools.product(comp_name, samples)
             ]
             data = self._xa.stack(data=(DIMS[1], DIMS[2]))
Collaborator review comment:

Here we can also simplify further to something like below. The goal is to avoid having so many if/else branches (especially in the return, where we currently call "to pandas" 4 times):

        if not self.is_deterministic:
            if not suppress_warnings:
                logger.warning(
                    "You are transforming a stochastic TimeSeries (i.e., contains several samples). "
                    "The resulting DataFrame is a 2D object with all samples on the columns. "
                    "If this is not the expected behavior consider calling a function "
                    "adapted to stochastic TimeSeries like quantile_df()."
                )

            comp_name = list(self.components)
            samples = range(self.n_samples)
            columns = [
                "_s".join((comp_name, str(sample_id)))
                for comp_name, sample_id in itertools.product(comp_name, samples)
            ]
            data = self._xa.stack(data=(DIMS[1], DIMS[2])).values
        else:
            columns = self._xa.get_index(DIMS[1])
            data = self._xa[:, :, 0].values
        index = self._time_index
        
        if copy:
            columns = columns.copy()
            data = data.copy()
            index = index.copy()
        
        return pd.DataFrame(data=data, index=index, columns=columns)

)
)
time_index.name = time_col
Collaborator review comment:

bring this back?

Comment on lines +700 to +711
# BUGFIX : force time-index to be timezone naive as xarray doesn't support it
# pandas.DataFrame loses the tz information if it's not its index
if time_col_vals.dtype.time_zone is not None:
logger.warning(
"The provided Datetime data was associated with a timezone, which is currently not supported "
"by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling "
f"`ts.time_index.tz_localize({time_col_vals.dtype.time_zone})` when exporting the results."
"To plot the series with the right time steps, consider setting the matplotlib.pyplot "
"`rcParams['timezone']` parameter to automatically convert the time axis back to the "
"original timezone."
)
time_col_vals = time_col_vals.dt.replace_time_zone(None)
Collaborator review comment:

This is duplicated here (from below). We can remember the time zone here and adapt further below to simplify things (see my comment below (C1) on what you can do):

Suggested change
-        # BUGFIX : force time-index to be timezone naive as xarray doesn't support it
-        # pandas.DataFrame loses the tz information if it's not its index
-        if time_col_vals.dtype.time_zone is not None:
-            logger.warning(
-                "The provided Datetime data was associated with a timezone, which is currently not supported "
-                "by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling "
-                f"`ts.time_index.tz_localize({time_col_vals.dtype.time_zone})` when exporting the results."
-                "To plot the series with the right time steps, consider setting the matplotlib.pyplot "
-                "`rcParams['timezone']` parameter to automatically convert the time axis back to the "
-                "original timezone."
-            )
-            time_col_vals = time_col_vals.dt.replace_time_zone(None)
+        # remove and remember time zone here as polars converts to UTC
+        time_zone = time_col_vals.dtype.time_zone
+        if time_zone is not None:
+            time_col_vals = time_col_vals.dt.replace_time_zone(None)
+        time_index = pd.DatetimeIndex(time_col_vals)
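
For illustration, a small sketch of the behaviour being handled here (narwhals exposes the zone on the Datetime dtype, and dt.replace_time_zone(None) drops it):

import narwhals as nw
import pandas as pd

s = pd.Series(pd.date_range("2020-01-01", periods=3, freq="D", tz="CET"))
time_col_vals = nw.from_native(s, series_only=True)
print(time_col_vals.dtype.time_zone)  # CET
time_col_vals = time_col_vals.dt.replace_time_zone(None)  # make tz-naive
print(time_col_vals.dtype.time_zone)  # None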

Collaborator review comment:

nice rewriting of the tests :) 🚀

As mentioned in the requirements/core.txt comment, we'll add the polars tests as part of our optional dependencies soon. In the meantime, could you remove "polars" from the parametrization, and remove the import polars?

As soon as the optional dependency checks are up, we can add the references back in another PR :)

@hrzn (Contributor) commented Feb 22, 2025

This is very cool and I'm sure will make many users' lives easier!
I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

@FBruzzesi (Contributor) left a comment:

Hey @authierj, I left a few suggestions and considerations in the from_* functions! Hope they help :)

Comment on lines 660 to +662
         if isinstance(value_cols, str):
             value_cols = [value_cols]
-        series_df = df[value_cols]
+        series_df = df[value_cols]  # quite slow
Contributor review comment:

A bit of an edge case, but pandas columns can also be integers, the safest way to deal with that would be:

if isinstance(value_cols, (str, int)):
    value_cols = [value_cols]
series_df = df.select(nw.col(*value_cols))

             # Temporarily use an integer Index to sort the values, and replace by a
             # RangeIndex in `TimeSeries.from_xarray()`
             time_index = pd.Index(time_col_vals)

-        elif np.issubdtype(time_col_vals.dtype, object):
+        elif time_col_vals.dtype == nw.String:
             # The integer conversion failed; try datetimes
             try:
                 time_index = pd.DatetimeIndex(time_col_vals)
Contributor review comment:

Asking as I don't know: how would this behave with polars/pyarrow backed series?

As long as time_col_vals is a narwhals series, you can try to convert it to datetime via:

time_col_vals.str.to_datetime()

We will try to infer the date format.
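
A minimal example of that conversion (a sketch with a pandas-backed series; format inference is handled by narwhals):

import narwhals as nw
import pandas as pd

s = nw.from_native(pd.Series(["2020-01-01", "2020-01-02"]), series_only=True)
print(s.str.to_datetime())  # date format is inferred when not passed explicitly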

@@ -649,91 +651,113 @@ def from_dataframe(
         TimeSeries
             A univariate or multivariate deterministic TimeSeries constructed from the inputs.
         """
+        df = nw.from_native(df)
Contributor review comment:

Consider flagging this as

Suggested change
-        df = nw.from_native(df)
+        df = nw.from_native(df, eager_only=True, pass_through=False)

The first flag makes sure that the return type is an (eager) DataFrame (mostly for typing purposes); the second makes sure to raise if the input is not a supported eager frame.
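
A quick illustration of the two flags (a sketch; the LazyFrame line is commented out because it raises):

import narwhals as nw
import pandas as pd
import polars as pl

# fine: a pandas DataFrame is a supported eager frame
df = nw.from_native(pd.DataFrame({"a": [1, 2]}), eager_only=True, pass_through=False)

# raises TypeError: a polars LazyFrame is not an eager DataFrame
# nw.from_native(pl.LazyFrame({"a": [1, 2]}), eager_only=True, pass_through=False)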

Comment on lines +722 to +729
raise_log(
ValueError(
"No time column or index found in the DataFrame. `time_col=None` "
"is only supported for pandas DataFrame which is indexed with one of the "
"supported index types: a DatetimeIndex, a RangeIndex, or an integer "
"Index that can be converted into a RangeIndex.",
),
)
Contributor review comment:

Should you consider the value to be np.arange(len(df)) or is that too big of an assumption?


         # get time index
         if time_col:
             if time_col not in df.columns:
                 raise_log(AttributeError(f"time_col='{time_col}' is not present."))

             time_index = pd.Index([])
             time_col_vals = df[time_col]
Contributor review comment:

Similarly as above, if the time_col is an integer due to pandas allowing that, the safe route would be to use:

time_col_vals = df.get_column(time_col)

@@ -1001,7 +1028,8 @@ def from_series(
         TimeSeries
             A univariate and deterministic TimeSeries constructed from the inputs.
         """
-        df = pd.DataFrame(pd_series)
+        nw_series = nw.from_native(pd_series, series_only=True)
Contributor review comment:

Similarly to the above, if you want to raise early for non-series objects, you can do so by setting pass_through=False.

@dennisbader (Collaborator) commented:

> This is very cool and I'm sure will make many users' lives easier! I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Agreed @hrzn :) Export-to-any-dataframe support will be added in another PR.

Labels: feature request, improvement
Projects: Status: In review
Development: Successfully merging this pull request may close issue "Add TimeSeries.from_polars"
5 participants