
Narwhals implementation of from_dataframe and performance benchmark #2661

Open · wants to merge 30 commits into base: master

Conversation

@authierj (Contributor) commented Jan 31, 2025

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2635.

Summary

A first draft of from_dataframe has been adapted to work with any dataframe. This is done using narwhals, and the function is called from_narwhals_dataframe. To test the method's performance, a file narwhals_test_time.py has been added to the pull request.
With the latest commits, from_narwhals_dataframe is now as fast as from_dataframe.
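
For illustration only, the core idea is to let narwhals wrap whatever dataframe the user passes and then work on a backend-agnostic object. A minimal sketch (the helper name is hypothetical; the PR's actual function is TimeSeries.from_narwhals_dataframe):

import narwhals as nw

def extract_values_and_columns(df):
    # wrap any supported dataframe (pandas, polars, ...) in a narwhals DataFrame
    df = nw.from_native(df, eager_only=True)
    # from here on, the code no longer depends on the input backend
    values = df.to_numpy()  # 2D ndarray of shape (n_rows, n_cols)
    columns = df.columns    # plain list of column names
    return values, columns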

Other Information

@MarcoGorelli left a comment:

thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

@authierj (Contributor, Author) commented:

Hi @MarcoGorelli,

Thanks for already looking at this and for your insights!

@authierj (Contributor, Author) commented Feb 3, 2025

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. However, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. The investigation continues!
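
To isolate the suspect call, a micro-benchmark along these lines can be used (an illustrative sketch only; in the PR the call operates on the selected value columns, so absolute numbers will differ):

import time

import numpy as np
import pandas as pd

series_df = pd.DataFrame(np.random.randn(100_000, 100))

start = time.perf_counter()
_ = series_df.to_numpy()[:, :, np.newaxis]  # the call under suspicion
print(f"to_numpy + newaxis: {time.perf_counter() - start:.4f} s")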

@FBruzzesi (Contributor) left a comment:

Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 91.17647% with 3 lines in your changes missing coverage. Please review.

Project coverage is 94.08%. Comparing base (4de0868) to head (f629089).

Files with missing lines | Patch % | Lines
darts/timeseries.py | 91.17% | 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.17%   94.08%   -0.09%     
==========================================
  Files         141      141              
  Lines       15579    15591      +12     
==========================================
- Hits        14671    14669       -2     
- Misses        908      922      +14     


@authierj (Contributor, Author) commented:

To compare the performance of the methods from_dataframe() (which only accepts pandas DataFrames as input) and from_narwhals_dataframe() (which accepts all kinds of DataFrames), I implemented the script narwhals_test_time.py, see here. This script calls the two functions for a large number of different pandas DataFrames, with shuffled or unshuffled data, varying sizes, indices, and datetime formats.

Averaged over 10 runs, the processing times are as follows:

method | average processing time [s]
from_dataframe() | 10.9718
from_narwhals_dataframe() | 9.8564

Therefore, from_narwhals_dataframe() is 1.1154 seconds faster than from_dataframe(), representing a 10.17% decrease in processing time on average.

As a consequence of this significant result, I will change the implementation of from_dataframe() and also modify from_series() to use the narwhals approach.
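
The comparison above can be reproduced from the JSON files written by the timing script shared below (a sketch, assuming both methods were timed with the script's default --n_iter 100):

import json

with open("data/from_dataframe_avg_time_100_iter.json") as f:
    t_pd = json.load(f)
with open("data/from_narwhals_dataframe_avg_time_100_iter.json") as f:
    t_nw = json.load(f)

total_pd, total_nw = sum(t_pd.values()), sum(t_nw.values())
print(f"difference: {total_pd - total_nw:.4f} s "
      f"({100 * (1 - total_nw / total_pd):.2f}% faster)")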

@authierj authierj marked this pull request as ready for review February 14, 2025 12:22
@authierj (Contributor, Author) commented:

The script used for testing the time performance of the methods is the following:

import argparse
import json
import time
import warnings
from itertools import product

import numpy as np
import pandas as pd
from tqdm import tqdm

from darts.timeseries import TimeSeries

# Suppress all warnings
warnings.filterwarnings("ignore")


def test_from_dataframe(f_name: str):
    return getattr(TimeSeries, f_name)


def create_random_dataframes(
    num_rows: int = 10,
    num_columns: int = 3,
    index: bool = True,
    col_names_given: bool = True,
    start_date: str = "1900-01-01",
    freq: str = "D",
) -> list[list]:
    """
    Create three pandas DataFrames with random data and dates as the index or as a column.

    Parameters:
    - num_rows (int): The number of rows in the DataFrames.
    - num_columns (int): The number of columns in the DataFrames.
    - index (bool): If True, the date is the index of the DataFrame. If False, the date is a column named 'date'.
    - col_names_given (bool): If True, the column names are passed explicitly to the tested method; otherwise None is passed.
    - start_date (str): The start date for the date range.
    - freq (str): The frequency of the date range.

    Returns:
    - list: Three [DataFrame, col_names, time_col] triples, using a datetime, a numpy datetime64, and an integer time axis respectively.
    """
    # Set a random seed for reproducibility
    np.random.seed(42)

    # Generate datetime, numpy datetime64, and integer values for the time axis
    date_values = pd.date_range(start=start_date, periods=num_rows, freq=freq)
    integer_values = list(range(1, num_rows + 1))
    numpy_values = np.array(
        pd.date_range(start=start_date, periods=num_rows, freq=freq),
        dtype="datetime64[D]",
    )

    # Create random data for the DataFrames
    data = {f"col_{i}": np.random.randn(num_rows) for i in range(num_columns)}

    # Create the DataFrames
    df_date = pd.DataFrame(data)
    df_numpy = pd.DataFrame(data)
    df_integer = pd.DataFrame(data)

    if col_names_given:
        col_names = df_date.columns.values
    else:
        col_names = None

    # Set the date as index or as a column based on the index parameter
    if index:
        df_date.index = date_values
        df_numpy.index = numpy_values
        df_integer.index = integer_values
    else:
        df_date["date"] = date_values
        df_numpy["date"] = numpy_values
        df_integer["date"] = integer_values

    if index:
        time_col = None
    else:
        time_col = "date"

    return [
        [df_date, col_names, time_col],
        [df_numpy, col_names, time_col],
        [df_integer, col_names, time_col],
    ]


def test_dataframes() -> list:
    test_config = product(
        [10, 100, 1000, 10000, 100000],
        [100],
        [True, False],
        [True, False],
    )

    dataframes_list = [
        create_random_dataframes(
            num_rows=num_rows,
            num_columns=num_columns,
            index=index,
            col_names_given=col_names_given,
        )
        for num_rows, num_columns, index, col_names_given in test_config
    ]

    return dataframes_list


def calculate_processing_time(
    f_name: str,
    num_iter: int,
    save_path="data/",
):
    df_list = test_dataframes()
    df_func = test_from_dataframe(f_name)

    # Initialize dictionaries to store processing times
    times = {}

    # Initialize the progress bar
    total_iterations = (
        len(df_list) * 2 * 3
    )  # 2 iterations per dataframe configuration, 3 df per config
    progress_bar = tqdm(total=total_iterations, desc="Processing DataFrames")

    for df_config in df_list:
        for df, col_names, time_col in df_config:
            num_rows = len(df)
            dict_entry = str(num_rows)

            for i in range(2):
                # on the second run we shuffle the data
                if i == 1:
                    df = df.sample(frac=1)
                    dict_entry += "_shuffled"

                begin = time.time()
                for _ in range(num_iter):
                    _ = df_func(df, value_cols=col_names, time_col=time_col, freq=None)
                end = time.time()
                timer = (end - begin) / num_iter

                if dict_entry not in times:
                    times[dict_entry] = timer
                else:
                    times[dict_entry] += timer

                # Update the progress bar
                progress_bar.update(1)

    file_name = f_name + "_avg_time_" + str(num_iter) + "_iter.json"

    # Store the average times in a JSON file (the save_path directory must exist)
    with open(save_path + file_name, "w") as f:
        json.dump(times, f, indent=4)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="The function to test and the number of iterations can be passed as arguments."
    )
    parser.add_argument(
        "--f_name", type=str, default="from_dataframe", help="method to time"
    )
    parser.add_argument(
        "--n_iter", type=int, default=100, help="number of function call"
    )

    args = parser.parse_args()

    f_name = args.f_name
    n_iter = args.n_iter

    calculate_processing_time(f_name, n_iter)
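
The script can be invoked from the command line, e.g. (note that the data/ output directory must exist):

python narwhals_test_time.py --f_name from_dataframe --n_iter 100
python narwhals_test_time.py --f_name from_narwhals_dataframe --n_iter 100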

@dennisbader (Collaborator) left a comment:

Thanks for this great PR @authierj 🚀

This will be a great addition to Darts :)

I added a couple of suggestions, mainly on how to further simplify some things here and there, and improve the documentation

Collaborator review comment:

remove this file and rather have it in the comments. Also, it uses TimeSeries.from_narwhals_dataframe which is not implemented. So updating it would help :)

nfoursid>=1.0.0
numpy>=1.19.0,<2.0.0
pandas>=1.0.5
polars>=1.0.0
Collaborator review comment (on requirements/core.txt):

@madtoinou is currently working on adding optional dependencies to our checks. Until that lands, let's remove polars from the required dependencies, and adapt the tests to skip (not remove) the polars checks for now; one possible pattern is sketched below.

Suggested change
- polars>=1.0.0
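
For the skipping part, one possible pattern (a sketch; the test name and body are hypothetical, not the PR's actual tests):

import pytest

try:
    import polars as pl
except ImportError:  # polars is optional
    pl = None

@pytest.mark.skipif(pl is None, reason="polars is not installed")
def test_from_dataframe_with_polars():
    df = pl.DataFrame({"time": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
    # a real test would build a TimeSeries from df and check the result
    assert df.shape == (3, 2)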

@@ -569,7 +571,7 @@ def from_csv(
     @classmethod
     def from_dataframe(
         cls,
-        df: pd.DataFrame,
+        df: IntoDataFrame,
Collaborator review comment:

unfortunately, this doesn't render nicely in the documentation 😢

[screenshot: the IntoDataFrame type hint as rendered in the documentation]

Collaborator review comment:

Can you update the df description?

        df
            The DataFrame, or anything which can be converted to a narwhals DataFrame (e.g. pandas.DataFrame,
            polars.DataFrame, ...). See the `narwhals documentation
            <https://narwhals-dev.github.io/narwhals/api-reference/narwhals/#narwhals.from_native>`_ for more
            information.

@@ -1610,32 +1626,20 @@ def pd_dataframe(self, copy=True, suppress_warnings=False) -> pd.DataFrame:
                 "_s".join((comp_name, str(sample_id)))
                 for comp_name, sample_id in itertools.product(comp_name, samples)
             ]
             data = self._xa.stack(data=(DIMS[1], DIMS[2]))
Collaborator review comment:

Here we can also simplify further to something like below. The goal is to avoid having so many if/else branches (especially in the return, where we currently call "to pandas" 4 times):

        if not self.is_deterministic:
            if not suppress_warnings:
                logger.warning(
                    "You are transforming a stochastic TimeSeries (i.e., contains several samples). "
                    "The resulting DataFrame is a 2D object with all samples on the columns. "
                    "If this is not the expected behavior consider calling a function "
                    "adapted to stochastic TimeSeries like quantile_df()."
                )

            comp_name = list(self.components)
            samples = range(self.n_samples)
            columns = [
                "_s".join((comp_name, str(sample_id)))
                for comp_name, sample_id in itertools.product(comp_name, samples)
            ]
            data = self._xa.stack(data=(DIMS[1], DIMS[2])).values
        else:
            columns = self._xa.get_index(DIMS[1])
            data = self._xa[:, :, 0].values
        index = self._time_index
        
        if copy:
            columns = columns.copy()
            data = data.copy()
            index = index.copy()
        
        return pd.DataFrame(data=data, index=index, columns=columns)

)
)
time_index.name = time_col
Collaborator review comment:

bring this back?

Comment on lines +700 to +711
# BUGFIX : force time-index to be timezone naive as xarray doesn't support it
# pandas.DataFrame loses the tz information if it's not its index
if time_col_vals.dtype.time_zone is not None:
logger.warning(
"The provided Datetime data was associated with a timezone, which is currently not supported "
"by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling "
f"`ts.time_index.tz_localize({time_col_vals.dtype.time_zone})` when exporting the results."
"To plot the series with the right time steps, consider setting the matplotlib.pyplot "
"`rcParams['timezone']` parameter to automatically convert the time axis back to the "
"original timezone."
)
time_col_vals = time_col_vals.dt.replace_time_zone(None)
Collaborator review comment:

This is duplicated here (from below). We can remember the time zone here and adapt further below to simplify things (see my comment below (C1) on what you can do):

Suggested change
-        # BUGFIX : force time-index to be timezone naive as xarray doesn't support it
-        # pandas.DataFrame loses the tz information if it's not its index
-        if time_col_vals.dtype.time_zone is not None:
-            logger.warning(
-                "The provided Datetime data was associated with a timezone, which is currently not supported "
-                "by xarray. To avoid unexpected behaviour, the tz information was removed. Consider calling "
-                f"`ts.time_index.tz_localize({time_col_vals.dtype.time_zone})` when exporting the results."
-                "To plot the series with the right time steps, consider setting the matplotlib.pyplot "
-                "`rcParams['timezone']` parameter to automatically convert the time axis back to the "
-                "original timezone."
-            )
-            time_col_vals = time_col_vals.dt.replace_time_zone(None)
+        # remove and remember time zone here as polars converts to UTC
+        time_zone = time_col_vals.dtype.time_zone
+        if time_zone is not None:
+            time_col_vals = time_col_vals.dt.replace_time_zone(None)
+        time_index = pd.DatetimeIndex(time_col_vals)
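
For illustration, a small sketch of the behaviour being handled here (narwhals exposes the zone on the Datetime dtype, and dt.replace_time_zone(None) drops it):

import narwhals as nw
import pandas as pd

s = pd.Series(pd.date_range("2020-01-01", periods=3, freq="D", tz="CET"))
time_col_vals = nw.from_native(s, series_only=True)
print(time_col_vals.dtype.time_zone)  # CET
time_col_vals = time_col_vals.dt.replace_time_zone(None)  # make tz-naive
print(time_col_vals.dtype.time_zone)  # None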

Collaborator review comment:

nice rewriting of the tests :) 🚀

As mentioned in the requirements/core.txt comment, we'll add the polars tests as part of our optional dependencies soon. In the meantime, could you remove "polars" from the parametrization, and remove the import polars?

As soon as the optional dependency checks are up, we can add the references back in another PR :)

@hrzn (Contributor) commented Feb 22, 2025

This is very cool and I'm sure will make many users' lives easier!
I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

@FBruzzesi (Contributor) left a comment:

Hey @authierj, I left a few suggestions and considerations in the from_* functions! Hope they help :)

Comment on lines 660 to +662
         if isinstance(value_cols, str):
             value_cols = [value_cols]
-        series_df = df[value_cols]
+        series_df = df[value_cols]  # quite slow
Contributor review comment:

A bit of an edge case, but pandas columns can also be integers, the safest way to deal with that would be:

if isinstance(value_cols, (str, int)):
    value_cols = [value_cols]
series_df = df.select(nw.col(*value_cols))

             # Temporarily use an integer Index to sort the values, and replace by a
             # RangeIndex in `TimeSeries.from_xarray()`
             time_index = pd.Index(time_col_vals)

-        elif np.issubdtype(time_col_vals.dtype, object):
+        elif time_col_vals.dtype == nw.String:
             # The integer conversion failed; try datetimes
             try:
                 time_index = pd.DatetimeIndex(time_col_vals)
Contributor review comment:

Asking as I don't know: how would this behave with polars/pyarrow backed series?

As long as time_col_vals is a narwhals series, you can try to convert it to datetime via:

time_col_vals.str.to_datetime()

We will try to infer the date format.
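
A minimal example of that conversion (a sketch with a pandas-backed series; format inference is handled by narwhals):

import narwhals as nw
import pandas as pd

s = nw.from_native(pd.Series(["2020-01-01", "2020-01-02"]), series_only=True)
print(s.str.to_datetime())  # date format is inferred when not passed explicitly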

@@ -649,91 +651,113 @@ def from_dataframe(
         TimeSeries
             A univariate or multivariate deterministic TimeSeries constructed from the inputs.
         """
+        df = nw.from_native(df)
Contributor review comment:

Consider flagging this as

Suggested change
-        df = nw.from_native(df)
+        df = nw.from_native(df, eager_only=True, pass_through=False)

The first flag makes sure that the return type is an (eager) DataFrame (mostly for typing purposes); the second makes sure to raise if the input is not a supported eager frame.
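
A quick illustration of the two flags (a sketch; the LazyFrame line is commented out because it raises):

import narwhals as nw
import pandas as pd
import polars as pl

# fine: a pandas DataFrame is a supported eager frame
df = nw.from_native(pd.DataFrame({"a": [1, 2]}), eager_only=True, pass_through=False)

# raises TypeError: a polars LazyFrame is not an eager DataFrame
# nw.from_native(pl.LazyFrame({"a": [1, 2]}), eager_only=True, pass_through=False)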

Comment on lines +722 to +729
raise_log(
ValueError(
"No time column or index found in the DataFrame. `time_col=None` "
"is only supported for pandas DataFrame which is indexed with one of the "
"supported index types: a DatetimeIndex, a RangeIndex, or an integer "
"Index that can be converted into a RangeIndex.",
),
)
Contributor review comment:

Should you consider the value to be np.arange(len(df)) or is that too big of an assumption?


         # get time index
         if time_col:
             if time_col not in df.columns:
                 raise_log(AttributeError(f"time_col='{time_col}' is not present."))

             time_index = pd.Index([])
             time_col_vals = df[time_col]
Contributor review comment:

Similarly as above, if the time_col is an integer due to pandas allowing that, the safe route would be to use:

time_col_vals = df.get_column(time_col)

@@ -1001,7 +1028,8 @@ def from_series(
         TimeSeries
             A univariate and deterministic TimeSeries constructed from the inputs.
         """
-        df = pd.DataFrame(pd_series)
+        nw_series = nw.from_native(pd_series, series_only=True)
Contributor review comment:

Similarly to the above, if you want to raise early for non-series objects, you can do so by setting pass_through=False.

@dennisbader (Collaborator) commented:

> This is very cool and I'm sure will make many users' lives easier! I think it might be worth updating the docs / quickstart to maybe showcase an example for creating/exporting from/to polars?

Agreed @hrzn :) Export-to-any-dataframe support will be added in another PR.

Labels: feature request, improvement
Projects: Status: In review
Development: Successfully merging this pull request may close issue "Add TimeSeries.from_polars"
5 participants