Add nw_df.to(backend, index). A generic method from Narwhals to Pandas, Polars, and Arrow with Index Handling Options #2056

Open
authierj opened this issue Feb 20, 2025 · 8 comments
Labels: enhancement, needs discussion

Comments

@authierj

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

We are trying to narwhalify darts.

We already implemented the method TimeSeries.from_dataframe() with narwhals. The next step would be to go from TimeSeries.pd_dataframe() to a more generic TimeSeries.to_dataframe(), enabling users to choose their preferred backend. The issue is described here

We store our TimeSeries data in an xarray, where the time information is stored as a pd.DatetimeIndex. Right now we go from the xarray to a pandas dataframe, which is trivial.
However, going dataframe-agnostic, handling the DatetimeIndex becomes problematic for two reasons:

  1. Initializing the narwhals dataframe with .from_numpy(), we lose all the information from the DatetimeIndex (like the frequency) except the dates themselves.
  2. Initializing the narwhals dataframe with from_native(), the index information will be lost if converted to Polars.

Please describe the purpose of the new feature or describe the problem to solve.

To solve 2), we propose the following:

A generic method nw_df.to(backend, keep_index)
backend = the dataframe library to convert the dataframe to (pandas, Polars, PyArrow)
keep_index = [True, False] whether the index contains important information and should therefore be reset into a column, as in pandas, before being converted to Polars or PyArrow.

Suggest a solution if possible.

A wrapper around to_pandas, to_polars, and to_arrow, plus a smart way to handle the index.

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

@FBruzzesi added the enhancement and needs discussion labels on Feb 20, 2025
@dangotbanned
Member

@authierj could you provide some examples of the information that is lost - and what you'd expect the result to produce?

@dangotbanned
Member

Side Note

I really don't like the idea of a method named .to() - maybe .to_backend()?

@authierj
Author

authierj commented Feb 21, 2025

@authierj could you provide some examples of the information that is lost - and what you'd expect the result to produce?

@dangotbanned thank you for taking a look at this. Here is a brief example:

import narwhals as nw
import numpy as np
import pandas as pd

# Create a date range
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')

# Create some sample data
data = np.random.randn(len(date_range), 2)

# Create a DataFrame with the date range as the index
df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2'])

# Transformed to narwhals
nw_df = nw.from_native(df)

# do some processing
# ...

# Transformed to polars, date_range info is lost
pl_df = nw_df.to_polars()

Using nw.maybe_reset_index(nw_df) would not solve this problem either, as it just resets the index without storing the previous index in a column.

Ideally, I would expect to have a behavior similar to pl_df = pl.from_pandas(df.reset_index()).

@dangotbanned
Member

dangotbanned commented Feb 21, 2025

#2056 (comment)

Thanks @authierj that was helpful to understand your use case better.

Taking a step back from adding a new API for a moment, would something like adding these index handling options to methods like nw.DataFrame.to_polars() also work for you?

I can see the value in having nw.DataFrame.to_backend() generally - but reducing the scope a bit to just exposing unified parameters from these two might be an easy win and cover the majority of eager cases (that would utilise the index):

@authierj
Author

Hi @dangotbanned,

Yes, I think adding the index handling options would already be a great start!

@FBruzzesi
Member

FBruzzesi commented Feb 21, 2025

Hey @authierj thanks for reporting this. I am a bit skeptical about allowing special implicit handling for the pandas index.

I think we are considering at least two options here, and I have open questions for both:

  • nw.maybe_reset_index without dropping the index:

    • For me this would be the "safest" way forward
    • At the same time, what should happen for backends other than pandas? What I mean is that we would end up with dataframes with a different number of columns when we reset the index without dropping it.
  • nw.DataFrame.to_<polars|arrow>() with index handling:

    • I have the same question, namely pandas vs. others: we would end up with a different dataframe schema

In both cases, narwhals provides a few tools for writing a pandas-specific path, which gives you the flexibility to handle such cases without depending on pandas itself (happy to expand with an example). These exist precisely so that narwhals can be integrated into codebases that were built on pandas from the get-go.

More generally, as narwhals tries to be a subset of the polars API, heavy use of the index is not expected, since Polars has no index or multi-index by design.

I will take a better look at the PR in darts to see if we can manage to lower the index wrangling dependency without loss in performance 😇 Work has been particularly hectic recently and I couldn't do that

@dennisbader

Hi all, just chipping in here to give some more insight into why we thought this could be useful for narwhals.
Thanks a lot btw for all the help on our Darts PR and implementing the changes in narwhals @FBruzzesi, @MarcoGorelli, @dangotbanned :)

For us, the .to_backend(backend) would be quite useful:

  • Thanks to narwhals, in Darts we will have support for transforming any DataFrame[Any] (DF[Any]) into a Darts TimeSeries (TS). Under the hood we perform DF[Any] -> DF[NW] -> TS (NW for Narwhals).
  • Now, it would be awesome to support the inverse as well. E.g. going from a TimeSeries to any DataFrame (TS -> DF[NW] -> DF[Any]). For this it would be helpful if narwhals had a helper that allows converting DF[NW] -> DF[Any] by specifying just backend (.to_backend(backend)). Right now we would have to create our own mapping, explicitly call the conversion DF[NW].to_X(), and keep track of any added to_X() from Narwhals in the future.

@FBruzzesi
Member

FBruzzesi commented Feb 22, 2025

Hey @dennisbader thanks for your time and clarification.

I had some time to catch up on what's happening in Darts and I am connecting the dots, so bear with me 🙈

From what I can see, converting from dataframe to timeseries has no big blocker and will be finalized in darts#2661 without additional features from narwhals.

Conversely, converting from timeseries to dataframe is a bit more challenging. I cannot say that I am familiar or comfortable with xarray, but as a first impression the only doubt would be how to place the (time) index in the output dataframe.

Thinking out loud here (and I can expand the discussion in darts#2689), but some design like the following might require no changes:

from typing import Literal

import narwhals as nw


class TimeSeries:
    ...

    def to_dataframe(
        self,
        backend: Literal["pandas", "polars", "pyarrow"],
        *,
        time_as_index: bool = False,
    ):
        # an index only exists in pandas, so reject the option for other backends
        if time_as_index and backend != "pandas":
            raise ...

        columns = ...
        data = ...
        time_index = ...
        time_col_name = ...

        if time_as_index:
            # special path for pandas with index
            import pandas as pd

            return pd.DataFrame(data=data, index=time_index, columns=columns)

        data_ = {
            time_col_name: time_index,  # set time_index as left-most column
            **{col: data[:, idx] for idx, col in enumerate(columns)},
        }
        return nw.from_dict(data_, backend=backend).to_native()

Notice that:

  • I am specifying the list of backends, but that would work with anything that Narwhals supports
  • We currently don't support from_dict to initialize lazy dataframes, although in principle this would be possible in the future.
  • The index is quite unique to pandas (and pandas-like) dataframe libraries. I am providing an example of how you might expose setting the index directly in the API, but you can decide on a different approach, e.g.:
    • return a tuple of time_index series, data dataframe
    • delegate to the user to explicitly call frame.set_index(time_col_name) in case the return type is pandas and they want to have such a series as the index.

Is the approach above too far from what's needed in practice?
