Add nw_df.to(backend, index). A generic method from Narwhals to Pandas, Polars, and Arrow with Index Handling Options #2056
Comments
@authierj could you provide some examples of the information that is lost - and what you'd expect the result to produce? |
Side note: I really don't like the idea of a method named
@dangotbanned thank you for taking a look at this. Here is a brief example:
import numpy as np
import pandas as pd
import narwhals as nw
# Create a date range
date_range = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
# Create some sample data
data = np.random.randn(len(date_range), 2)
# Create a DataFrame with the date range as the index
df = pd.DataFrame(data, index=date_range, columns=['Column1', 'Column2'])
# Transformed to narwhals
nw_df = nw.from_native(df)
# do some processing
# ...
# Transformed to polars, date_range info is lost
pl_df = nw_df.to_polars()
And using Ideally, I would expect to have a behavior similar to |
Thanks @authierj that was helpful to understand your use case better. Taking a step back from adding a new API for a moment, would something like adding these index handling options to methods like I can see the value in having |
Hi @dangotbanned, Yes, I think adding the index handling options would already be a great start! |
Hey @authierj thanks for reporting this. I am a bit skeptical about allowing special implicit handling of the pandas index. I think we are considering at least two options here, and I have open questions for both:
In both cases, narwhals provides a few tools for writing a pandas-specific path, which gives you the flexibility to handle such cases without depending on pandas itself (happy to expand with an example). These tools exist precisely so that narwhals can be integrated into codebases that were built on pandas from the start.
More generally, since narwhals tries to be a subset of the Polars API, heavy use of the index is not expected: Polars has no index or multi-index by design. I will take a closer look at the PR in darts to see whether we can reduce the dependency on index wrangling without a loss in performance 😇 Work has been particularly hectic recently, so I haven't been able to do that yet.
Hi all, just chipping in here to give some more insights in why we thought this could be useful for narwhals. For us, the
|
Hey @dennisbader thanks for your time and clarification. I had some time to catch up on what's happening in Darts and I am connecting the dots, so bear with me 🙈
From what I can see, converting from dataframe to timeseries has no big blocker and will be finalized in darts#2661 without additional features from narwhals.
Vice versa, converting from timeseries to dataframe is a bit more challenging. I cannot say that I am familiar or comfortable with xarray, but as a first impression the only doubt would be how to place the (time) index in the output dataframe. Thinking out loud here (and I can expand the discussion in darts#2689), but a design like the following might require no changes:
from typing import Literal
import narwhals as nw

class TimeSeries:
    ...
    def to_dataframe(self, backend: Literal["pandas", "polars", "pyarrow"], *, time_as_index: bool = False):
        # an index only exists in pandas, so reject the option for other backends
        if time_as_index and backend != "pandas":
            raise ...
        columns = ...
        data = ...
        time_index = ...
        if time_as_index:
            # special path for pandas with index
            import pandas as pd
            return pd.DataFrame(data=data, index=time_index, columns=columns)
        data_ = {
            time_col_name: time_index,  # set time_index as left-most column
            **{col: data[:, idx] for idx, col in enumerate(columns)}
        }
        return nw.from_dict(data_, backend=backend).to_native()
Notice that:
Is the approach above too far from what's needed in practice? |
We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?
We are trying to narwhalify darts.
We already implemented the method TimeSeries.to_dataframe() with narwhals. The next step would be to go from TimeSeries.pd_dataframe() to a more generic TimeSeries.to_dataframe(), enabling the users to choose their preferred backend. The issue is described here.
We store our TimeSeries data in an xarray, where the time information is stored as a pd.DatetimeIndex. Right now we are going from the xarray to a pandas dataframe and this is trivial.
However, going dataframe agnostic, the handling of the DatetimeIndex becomes problematic for two reasons:
Please describe the purpose of the new feature or describe the problem to solve.
To solve 2), we propose the following:
A generic method nw_df.to(backend, keep_index)
backend = the dataframe library to return the dataframe to (Pandas, Polars, Arrow)
keep_index = [True, False], whether the index contains important information and should therefore be reset, as in Pandas, before being converted to Polars or Arrow.
Suggest a solution if possible.
A wrapper around to_pandas, to_polars, and to_arrow, and a smart way to handle the index.
If you have tried alternatives, please describe them below.
No response
Additional information that may help us understand your needs.
No response