Should vizro support polars (or other dataframes besides pandas)? #286

Open
antonymilne opened this issue Jan 25, 2024 · 12 comments
@antonymilne
Contributor

antonymilne commented Jan 25, 2024

Ty Petar, please consider supporting polars, I think it is necessary, given that the whole point of vizro is working with a dataframe in memory. Currently vizro cannot determine polars column names (detects them as [0,1,2,3,4...])

Originally posted by @vmisusu in #191 (comment)


I'm opening this issue to see whether other people have the same question so we can figure out what priority it should be. Just hit 👍 if it's something you'd like to see in vizro, and feel free to leave any comments.

The current situation (25 January 2024) is:

  • vizro currently only supports pandas DataFrames, but supporting others like polars is a great idea and something we did consider before. The main blocker previously was that plotly didn't support polars, but as of plotly 5.15 it supports not just polars but any dataframe with a to_pandas method, and as of 5.16 it supports dataframes that follow the dataframe interchange protocol (which is now pip installable)
  • on vizro we could follow a similar sort of pattern to plotly's development [1]. Supporting the dataframe interchange protocol is ideally the "right" way to do this, but we should work out how much performance improvement polars users would actually get in practice, to see what the value of this would be over a simple to_pandas call. The biggest changes we'd need to make would be to actions code like filtering functionality (FYI @petar-qb). I don't think it would be too hard, but it's certainly not a small task either
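
For illustration, the "simple to_pandas call" option might look roughly like the sketch below. This is not vizro's actual actions code: `as_pandas` and `filter_between` are hypothetical names, and the only assumption about the input is that it is either a pandas DataFrame or an object with a `to_pandas` method (as polars DataFrames have). The point is that existing pandas-based filtering can stay unchanged if the conversion happens once at the boundary.

```python
import pandas as pd


def as_pandas(data_frame):
    """Hypothetical helper: return a pandas DataFrame for pandas-like input."""
    if isinstance(data_frame, pd.DataFrame):
        # Already pandas, nothing to convert.
        return data_frame
    if hasattr(data_frame, "to_pandas"):
        # Covers polars and any other object exposing to_pandas().
        return data_frame.to_pandas()
    raise TypeError(f"Unsupported dataframe type: {type(data_frame).__name__}")


def filter_between(data_frame, column, low, high):
    """Hypothetical range filter, written purely against the pandas API."""
    df = as_pandas(data_frame)
    # Series.between is inclusive on both ends by default.
    return df[df[column].between(low, high)]
```

With this approach a polars user gets the convenience of passing their dataframe directly, but every operation after the boundary still runs on pandas, so no speed-up should be expected from it.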

See also How Polars Can Help You Build Fast Dash Apps for Large Datasets

From @Coding-with-Adam:

Chad had a nice app that he built to compare between pandas and polars and show the difference when using Dash. https://dash-polars-pandas-docker.onrender.com/ (free tier)
I also made a video with him: https://youtu.be/_iebrqafOuM
And here’s the article he wrote: Dash: Polars vs Pandas. An interactive battle between the… | by Chad Bell | Medium

FYI @astrojuanlu

Footnotes

  1. https://github.com/plotly/plotly.py/pull/4244 https://github.com/plotly/plotly.py/pull/4272/files https://github.com/plotly/plotly.py/pull/3901 https://github.com/plotly/plotly.py/issues/3637

@antonymilne changed the title from "Should we support polars?" to "Should vizro support polars (or other dataframes besides pandas)?" on Jan 25, 2024
@datajoely

Maybe Ibis is a good fit here?

@astrojuanlu

I only reacted with 🚀 to this, but to make my position more clear,

as of 5.15 it supports not just polars but actually any dataframe with a to_pandas method, and as of 5.16 it supports dataframes that follow the dataframe interchange protocol (data-apis/dataframe-api#73)

this is awesome ⭐

but we should work out exactly how much performance improvement polars users would actually get in practice to see what the value of this would be over a simple to_pandas call.

I think it's more about DX than necessarily a performance improvement. If folks are using Polars for whatever reason and then have to do .to_pandas() to use Vizro, it feels a bit meh. If Vizro supports Polars natively, it's more pleasant.

@astrojuanlu

Just seen on their LinkedIn:

Check migrated all 100+ of their Airflow DAGs from pandas to Polars and saved 25% in cloud expenses.

https://pola.rs/posts/case-check-technology/

@reouvenzana

Any update about Polars integration? More and more people are using it.

@datajoely

Supporting Narwhals may be a sensible 1st step since we get many birds with one stone...
https://github.com/narwhals-dev/narwhals

Would also like to support Ibis for the same reason.

@antonymilne
Contributor Author

Narwhals looks very interesting, thanks for pointing it out @datajoely.

@reouvenzana no updates on this - the current situation outlined in the first post still applies here. It's something I'd still like to do but it just hasn't been prioritised yet. Your comment here though does help to bump the priority up!

When we do implement this, whatever we do is likely to closely follow plotly's pattern to begin with, so there won't be any performance improvements, just a DX improvement as @astrojuanlu suggested above.

@reouvenzana are you interested in using polars in vizro for performance improvements or just for ease of use to avoid doing a to_pandas call?

@reouvenzana

reouvenzana commented Jun 19, 2024

@antonymilne

Your comment here though does help to bump the priority up!

Nice!

@reouvenzana are you interested in using polars in vizro for performance improvements or just for ease of use to avoid doing a to_pandas call?

Honestly, it's more about the api / functionalities of Polars (easy method chaining, list columns which are really useful) than performance issues. It's bothersome to have pl.DataFrame, df.to_pandas() everywhere in my code, and also to handle the mismatch between data types. I'm aware that Dash / Plotly "supports" polars, though there is no performance gain as you've pointed out. Still, it would be great, given the increasing traction behind Polars.

@antonymilne
Contributor Author

Got it, thank you @reouvenzana, that makes a lot of sense 👍

For future reference, the logic plotly uses to do the conversion to pandas is here:
https://github.com/plotly/plotly.py/blob/51eb5ea9fefda27bccfdb21e660b8d4035cef3b0/packages/python/plotly/plotly/express/_core.py#L1323-L1353. So with any pandas >= 2.0.2 installed, plotly will use __dataframe__ rather than to_pandas. We would probably do something similar to this to begin with anyway.

@MarcoGorelli

I'm aware that Dash / Plotly "supports" polars, though there is no performance gain as you've pointed out

Plotly have now moved to Narwhals, and for Polars many plots get 2-3x faster, especially those involving groupings (e.g. color='symbol')

@antonymilne
Contributor Author

Thanks very much for the information! 🙏 I had noticed narwhals was a new dependency when we got plotly==6.0.0rc0 in our CI and all our tests broke (not because of narwhals but due to lots of new deprecation warnings) 😅 This is the sort of proper support for polars that I was hoping would be added and I'm very pleased to see it!

Do you happen to know a rough timeframe for a stable release of plotly 6?

This issue has got more than enough support now, so let's try and prioritise doing this in the new year 🚀

@MarcoGorelli

Thanks! I heard from plotly devs that it would be 1-2 weeks after the pre-release. But, given that things always take longer than expected, I wouldn't be surprised if it ended up being early January 😄

Cool, feel free to ping me or anyone else from Narwhals if you'd like any support (and/or wanted to bring any of us in for some pair-programming - we did this with Shiny and it was quite fruitful)

@antonymilne
Contributor Author

Amazing, thank you very much for the offer! I don't think we'd get around to doing this before January anyway so will put it on our roadmap and let's be in touch again when we come to doing it.
