[docs] Pandas to Polars #978
The documentation is not available anymore as the PR was closed or merged.
Oh nice! Do you think we should keep an example with pandas somewhere? It can also be useful.
Good job!
Yes, let's also have an example with {polars/pandas}.read_parquet to show how to get a standard DataFrame from the Parquet version of a dataset. For instance, we could explain that scan_parquet lazily reads a Parquet file without loading all of its contents into RAM and, as such, can inspect large Parquet files while keeping memory usage as low as possible, whereas read_parquet should give better performance for multiple (uncorrelated) queries if RAM is not an issue.
Cool! Should we have that info here or in the new Parquet guide? It might be better in the new guide since this one is about listing the files, and I think it's better not to stray too much into explaining the different ways and pros/cons of reading Parquet files. I can add a |
Feel free to split the guide into two guides, but I think |
LGTM, thanks!
Cool, and splitting sounds good as well :)
Thanks! A detail: by merging this PR before #987, we have a broken link at https://github.com/huggingface/datasets-server/pull/978/files#diff-92a1916282fa4dd583217985f2e4bfae937a001fba6549f47bd9396b74dc8be3R160.
Oops, sorry! Maybe we can remove or hide the link until #987 is merged?
No, no worries. I don't think the docs get enough traffic yet for that to be necessary. Let's just wait until #987 is merged.
Sorry for the wait! This PR updates the current code examples in the Parquet docs to use Polars instead of Pandas. It also replaces the alexandriainst/danish-wit dataset with the amazon_polarity dataset because the former returned an error saying conversion is limited to datasets under 5GB.
I'll follow this up with another PR for the new Parquet guide (querying/use in web apps with duckdb) 🙂