[docs] Pandas to Polars #978

Merged on Mar 28, 2023 (3 commits)

stevhliu
Member

Sorry for the wait! This PR updates the current code examples in the Parquet docs to use Polars instead of Pandas. It also replaces the alexandriainst/danish-wit dataset with amazon_polarity, because the former returned an error saying conversion is limited to datasets under 5GB.

I'll follow this up with another PR for the new Parquet guide (querying/use in web apps with duckdb) 🙂

@stevhliu stevhliu requested review from lhoestq and mariosasko March 22, 2023 23:53
@HuggingFaceDocBuilderDev
HuggingFaceDocBuilderDev commented Mar 22, 2023

The documentation is not available anymore as the PR was closed or merged.

Member

@lhoestq lhoestq left a comment

Oh nice! Do you think we should keep an example with pandas somewhere? It can also be useful.

Contributor

@mariosasko mariosasko left a comment

Good job!

Yes, let's also have an example with {polars/pandas}.read_parquet to show how to get a standard DataFrame from the Parquet version of a dataset. For instance, we could explain that scan_parquet reads a Parquet file lazily, without loading all of its contents into RAM, so it can inspect large Parquet files while keeping memory usage as low as possible, whereas read_parquet should give better performance for multiple (uncorrelated) queries if RAM is not an issue.

@stevhliu
Member Author

Yes, let's also have an example with {polars/pandas}.read_parquet to show how to get a standard DataFrame from the Parquet version of a dataset

Cool! Should we have that info here or in the new Parquet guide? It might be better in the new guide since this one is about listing the files, and I think it's better not to stray too much into explaining the different ways and pros/cons of reading Parquet files. I can add a <Tip> here with a link to the new guide so users can still easily find this info.

@mariosasko
Contributor

Feel free to split the guide into two guides, but I think scan_parquet and read_parquet should be in the same guide (they both process Parquet files)

Contributor

@mariosasko mariosasko left a comment

LGTM, thanks!

Member

@lhoestq lhoestq left a comment

Cool, and splitting sounds good as well :)

@stevhliu stevhliu merged commit 3a98ba6 into huggingface:main Mar 28, 2023
@stevhliu stevhliu deleted the update-parquet-example branch March 28, 2023 17:07
@severo
Collaborator

severo commented Mar 28, 2023

@stevhliu
Member Author

Oops sorry! Maybe we can remove or hide the link until #987 is merged?

@severo
Collaborator

severo commented Mar 29, 2023

No, no worries. I don't think the docs get enough traffic yet for that to be necessary. Let's just wait until #987 is merged.
