[docs] Process Parquet files #987

Merged 5 commits on Apr 11, 2023

Conversation

stevhliu
Member

This PR adds a guide for how to process Parquet files:

  • Eager dataframes with pd/pl.read_parquet
  • Lazy dataframes with pl.scan_parquet
  • Read and query with DuckDB
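
As a rough illustration of the three approaches listed above, here is a minimal sketch. It is not taken from the guide itself: the URL is a hypothetical placeholder, the helper names (`read_eager_pandas`, `read_eager_polars`, `scan_lazy`, `count_rows_sql`, `count_rows_duckdb`) and the `column` parameter are made up for this example, and it assumes pandas (with a Parquet engine such as pyarrow), Polars, and DuckDB are installed.

```python
# Hypothetical placeholder: substitute the URL of a real published Parquet file.
PARQUET_URL = "https://example.com/path/to/train.parquet"


def read_eager_pandas(url: str = PARQUET_URL):
    """Eager: load the whole file into a pandas DataFrame."""
    import pandas as pd  # imported lazily so each helper needs only its own library

    return pd.read_parquet(url)


def read_eager_polars(url: str = PARQUET_URL):
    """Eager: load the whole file into a Polars DataFrame."""
    import polars as pl

    return pl.read_parquet(url)


def scan_lazy(url: str, column: str, n: int = 5):
    """Lazy: build a query plan first; data is only read when collect() runs."""
    import polars as pl

    return pl.scan_parquet(url).select(pl.col(column)).head(n).collect()


def count_rows_sql(url: str) -> str:
    """Build a DuckDB SQL query that counts the rows of a Parquet file."""
    safe_url = url.replace("'", "''")  # escape single quotes for the SQL literal
    return f"SELECT COUNT(*) AS n_rows FROM read_parquet('{safe_url}')"


def count_rows_duckdb(url: str = PARQUET_URL) -> int:
    """Query: let DuckDB read the remote file through its httpfs extension."""
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    return con.execute(count_rows_sql(url)).fetchone()[0]
```

The eager helpers pull the full file into memory, while the lazy and DuckDB variants defer work until a result is requested, which tends to matter more as files grow.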

@stevhliu stevhliu requested review from lhoestq and mariosasko March 24, 2023 19:15
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Mar 24, 2023

The documentation is not available anymore as the PR was closed or merged.

Member

@lhoestq lhoestq left a comment

Nice! This is super useful :)

Just two comments on the section title:

  1. Do you think we can make it clearer in the title that you can query any dataset? IMO the current title "Process Parquet files" may suggest that the guide is only useful if you already have Parquet files. Maybe mention that the Parquet files are "published", or use a verb that suggests the data is remote, e.g. "access" in addition to "process".
  2. (nit) Both the Polars and DuckDB examples run "queries" on Parquet files. Maybe say "query" instead of "process"? I think "process" can be kept for data manipulation that involves user-defined functions (UDFs), with Spark for example.

Review thread on docs/source/parquet_process.mdx (outdated, resolved)
Contributor

@mariosasko mariosasko left a comment

Nice!

IMO it makes sense to remove the read_polars/read_pandas examples from the "List Parquet files" guide and instead link to this guide now that we have it.

And maybe we can mention at the end of the guide that Parquet files can also be queried in the browser with DuckDB WASM.

PS: doc-builder generates an (empty) cURL version of the examples; it would be nice if this could be removed

@stevhliu
Member Author

> PS: doc-builder generates (empty) CURL version of the examples, would be nice if this can be removed

Oh! I didn't think the doc-builder would generate cURL examples even though I didn't specify them in the <inferencesnippet>. @mishig25 is there a way to remove this?

@stevhliu stevhliu marked this pull request as ready for review March 27, 2023 18:41
Contributor

@mariosasko mariosasko left a comment

Approving but let's see if we can fix the issue with the cURL version of the examples being displayed before merging!

Review thread on docs/source/parquet_process.mdx (outdated, resolved)
Review thread on docs/source/parquet_process.mdx (resolved)
@severo severo mentioned this pull request Mar 28, 2023
@stevhliu
Member Author

stevhliu commented Apr 4, 2023

@mariosasko, empty cURL example fixed in huggingface/doc-builder#365!

@stevhliu stevhliu merged commit 571d626 into huggingface:main Apr 11, 2023
@stevhliu stevhliu deleted the parquet-guide branch April 11, 2023 16:57