[docs] Process Parquet files #987

stevhliu · 2023-03-24T19:15:26Z

This PR adds a guide for how to process Parquet files:

Eager dataframes with pd.read_parquet (Pandas) and pl.read_parquet (Polars)
Lazy dataframes with pl.scan_parquet (Polars)
Read and query with DuckDB
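
For readers skimming this thread, the three approaches listed above can be sketched roughly as follows. This is a minimal illustration, not the text of the merged guide: the helper names and `PARQUET_URL` are hypothetical placeholders, and whether a remote URL can be read directly depends on the installed library versions and their optional dependencies (a Parquet engine such as pyarrow for Pandas, the httpfs extension for DuckDB).

```python
# Hypothetical placeholder: substitute the URL (or local path) of a real Parquet file.
PARQUET_URL = "https://example.com/path/to/train.parquet"


def read_eager_pandas(source):
    """Eagerly load the whole Parquet file into a pandas DataFrame."""
    import pandas as pd  # imported lazily; needs a Parquet engine such as pyarrow

    return pd.read_parquet(source)


def read_eager_polars(source):
    """Eagerly load the whole Parquet file into a Polars DataFrame."""
    import polars as pl

    return pl.read_parquet(source)


def scan_lazy_polars(source):
    """Return a Polars LazyFrame; nothing is read until .collect() is called."""
    import polars as pl

    return pl.scan_parquet(source)


def duckdb_count_query(source):
    """Build a SQL string that counts the rows of a Parquet file."""
    escaped = source.replace("'", "''")  # escape single quotes for the SQL literal
    return f"SELECT COUNT(*) FROM read_parquet('{escaped}')"


def count_rows_duckdb(source):
    """Run the count query with DuckDB (assumes the httpfs extension can be installed)."""
    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")  # only needed for remote (HTTP/S3) files
    con.execute("LOAD httpfs")
    return con.execute(duckdb_count_query(source)).fetchone()[0]
```

The library imports sit inside each function so that only the library actually used needs to be installed; for example, `scan_lazy_polars(PARQUET_URL).head(5).collect()` requires Polars alone.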

HuggingFaceDocBuilderDev · 2023-03-24T19:18:56Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Nice! This is super useful :)

Just two comments on the section title:

Do you think we can make it clearer in the title that you can query any dataset? IMO the current title "Process Parquet files" may suggest that it is only useful if you already have Parquet files. Maybe mention that the Parquet files are "published", or use a verb that suggests the data is remote, e.g. "access" in addition to "process".
(nit) Both the Polars and DuckDB examples run "queries" on Parquet files. Maybe say "query" instead of "process"? I think "process" can be reserved for data manipulation that involves user-defined functions (UDFs), with Spark for example.

docs/source/parquet_process.mdx

mariosasko

Nice!

IMO it makes sense to remove the read_polars/read_pandas examples from "List Parquet files" and instead provide a link to this guide, now that we have it.

And maybe we can mention at the end of the guide that Parquet files can also be queried in the browser with DuckDB WASM.

PS: doc-builder generates an (empty) cURL version of the examples; it would be nice if this could be removed.

stevhliu · 2023-03-27T18:40:39Z

> PS: doc-builder generates an (empty) cURL version of the examples; it would be nice if this could be removed.

Oh! I didn't think the doc-builder would generate cURL examples even though I didn't specify them in the <inferencesnippet>. @mishig25 is there a way to remove this?

mariosasko

Approving, but let's see if we can fix the issue with the cURL version of the examples being displayed before merging!

docs/source/parquet_process.mdx

stevhliu · 2023-04-04T16:22:25Z

@mariosasko, empty cURL example fixed in huggingface/doc-builder#365!

add how to process parquet files

4facb2c

stevhliu requested review from lhoestq and mariosasko March 24, 2023 19:15

lhoestq reviewed Mar 27, 2023

docs/source/parquet_process.mdx (outdated, resolved)

mariosasko reviewed Mar 27, 2023

apply feedback

4415345

stevhliu marked this pull request as ready for review March 27, 2023 18:41

fix toctree title

c511883

mariosasko approved these changes Mar 28, 2023

docs/source/parquet_process.mdx (outdated, resolved)

docs/source/parquet_process.mdx (resolved)

apply feedback/light edits

1216d8b

severo mentioned this pull request Mar 28, 2023

[docs] Pandas to Polars #978

Merged

Merge branch 'main' into parquet-guide

8432b17

stevhliu merged commit 571d626 into huggingface:main Apr 11, 2023

stevhliu deleted the parquet-guide branch April 11, 2023 16:57
