Skip to content

Commit

Permalink
[docs] Process Parquet files (#987)
Browse files Browse the repository at this point in the history
* add how to process parquet files

* apply feedback

* fix toctree title

* apply feedback/light edits
  • Loading branch information
stevhliu authored Apr 11, 2023
1 parent af1a46c commit 571d626
Show file tree
Hide file tree
Showing 2 changed files with 238 additions and 1 deletion.
4 changes: 3 additions & 1 deletion docs/source/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,9 @@
- local: first_rows
title: Preview a dataset
- local: parquet
title: List Parquet files
title: List Parquet file
- local: parquet_process
title: Query datasets from Datasets Server
- title: Conceptual Guides
sections:
- local: configs_and_splits
Expand Down
235 changes: 235 additions & 0 deletions docs/source/parquet_process.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,235 @@
# Query datasets from Datasets Server

Datasets Server automatically converts and publishes datasets on the Hub as Parquet files. [Parquet](https://parquet.apache.org/docs/) files are column-based and they shine when you're working with big data. There are several ways you can work with Parquet files, and this guide will show you how to:

- read and query Parquet files with Pandas and Polars
- connect, read and query Parquet files with DuckDB and DuckDB-Wasm

## Polars

[Polars](https://pola-rs.github.io/polars-book/user-guide/introduction.html) is a fast DataFrame library written in Rust with Arrow as its foundation.

<Tip>

💡 Learn more about how to get the dataset URLs in the [List Parquet files](parquet) guide.

</Tip>

Let's start by grabbing the URLs to the `train` split of the [`blog_authorship_corpus`](https://huggingface.co/datasets/blog_authorship_corpus) dataset from Datasets Server:

```py
r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=blog_authorship_corpus")
j = r.json()
urls = [f['url'] for f in j['parquet_files'] if f['split'] == 'train']
urls
['https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet',
'https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00001-of-00002.parquet']
```

To read from a single Parquet file, use the [`read_parquet`](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html) function to read it into a DataFrame and then execute your query:

```py
import polars as pl

df = (
pl.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet")
.groupby("horoscope")
.agg(
[
pl.count(),
pl.col("text").str.n_chars().mean().alias("avg_blog_length")
]
)
.sort("avg_blog_length", descending=True)
.limit(5)
)
print(df)
shape: (5, 3)
┌───────────┬───────┬─────────────────┐
│ horoscope ┆ count ┆ avg_blog_length │
---------
str ┆ u32 ┆ f64 │
╞═══════════╪═══════╪═════════════════╡
│ Aquarius ┆ 340621129.218836
│ Cancer ┆ 415091098.366812
│ Capricorn ┆ 339611073.2002
│ Libra ┆ 403021072.071833
│ Leo ┆ 405871064.053687
└───────────┴───────┴─────────────────┘
```

To read multiple Parquet files - for example, if the dataset is sharded - you'll need to use the [`concat`](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.concat.html) function to concatenate the files into a single DataFrame:

```py
import polars as pl
df = (
pl.concat([pl.read_parquet(url) for url in urls])
.groupby("horoscope")
.agg(
[
pl.count(),
pl.col("text").str.n_chars().mean().alias("avg_blog_length")
]
)
.sort("avg_blog_length", descending=True)
.limit(5)
)
print(df)
shape: (5, 3)
┌─────────────┬───────┬─────────────────┐
│ horoscope ┆ count ┆ avg_blog_length │
---------
str ┆ u32 ┆ f64 │
╞═════════════╪═══════╪═════════════════╡
│ Aquarius ┆ 495681125.830677
│ Cancer ┆ 635121097.956087
│ Libra ┆ 603041060.611054
│ Capricorn ┆ 494021059.555261
│ Sagittarius ┆ 504311057.458984
└─────────────┴───────┴─────────────────┘
```

### Lazy API

Polars offers a [lazy API](https://pola-rs.github.io/polars-book/user-guide/lazy-api/intro.html) that is more performant and memory-efficient for large Parquet files. The LazyFrame API keeps track of what you want to do, and it'll only execute the entire query when you're ready. This way, the lazy API doesn't load everything into RAM beforehand, and it allows you to work with datasets larger than your available RAM.

To lazily read a Parquet file, use the [`scan_parquet`](https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html) function instead. Then, execute the entire query with the [`collect`](https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.collect.html) function:

```py
import polars as pl

q = (
pl.scan_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet")
.groupby("horoscope")
.agg(
[
pl.count(),
pl.col("text").str.n_chars().mean().alias("avg_blog_length")
]
)
.sort("avg_blog_length", descending=True)
.limit(5)
)
df = q.collect()
```

## Pandas

You can also use the popular Pandas DataFrame library to read Parquet files.

To read from a single Parquet file, use the [`read_parquet`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) function to read it into a DataFrame:

```py
import pandas as pd

df = (
pd.read_parquet("https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet")
.groupby('horoscope')['text']
.apply(lambda x: x.str.len().mean())
.sort_values(ascending=False)
.head(5)
)
```

To read multiple Parquet files - for example, if the dataset is sharded - you'll need to use the [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) function to concatenate the files into a single DataFrame:

```py
df = (
pd.concat([pd.read_parquet(url) for url in urls])
.groupby('horoscope')['text']
.apply(lambda x: x.str.len().mean())
.sort_values(ascending=False)
.head(5)
)
```

## DuckDB

[DuckDB](https://duckdb.org/docs/) is a database that supports reading and querying Parquet files really fast. Begin by creating a connection to DuckDB, and then install and load the [`httpfs`](https://duckdb.org/docs/extensions/httpfs.html) extension to read and write remote files:

<inferencesnippet>
<python>
```py
import duckdb

url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet"

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
```
</python>
<js>
```js
var duckdb = require('duckdb');
var db = new duckdb.Database(':memory:');
var con = db.connect();
con.exec('INSTALL httpfs');
con.exec('LOAD httpfs');

const url = "https://huggingface.co/datasets/blog_authorship_corpus/resolve/refs%2Fconvert%2Fparquet/blog_authorship_corpus/blog_authorship_corpus-train-00000-of-00002.parquet"
```
</js>
</inferencesnippet>

Now you can write and execute your SQL query on the Parquet file:

<inferencesnippet>
<python>
```py
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '{url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
┌───────────┬──────────────┬────────────────────┐
│ horoscope │ count_star() │ avg_blog_length │
│ varchar │ int64 │ double │
├───────────┼──────────────┼────────────────────┤
│ Aquarius │ 340621129.218836239798
│ Cancer │ 415091098.366812016671
│ Capricorn │ 339611073.2002002296751
│ Libra │ 403021072.0718326633914
│ Leo │ 405871064.0536871412028
└───────────┴──────────────┴────────────────────┘
```
</python>
<js>
```js
con.all(`SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM '${url}' GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)`, function(err, res) {
if (err) {
throw err;
}
console.log(res)
});
```
</js>
</inferencesnippet>

To query multiple files - for example, if the dataset is sharded:

<inferencesnippet>
<python>
```py
con.sql(f"SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet({urls[:2]}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)")
┌─────────────┬──────────────┬────────────────────┐
│ horoscope │ count_star() │ avg_blog_length │
│ varchar │ int64 │ double │
├─────────────┼──────────────┼────────────────────┤
│ Aquarius │ 495681125.8306770497095
│ Cancer │ 635121097.95608703867
│ Libra │ 603041060.6110539931017
│ Capricorn │ 494021059.5552609206104
│ Sagittarius │ 504311057.4589835616982
└─────────────┴──────────────┴────────────────────┘
```
</python>
<js>
```js
con.all(`SELECT horoscope, count(*), AVG(LENGTH(text)) AS avg_blog_length FROM read_parquet(${JSON.stringify(urls)}) GROUP BY horoscope ORDER BY avg_blog_length DESC LIMIT(5)`, function(err, res) {
if (err) {
throw err;
}
console.log(res)
});
```
</js>
</inferencesnippet>

[DuckDB-Wasm](https://duckdb.org/docs/api/wasm), a package powered by , is also availabe for running DuckDB in a browser. This could be useful, for instance, if you want to create a web app to query Parquet files from the browser!

0 comments on commit 571d626

Please sign in to comment.