[docs] Pandas to Polars (#978)
* pandas to polars

* remove code examples for reading files

* light edits
stevhliu authored Mar 28, 2023
1 parent a4665b4 commit 3a98ba6
Showing 2 changed files with 40 additions and 116 deletions.
2 changes: 1 addition & 1 deletion docs/source/_toctree.yml
```yaml
- local: first_rows
  title: Preview a dataset
- local: parquet
  title: List Parquet files
- title: Conceptual Guides
  sections:
  - local: configs_and_splits
```
154 changes: 39 additions & 115 deletions docs/source/parquet.mdx
# List Parquet files

Datasets can be published in any format (CSV, JSONL, directories of images, etc.) to the Hub, and they are easily accessed with the 🤗 [Datasets](https://huggingface.co/docs/datasets/) library. For a more performant experience (especially when it comes to large datasets), Datasets Server automatically converts every dataset to the [Parquet](https://parquet.apache.org/) format. The Parquet files are published to the Hub on a specific `refs/convert/parquet` branch (see the `amazon_polarity` [branch](https://huggingface.co/datasets/amazon_polarity/tree/refs%2Fconvert%2Fparquet), for example).

This guide shows you how to use Datasets Server's `/parquet` endpoint to retrieve a list of a dataset's files converted to Parquet. Feel free to also try it out with [Postman](https://www.postman.com/huggingface/workspace/hugging-face-apis/request/23242779-f0cde3b9-c2ee-4062-aaca-65c4cfdd96f8), [RapidAPI](https://rapidapi.com/hugging-face-hugging-face-default/api/hugging-face-datasets-api), or [ReDoc](https://redocly.github.io/redoc/?url=https://datasets-server.huggingface.co/openapi.json#operation/listSplits).

The `/parquet` endpoint accepts the dataset name as its query parameter:

<inferencesnippet>
<curl>
```curl
curl https://datasets-server.huggingface.co/parquet?dataset=duorc \
        -X GET
```
</curl>
</inferencesnippet>
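
As a minimal sketch, the same request can be made in Python with the `requests` library (shown for illustration; any HTTP client works):

```python
import requests

# Ask the /parquet endpoint for the list of Parquet files of the `duorc` dataset
API_URL = "https://datasets-server.huggingface.co/parquet?dataset=duorc"
response = requests.get(API_URL)
data = response.json()
```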

The endpoint response is a JSON containing a list of the dataset's files in the Parquet format. For example, the [`duorc`](https://huggingface.co/datasets/duorc) dataset has six Parquet files, which correspond to the `test`, `train` and `validation` splits of its two configurations, `ParaphraseRC` and `SelfRC` (see the [List splits and configurations](./splits) guide for more details about splits and configurations).

The endpoint also gives the filename and size of each file:

```json
{
   "parquet_files": [
      ...
   ]
}
```
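
As an illustrative sketch (reusing the `data` response from the Python example above), the filename and size fields can be read directly from the `parquet_files` list:

```python
# Print the split, filename, and size (in bytes) of each converted Parquet file
for parquet_file in data["parquet_files"]:
    print(f"{parquet_file['split']}: {parquet_file['filename']} ({parquet_file['size']} bytes)")
```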


## Sharded Parquet files

Big datasets are partitioned into Parquet files (shards) of about 1GB each. The filename contains the name of the dataset, the split, the shard index, and the total number of shards (`dataset-name-train-00000-of-00004.parquet`). For example, the `train` split of the [`amazon_polarity`](https://datasets-server.huggingface.co/parquet?dataset=amazon_polarity) dataset is partitioned into 4 shards:

```json
{
   "parquet_files": [
      {
         "dataset": "amazon_polarity",
         "config": "amazon_polarity",
         "split": "test",
         "url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-test.parquet",
         "filename": "amazon_polarity-test.parquet",
         "size": 117422359
      },
      {
         "dataset": "amazon_polarity",
         "config": "amazon_polarity",
         "split": "train",
         "url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00000-of-00004.parquet",
         "filename": "amazon_polarity-train-00000-of-00004.parquet",
         "size": 320281121
      },
      {
         "dataset": "amazon_polarity",
         "config": "amazon_polarity",
         "split": "train",
         "url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00001-of-00004.parquet",
         "filename": "amazon_polarity-train-00001-of-00004.parquet",
         "size": 320627716
      },
      {
         "dataset": "amazon_polarity",
         "config": "amazon_polarity",
         "split": "train",
         "url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00002-of-00004.parquet",
         "filename": "amazon_polarity-train-00002-of-00004.parquet",
         "size": 320587882
      },
      {
         "dataset": "amazon_polarity",
         "config": "amazon_polarity",
         "split": "train",
         "url": "https://huggingface.co/datasets/amazon_polarity/resolve/refs%2Fconvert%2Fparquet/amazon_polarity/amazon_polarity-train-00003-of-00004.parquet",
         "filename": "amazon_polarity-train-00003-of-00004.parquet",
         "size": 66515954
      }
   ],
   "pending": [],
   "failed": []
}
```
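
The shards of a split can be read and concatenated back into a single table. Here is a sketch with Polars (assuming `polars` and `fsspec` are installed so that remote Parquet files can be read over HTTP; see the guide linked below for the recommended workflow):

```python
import polars as pl
import requests

# Collect the URLs of the `train` shards from the /parquet endpoint
r = requests.get("https://datasets-server.huggingface.co/parquet?dataset=amazon_polarity")
urls = [f["url"] for f in r.json()["parquet_files"] if f["split"] == "train"]

# Read each shard and concatenate them into a single DataFrame
df = pl.concat([pl.read_parquet(url) for url in urls])
```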

To read and query the Parquet files, take a look at the [Query datasets from Datasets Server](parquet_process) guide.
