add tour about Features (#53)

* docs: Add translation Arrow documentation
* docs: Add Features tour
* docs: Add tour about class Features.
datawhalechina · Nov 6, 2024 · 36c2cb8 · 36c2cb8
1 parent 23fbf11
commit 36c2cb8
Showing 14 changed files with 388 additions and 5 deletions.
diff --git a/docs/chapter1/arrow_tour/arrow_tour.md b/docs/chapter1/arrow_tour/arrow_tour.md
@@ -0,0 +1,62 @@
+---
+comments: true
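+title: Introduction to Arrow
+---
+
+!!! quote "Translated from [HuggingFace Arrow](https://huggingface.co/docs/datasets/main/en/about_arrow)"
+
+## What is `Arrow`?
+
+`Arrow` is a data format that enables fast processing and movement of large amounts of data. It stores data in a columnar memory layout, and its **standard format** offers the following advantages:
+
+| Feature                | Description                                                                                                             |
+| ---------------------- | ----------------------------------------------------------------------------------------------------------------------- |
+| Reads                  | Supports **zero-copy** reads, which removes virtually all serialization overhead.                                       |
+| Cross-language support | Supports many programming languages.                                                                                    |
+| Storage                | Column-oriented storage, which makes querying and processing slices or columns of data faster.                          |
+| Compatibility          | Data can be passed seamlessly to mainstream machine learning tools such as `NumPy`, `Pandas`, `PyTorch`, and `TensorFlow`. |
+| Column types           | Supports many column types, including nested column types.                                                              |
+
+## Memory mapping
+
+`Datasets` uses `Arrow` for its local caching system. This lets a dataset be backed by an on-disk cache that is memory-mapped for fast lookup.
+
+This architecture makes it possible to use large datasets on machines with relatively little device memory.
+
+For example, loading the full English Wikipedia dataset takes only a few megabytes of RAM: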
+
+```python
+import os
+import psutil
+import timeit
+from datasets import load_dataset
+
+mem_before = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
+wiki = load_dataset("wikipedia", "20220301.en", split="train")
+mem_after = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
+
+print(f"RAM memory used: {(mem_after - mem_before)} MB")
+```
+
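+This is possible because the `Arrow` data is memory-mapped from disk rather than loaded into memory. Memory mapping gives access to data on disk and relies on the operating system's virtual memory for fast lookups.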
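The memory-mapping idea described above can be sketched with Python's standard `mmap` module. This is only an illustration of the general technique, not how `Datasets` or `Arrow` implement it: the fixed-size record layout, `RECORD_SIZE`, and `read_record_id` are assumptions made up for the example.

```python
import mmap
import os
import tempfile

# Hypothetical on-disk layout: fixed-size records whose first 8 bytes hold an id.
RECORD_SIZE = 16
NUM_RECORDS = 65536

# Write a 1 MiB file that stands in for a cache file on disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(NUM_RECORDS):
        tmp.write(i.to_bytes(8, "little") + b"\x00" * 8)
    path = tmp.name


def read_record_id(mapped: mmap.mmap, index: int) -> int:
    """Read the id of record `index` by slicing the mapping at its byte offset."""
    offset = index * RECORD_SIZE
    return int.from_bytes(mapped[offset:offset + 8], "little")


with open(path, "rb") as f:
    # Map the whole file read-only; pages are loaded lazily when they are touched.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        first = read_record_id(mapped, 0)
        middle = read_record_id(mapped, 40_000)
        last = read_record_id(mapped, NUM_RECORDS - 1)

os.remove(path)
print(first, middle, last)
```

Only the pages that hold the three requested records need to be brought into memory, which is the property that keeps the resident memory small for a large file.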
+
+## Performance
+
+Iterating over a memory-mapped dataset with Arrow is fast. Iterating over Wikipedia on a laptop reaches speeds of 1-3 Gbit/s.
+
+```python
+s = """batch_size = 1000
+for batch in wiki.iter(batch_size):
+    ...
+"""
+
+elapsed_time = timeit.timeit(stmt=s, number=1, globals=globals())
+print(
+    f"Time to iterate over the {wiki.dataset_size >> 30} GB dataset: {elapsed_time:.1f} sec, "
+    f"ie. {float(wiki.dataset_size >> 27)/elapsed_time:.1f} Gb/s"
+)
+```
+
+```python
+Time to iterate over the 18 GB dataset: 31.8 sec, ie. 4.8 Gb/s
+```
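The benchmark above converts a byte count with bit shifts: `>> 30` divides by 2^30 (bytes to GiB), while `>> 27` divides by 2^27, which equals multiplying by 8 and dividing by 2^30 (bytes to gibibits). A small sketch of that arithmetic follows; the helper names are hypothetical and the input sizes are made-up round numbers, not measurements.

```python
def size_in_gib(num_bytes: int) -> int:
    """Whole gibibytes: divide the byte count by 2**30."""
    return num_bytes >> 30


def size_in_gibit(num_bytes: int) -> int:
    """Whole gibibits: 8 bits per byte, so divide by 2**30 / 8 == 2**27."""
    return num_bytes >> 27


def throughput_gibit_per_s(num_bytes: int, elapsed_seconds: float) -> float:
    """Gibibits processed per second for a run that took `elapsed_seconds`."""
    return size_in_gibit(num_bytes) / elapsed_seconds


# Made-up example: 4 GiB iterated in 8 seconds.
example_bytes = 4 << 30
example_rate = throughput_gibit_per_s(example_bytes, 8.0)
print(size_in_gib(example_bytes), size_in_gibit(example_bytes), example_rate)
```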
diff --git a/docs/chapter1/custom_dataset/custom_dataset.md b/docs/chapter1/custom_dataset/custom_dataset.md
@@ -0,0 +1,6 @@
+---
+comments: true
+title: Custom datasets
+---
+
+## Preface
diff --git a/docs/chapter1/datasets.md → docs/chapter1/dataset_tour/datasets.md b/docs/chapter1/datasets.md → docs/chapter1/dataset_tour/datasets.md
@@ -1,7 +1,8 @@
 ---
 comments: true
-title: Datasets
+title: Introduction to Datasets
 ---
+
 ![datasets](./imgs/datasets.png)
 
 ## Preface

diff --git a/docs/chapter1/imgs/cmrc.png → docs/chapter1/dataset_tour/imgs/cmrc.png b/docs/chapter1/imgs/cmrc.png → docs/chapter1/dataset_tour/imgs/cmrc.png
diff --git a/docs/chapter1/imgs/cmrc_split.png → ...chapter1/dataset_tour/imgs/cmrc_split.png b/docs/chapter1/imgs/cmrc_split.png → ...chapter1/dataset_tour/imgs/cmrc_split.png
diff --git a/docs/chapter1/imgs/data_hub.png → docs/chapter1/dataset_tour/imgs/data_hub.png b/docs/chapter1/imgs/data_hub.png → docs/chapter1/dataset_tour/imgs/data_hub.png
diff --git a/docs/chapter1/imgs/datasets.png → docs/chapter1/dataset_tour/imgs/datasets.png b/docs/chapter1/imgs/datasets.png → docs/chapter1/dataset_tour/imgs/datasets.png
diff --git a/docs/chapter1/imgs/ruozhiba.png → docs/chapter1/dataset_tour/imgs/ruozhiba.png b/docs/chapter1/imgs/ruozhiba.png → docs/chapter1/dataset_tour/imgs/ruozhiba.png
diff --git a/docs/chapter1/imgs/ruozhiba_split.png → ...ter1/dataset_tour/imgs/ruozhiba_split.png b/docs/chapter1/imgs/ruozhiba_split.png → ...ter1/dataset_tour/imgs/ruozhiba_split.png
diff --git a/docs/chapter1/datasets_index.md b/docs/chapter1/datasets_index.md
@@ -5,4 +5,7 @@ title: Index
 
 Home
 
-- [Overview of `Datasets`](./datasets.md)
+- [Introduction to `Arrow`](./arrow_tour/arrow_tour.md)
+- [Introduction to `Datasets`](./dataset_tour/datasets.md)
+- [Introduction to `Features`](./features_tour/features_tour.md)
+- [Custom datasets](./custom_dataset/custom_dataset.md)
diff --git a/docs/chapter1/features_tour/features_tour.md b/docs/chapter1/features_tour/features_tour.md
@@ -0,0 +1,308 @@
+---
+comments: true
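+title: Introduction to Features
+---
+
+## Preface
+
+The `Features` class is a special dictionary that defines the structure of a dataset. It expects the format `dict[str, FieldType]`, where each key is a column name and each value is the corresponding data type.
+
+The supported `FieldType` types are listed in the [HuggingFace documentation on `FieldType`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Features); the table below summarizes them.
+
+| `FieldType`                                   | Description                                                                                                                                                    |
+| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `Value`                                       | Specifies a single data type, such as `int64` or `string`.                                                                                                     |
+| `ClassLabel`                                  | Specifies a predefined set of classes, which can have labels associated with them and are stored as integers in the dataset, for example `['bad', 'ok', 'good']`. |
+| `dict`                                        | Specifies a composite feature that maps sub-fields to sub-features. Sub-fields can be nested arbitrarily.                                                      |
+| `list`, `LargeList`, `Sequence`               | Specifies a composite feature that contains a sequence of sub-features, all of the same type.                                                                  |
+| `Array2D`, `Array3D`, `Array4D`, `Array5D`    | Used to store multidimensional arrays.                                                                                                                         |
+| `Audio`                                       | Used to store an absolute path to an audio file, or a dictionary containing the relative path to an audio file and its bytes content.                          |
+| `Image`                                       | Used to store an absolute path to an image file, a `NumPy` array, a `PIL` object, or a `dict` containing the relative path to an image file and its bytes content. |
+| `Translation`, `TranslationVariableLanguages` | Specific to machine translation.                                                                                                                               |
+
+## `Features` attributes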
+
+### Simple dataset definition
+
+```python
+from datasets import Features, Value, ClassLabel
+
+features = Features(
+    {
+        "text": Value(dtype="string"),
+        # num_classes must equal len(names), otherwise ClassLabel raises a ValueError.
+        "label": ClassLabel(num_classes=2, names=["negative", "positive"]),
+    }
+)
+```
+
+This example defines a simple dataset structure with two features.
+
+- `text`: string type, used to store the text data.
+- `label`: class label type, used to store the sentiment label, which is either `negative` or `positive`.
+
+```python title="features"
+{
+    "text": Value(dtype="string", id=None),
+    "label": ClassLabel(names=["negative", "positive"], id=None),
+}
+```
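A class label column stores integers and keeps the names alongside them; the library's `ClassLabel` offers `str2int` and `int2str` for the conversion. The following plain-Python sketch only mimics that mapping for the two names used above; `encode_label` and `decode_label` are hypothetical helpers, not part of `datasets`.

```python
# The position of each name in the list is the integer stored in the dataset.
names = ["negative", "positive"]
name_to_id = {name: index for index, name in enumerate(names)}


def encode_label(name: str) -> int:
    """Map a class name to its integer id; unknown names are rejected."""
    if name not in name_to_id:
        raise ValueError(f"unknown label: {name!r}")
    return name_to_id[name]


def decode_label(index: int) -> str:
    """Map an integer id back to its class name."""
    if not 0 <= index < len(names):
        raise ValueError(f"label id out of range: {index}")
    return names[index]


encoded = [encode_label(name) for name in ["positive", "negative", "positive"]]
decoded = [decode_label(index) for index in encoded]
print(encoded, decoded)
```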
+
+### Composite dataset definition
+
+```python
+from datasets import Features, Value, ClassLabel, Sequence
+
+features = Features(
+    {
+        "text": Value(dtype="string"),
+        "entities": Sequence(
+            {
+                "start": Value(dtype="int64"),
+                "end": Value(dtype="int64"),
+                "label": ClassLabel(num_classes=3, names=["PERSON", "ORG", "LOC"]),
+            }
+        ),
+    }
+)
+```
+
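+This example defines a dataset structure that contains the composite feature `entities`.
+
+- `text`: string type, used to store the text data.
+- `entities`: sequence type, used to store the entities found in the text. Each entity has three features.
+    - `start`: integer type, the start position of the entity in the text.
+    - `end`: integer type, the end position of the entity in the text.
+    - `label`: class label type, the category of the entity, one of `PERSON`, `ORG`, or `LOC`.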
+
+```python title="features"
+{
+    "text": Value(dtype="string", id=None),
+    "entities": Sequence(
+        feature={
+            "start": Value(dtype="int64", id=None),
+            "end": Value(dtype="int64", id=None),
+            "label": ClassLabel(names=["PERSON", "ORG", "LOC"], id=None),
+        },
+        length=-1,
+        id=None,
+    ),
+}
+```
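As far as the `datasets` documentation describes it, a `Sequence` that wraps a dictionary is stored as a dictionary of lists rather than a list of dictionaries. The sketch below shows that reshaping in plain Python under that assumption; `to_columnar` and the sample entities are hypothetical and serve only to illustrate the layout.

```python
def to_columnar(records: list[dict]) -> dict[str, list]:
    """Turn a list of same-keyed dicts into one list per key."""
    if not records:
        return {}
    keys = list(records[0])
    for record in records:
        if list(record) != keys:
            raise ValueError("all records must have the same keys in the same order")
    return {key: [record[key] for record in records] for key in keys}


# Two made-up entities; `label` holds class ids (0 = PERSON, 1 = ORG, 2 = LOC).
entities = [
    {"start": 0, "end": 5, "label": 0},
    {"start": 10, "end": 16, "label": 1},
]
columnar = to_columnar(entities)
print(columnar)
```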
+
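+### Multidimensional arrays
+
+```python
+from datasets import Features, Array3D, Value
+
+features = Features(
+    {
+        # A (height, width, channels) shape has three dimensions, so it needs Array3D.
+        "image": Array3D(shape=(224, 224, 3), dtype="float32"),
+        "label": Value("int64"),
+    }
+)
+```
+
+This example defines a dataset structure that contains an `image` feature.
+
+- `image`: multidimensional array type, used to store image data with shape `(224, 224, 3)` and data type `float32`. The number of dimensions in the shape must match the array class, hence `Array3D`.
+- `label`: integer type, used to store the class label of the image.
+
+```python title="features"
+{
+    "image": Array3D(shape=(224, 224, 3), dtype="float32", id=None),
+    "label": Value(dtype="int64", id=None),
+}
+```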
+
+### Audio data
+
+```python
+from datasets import Features, Audio, ClassLabel
+
+features = Features(
+    {
+        "audio": Audio(sampling_rate=44100),
+        "label": ClassLabel(num_classes=2, names=["negative", "positive"]),
+    }
+)
+```
+
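+This example defines a dataset structure that contains the `audio` and `label` features.
+
+- `audio`: audio type, used to store audio data with a sampling rate of `44100 Hz`.
+- `label`: class label type, used to store the sentiment label of the audio, stored as an integer.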
+
+```python title="features"
+{
+    "audio": Audio(sampling_rate=44100, mono=True, decode=True, id=None),
+    "label": ClassLabel(names=["negative", "positive"], id=None),
+}
+```
+
+### Machine translation
+
+```python
+from datasets import Features, Translation, Value
+
+features = Features(
+    {
+        "source_text": Value(dtype="string"),
+        "target_text": Translation(languages=["en", "fr"]),
+    }
+)
+```
+
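+This example defines a dataset structure that contains the `source_text` and `target_text` features.
+
+- `source_text`: string type, used to store the source-language text.
+- `target_text`: translation type, used to store the target-language text, with English and French as the supported languages.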
+
+```python title="features"
+{
+    "source_text": Value(dtype="string", id=None),
+    "target_text": Translation(languages=["en", "fr"], id=None),
+}
+```
+
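+### Other
+
+The full list of supported `Value` data types is in the [HuggingFace documentation on `Value`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Value); the table below summarizes the commonly used ones.
+
+| Data type | Description                      | Data type | Description                         |
+| --------- | -------------------------------- | --------- | ----------------------------------- |
+| `null`    | Indicates the absence of a value | `float32` | 32-bit floating point number        |
+| `bool`    | Boolean value                    | `float64` | 64-bit floating point number        |
+| `int32`   | 32-bit signed integer            | `date64`  | Date that includes time information |
+| `int64`   | 64-bit signed integer            | `string`  | Text data                           |
+| $\cdots$  | $\cdots$                         | $\cdots$  | $\cdots$                            |
+
+Below is the dataset page for `mrpc`. The page uses `Features` to display the column names and their data types correctly on the dataset card.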
+
+<iframe
+  src="https://huggingface.co/datasets/nyu-mll/glue/embed/viewer/mrpc/train"
+  frameborder="0"
+  width="100%"
+  height="560px"
+></iframe>
+
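+## `Features` methods
+
+The full list of `Features` methods is in the [HuggingFace documentation on `Features` methods](https://huggingface.co/docs/datasets/v3.1.0/en/package_reference/main_classes#datasets.Features); the table below summarizes the commonly used ones.
+
+| Method              | Description                                                                                              |
+| ------------------- | -------------------------------------------------------------------------------------------------------- |
+| `from_dict`         | Builds `Features` from a dictionary.                                                                     |
+| `to_dict`           | Returns a dictionary representation of the features.                                                     |
+| `copy`              | Makes a deep copy of the `Features` object.                                                              |
+| `reorder_fields_as` | Reorders the fields to match the order of another `Features` object.                                     |
+| `flatten`           | Flattens the features by removing nested dictionaries and creating new columns with concatenated names. |
+| $\cdots$            | $\cdots$                                                                                                 |
+
+### The `from_dict` method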
+
+```python
+from datasets import Features
+
+Features.from_dict({"text": { "_type": "Value", "dtype": "string", "id": None}})
+```
+
+This example uses the `from_dict` method to create a `Features` object from a dictionary.
+
+```python title='Features.from_dict({"text": {"_type": "Value", "dtype": "string", "id": None,}})'
+{"text": Value(dtype="string", id=None)}
+```
+
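+### The `to_dict` method
+
+```python
+from datasets import Features, Value
+
+features = Features(
+    {
+        "text": Value(dtype="string"),
+        "label": Value(dtype="int64"),
+    }
+)
+
+# Serialize the schema to a plain dictionary.
+features.to_dict()
+```
+
+This example first creates `features` and then uses the `to_dict` method to return them in dictionary form.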
+
+```python title="features.to_dict()"
+{
+    "text": {"dtype": "string", "_type": "Value"},
+    "label": {"dtype": "int64", "_type": "Value"},
+}
+```
+
+### The `reorder_fields_as` method
+
+```python
+from datasets import Features, Value, ClassLabel
+
+features = Features(
+    {
+        "text": Value("string"),
+        "label": ClassLabel(names=["positive", "negative"]),
+    }
+)
+
+other_features = Features(
+    {
+        "label": ClassLabel(names=["positive", "negative"]),
+        "text": Value("string"),
+    }
+)
+reordered_features = features.reorder_fields_as(other_features)
+```
+
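+This example creates two `Features` objects whose fields are in different orders, then uses `reorder_fields_as` to reorder the fields of `features` to match the field order of `other_features`.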
+
+```python title="reordered_features"
+{
+    "label": ClassLabel(names=["positive", "negative"], id=None),
+    "text": Value(dtype="string", id=None),
+}
+```
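The reordering shown above can be pictured as rebuilding a dictionary by walking the keys of a template. The sketch below does this for plain nested dictionaries; `reorder_like` is a hypothetical helper and is not the library's implementation, which also handles sequences and other feature types.

```python
def reorder_like(source: dict, template: dict) -> dict:
    """Return a copy of `source` whose keys follow the order of `template`."""
    if set(source) != set(template):
        raise ValueError("both dictionaries must have the same keys")
    reordered = {}
    for key in template:
        value = source[key]
        # Recurse so nested dictionaries follow the template order as well.
        if isinstance(value, dict) and isinstance(template[key], dict):
            value = reorder_like(value, template[key])
        reordered[key] = value
    return reordered


# Type names are plain strings here, standing in for feature objects.
schema = {"text": "string", "meta": {"lang": "string", "length": "int64"}}
template = {"meta": {"length": "int64", "lang": "string"}, "text": "string"}
reordered = reorder_like(schema, template)
print(list(reordered), list(reordered["meta"]))
```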
+
+### The `flatten` method
+
+```python
+from datasets import Features, Value
+
+nested_features = Features(
+    {
+        "a": Value("string"),
+        "b": {
+            "c": Value("int32"),
+            "d": Value("float32"),
+        },
+    }
+)
+
+flattened_features = nested_features.flatten()
+```
+
+```python title="nested_features"
+{
+    "a": Value(dtype="string", id=None),
+    "b": {"c": Value(dtype="int32", id=None), "d": Value(dtype="float32", id=None)},
+}
+```
+
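+This example uses the `flatten` method to flatten the features: nested dictionaries are removed and replaced by new columns whose names are joined with a dot.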
+
+```python title="flattened_features"
+{
+    "a": Value(dtype="string", id=None),
+    "b.c": Value(dtype="int32", id=None),
+    "b.d": Value(dtype="float32", id=None),
+}
+```
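The dotted column names come from joining the parent key and the child key at each level of nesting. A minimal recursive sketch over plain dictionaries is shown below; `flatten_columns` is a hypothetical helper that uses strings in place of feature objects, and it is not the library's implementation.

```python
def flatten_columns(schema: dict, prefix: str = "") -> dict:
    """Flatten nested dictionaries into one level with dot-joined keys."""
    flat = {}
    for name, value in schema.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Descend into the nested dictionary, carrying the joined name along.
            flat.update(flatten_columns(value, full_name))
        else:
            flat[full_name] = value
    return flat


nested = {"a": "string", "b": {"c": "int32", "d": "float32"}}
flattened = flatten_columns(nested)
print(flattened)
```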
+
+## References
+
+<div class="grid cards" markdown>
+
+1. [HuggingFace documentation on `FieldType`/`Features`](https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Features)
+2. [HuggingFace documentation on `Value`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Value)
+
+</div>
diff --git a/docs/chapter1/map.md b/docs/chapter1/map.md