
Commit

update docs on N-dim arrays (huggingface#6956)
* update docs on N-dim arrays

* Update use_with_tensorflow.mdx

* Update use_with_jax.mdx

* Update use_with_jax.mdx

* Update use_with_tensorflow.mdx
lhoestq authored Jun 4, 2024
1 parent f717006 commit 336512d
Showing 3 changed files with 62 additions and 16 deletions.
38 changes: 32 additions & 6 deletions docs/source/use_with_jax.mdx
@@ -79,18 +79,44 @@ device which is `jax.devices()[0]`.

## N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a JAX formatted dataset outputs a `DeviceArray` object, which is a numpy-like array, so it does not
need the [`Array`] feature type to be specified as opposed to PyTorch or TensorFlow formatters.
If your dataset consists of N-dimensional arrays, by default they are returned as a single array when the shape is fixed, and as a list of arrays otherwise:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]]
>>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]] # fixed shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("jax")
>>> ds[0]
{'data': DeviceArray([[1, 2],
[3, 4]], dtype=int32)}
{'data': Array([[1, 2],
[3, 4]], dtype=int32)}
```

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3]], [[4, 5, 6],[7, 8]]] # varying shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("jax")
>>> ds[0]
{'data': [Array([1, 2], dtype=int32), Array([3], dtype=int32)]}
```

However, this logic often requires slow shape comparisons and data copies. To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:

```py
>>> from datasets import Dataset, Features, Array2D
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
>>> ds = Dataset.from_dict({"data": data}, features=features)
>>> ds = ds.with_format("jax")
>>> ds[0]
{'data': Array([[1, 2],
[3, 4]], dtype=int32)}
>>> ds[:2]
{'data': Array([[[1, 2],
[3, 4]],

[[5, 6],
[7, 8]]], dtype=int32)}
```
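
The shape comparison mentioned above can be illustrated with a minimal pure-Python sketch. This is not the `datasets` implementation: `fixed_shape` is a hypothetical helper written only for this example. It shows why inferring a shape is costly, since every nested element has to be visited and compared before the values can be stacked into one array.

```py
from typing import Any, Optional, Tuple


def fixed_shape(value: Any) -> Optional[Tuple[int, ...]]:
    """Return the shape of a nested list if it is rectangular, else None.

    Hypothetical helper for illustration; not part of the `datasets` API.
    """
    if not isinstance(value, list):
        return ()  # a scalar has an empty shape
    if not value:
        return (0,)
    # Every child is visited, which is what makes this check slow on large data.
    child_shapes = [fixed_shape(item) for item in value]
    first = child_shapes[0]
    # All children must be rectangular and share the same shape.
    if first is None or any(shape != first for shape in child_shapes):
        return None
    return (len(value),) + first


fixed = [[1, 2], [3, 4]]   # same row length everywhere
varying = [[1, 2], [3]]    # rows of different lengths

print(fixed_shape(fixed))
print(fixed_shape(varying))
```

Declaring the shape up front with an [`Array`] feature type makes a check like this unnecessary, because the shape is already known from the features.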

## Other feature types
19 changes: 14 additions & 5 deletions docs/source/use_with_pytorch.mdx
@@ -40,19 +40,28 @@ To load the data as tensors on a GPU, specify the `device` argument:

## N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor:
If your dataset consists of N-dimensional arrays, by default they are returned as a single tensor when the shape is fixed, and as a list of tensors otherwise:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] # fixed shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
[3, 4]])}
```

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]] # varying shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
{'data': [tensor([1, 2]), tensor([3])]}
```

To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
However, this logic often requires slow shape comparisons and data copies. To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:

```py
>>> from datasets import Dataset, Features, Array2D
21 changes: 16 additions & 5 deletions docs/source/use_with_tensorflow.mdx
@@ -41,19 +41,30 @@ array([[1, 2],

## N-dimensional arrays

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:
If your dataset consists of N-dimensional arrays, by default they are returned as a single tensor when the shape is fixed.
Otherwise, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]] # fixed shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
[3, 4]])>}
```

```py
>>> from datasets import Dataset
>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]] # varying shape
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3]]>}
```

To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
However, this logic often requires slow shape comparisons and data copies. To avoid this, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:

```py
>>> from datasets import Dataset, Features, Array2D
