update docs on N-dim arrays (huggingface#6956)

* update docs on N-dim arrays
* Update use_with_tensorflow.mdx
* Update use_with_jax.mdx
* Update use_with_jax.mdx
* Update use_with_tensorflow.mdx
EthanSteinberg · Jun 4, 2024 · 336512d
1 parent f717006

Showing 3 changed files with 62 additions and 16 deletions.
diff --git a/docs/source/use_with_jax.mdx b/docs/source/use_with_jax.mdx
@@ -79,18 +79,44 @@ device which is `jax.devices()[0]`.
 
 ## N-dimensional arrays
 
-If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
-In particular, a JAX formatted dataset outputs a `DeviceArray` object, which is a numpy-like array, so it does not
-need the [`Array`] feature type to be specified as opposed to PyTorch or TensorFlow formatters.
+If your dataset consists of N-dimensional arrays, by default each example is returned as a single array when the shape is fixed, and as a list of arrays when the shape varies:
 
 ```py
 >>> from datasets import Dataset
->>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]]
+>>> data = [[[1, 2],[3, 4]], [[5, 6],[7, 8]]]  # fixed shape
 >>> ds = Dataset.from_dict({"data": data})
 >>> ds = ds.with_format("jax")
 >>> ds[0]
-{'data': DeviceArray([[1, 2],
-             [3, 4]], dtype=int32)}
+{'data': Array([[1, 2],
+        [3, 4]], dtype=int32)}
+```
+
+```py
+>>> from datasets import Dataset
+>>> data = [[[1, 2],[3]], [[4, 5, 6],[7, 8]]]  # varying shape
+>>> ds = Dataset.from_dict({"data": data})
+>>> ds = ds.with_format("jax")
+>>> ds[0]
+{'data': [Array([1, 2], dtype=int32), Array([3], dtype=int32)]}
+```
+
+However, this logic often requires slow shape comparisons and data copies. To avoid this, explicitly use the [`Array`] feature type and specify the shape of your tensors:
+
+```py
+>>> from datasets import Dataset, Features, Array2D
+>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
+>>> features = Features({"data": Array2D(shape=(2, 2), dtype='int32')})
+>>> ds = Dataset.from_dict({"data": data}, features=features)
+>>> ds = ds.with_format("jax")
+>>> ds[0]
+{'data': Array([[1, 2],
+        [3, 4]], dtype=int32)}
+>>> ds[:2]
+{'data': Array([[[1, 2],
+         [3, 4]],
+
+        [[5, 6],
+         [7, 8]]], dtype=int32)}
 ```
 
 ## Other feature types

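Note on the "slow shape comparisons" mentioned in the added text: a minimal sketch of the kind of check a formatter would need before it can stack nested lists into one array. The helper `infer_fixed_shape` is hypothetical and written only for illustration; it is not part of `datasets` and does not claim to mirror the library's actual code path.

```python
from typing import Any, Optional, Tuple


def infer_fixed_shape(value: Any) -> Optional[Tuple[int, ...]]:
    """Return the shape of a nested list if it is rectangular, else None.

    Hypothetical helper for illustration only; it is not part of `datasets`.
    """
    if not isinstance(value, list):
        return ()  # a scalar has an empty shape
    child_shapes = [infer_fixed_shape(item) for item in value]
    if any(shape is None for shape in child_shapes):
        return None  # some child is already ragged
    if len(set(child_shapes)) > 1:
        return None  # children disagree on shape, so the value is ragged
    inner = child_shapes[0] if child_shapes else ()
    return (len(value),) + inner


fixed = [[1, 2], [3, 4]]   # same data as the "fixed shape" example
ragged = [[1, 2], [3]]     # same data as the "varying shape" example
print(infer_fixed_shape(fixed))   # (2, 2)
print(infer_fixed_shape(ragged))  # None
```

The check has to visit every element of every example, which is the cost that declaring `Array2D(shape=(2, 2), dtype='int32')` up front is meant to avoid.
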
diff --git a/docs/source/use_with_pytorch.mdx b/docs/source/use_with_pytorch.mdx
@@ -40,19 +40,28 @@ To load the data as tensors on a GPU, specify the `device` argument:
 
 ## N-dimensional arrays
 
-If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
-In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor:
+If your dataset consists of N-dimensional arrays, by default each example is returned as a single tensor when the shape is fixed, and as a list of tensors when the shape varies:
 
 ```py
 >>> from datasets import Dataset
->>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
+>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]  # fixed shape
+>>> ds = Dataset.from_dict({"data": data})
+>>> ds = ds.with_format("torch")
+>>> ds[0]
+{'data': tensor([[1, 2],
+         [3, 4]])}
+```
+
+```py
+>>> from datasets import Dataset
+>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]]  # varying shape
 >>> ds = Dataset.from_dict({"data": data})
 >>> ds = ds.with_format("torch")
 >>> ds[0]
-{'data': [tensor([1, 2]), tensor([3, 4])]}
+{'data': [tensor([1, 2]), tensor([3])]}
 ```
 
-To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
+However, this logic often requires slow shape comparisons and data copies. To avoid this, explicitly use the [`Array`] feature type and specify the shape of your tensors:
 
 ```py
 >>> from datasets import Dataset, Features, Array2D

diff --git a/docs/source/use_with_tensorflow.mdx b/docs/source/use_with_tensorflow.mdx
@@ -41,19 +41,30 @@ array([[1, 2],
 
 ## N-dimensional arrays
 
-If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
-In particular, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:
+If your dataset consists of N-dimensional arrays, by default each example is returned as a single tensor when the shape is fixed.
+Otherwise, a TensorFlow formatted dataset outputs a `RaggedTensor` instead of a single tensor:
 
 ```py
 >>> from datasets import Dataset
->>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
+>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]  # fixed shape
 >>> ds = Dataset.from_dict({"data": data})
 >>> ds = ds.with_format("tf")
 >>> ds[0]
-{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}
+{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
+ array([[1, 2],
+        [3, 4]])>}
+```
+
+```py
+>>> from datasets import Dataset
+>>> data = [[[1, 2],[3]],[[4, 5, 6],[7, 8]]]  # varying shape
+>>> ds = Dataset.from_dict({"data": data})
+>>> ds = ds.with_format("tf")
+>>> ds[0]
+{'data': <tf.RaggedTensor [[1, 2], [3]]>}
 ```
 
-To get a single tensor, you must explicitly use the [`Array`] feature type and specify the shape of your tensors:
+However, this logic often requires slow shape comparisons and data copies. To avoid this, explicitly use the [`Array`] feature type and specify the shape of your tensors:
 
 ```py
 >>> from datasets import Dataset, Features, Array2D