Misc doc improvements #6074

Merged
merged 1 commit on Jul 27, 2023
6 changes: 0 additions & 6 deletions docs/source/upload_dataset.mdx
@@ -25,12 +25,6 @@ A repository hosts all your dataset files, including the revision history, making…

Text file extensions are not tracked by Git LFS by default, and if they're greater than 10MB, they will not be committed and uploaded. Take a look at the `.gitattributes` file in your repository for a complete list of tracked file extensions. For this tutorial, you can use the following sample `.csv` files since they're small: <a href="https://huggingface.co/datasets/stevhliu/demo/raw/main/train.csv" download>train.csv</a>, <a href="https://huggingface.co/datasets/stevhliu/demo/raw/main/test.csv" download>test.csv</a>.

<Tip warning={true}>

For additional dataset configuration options, like defining multiple configurations or enabling streaming, you'll need to write a dataset loading script. Check out how to write a dataset loading script for <a href="https://huggingface.co/docs/datasets/dataset_script#create-a-dataset-loading-script"><span class="underline decoration-green-400 decoration-2 font-semibold">text</span></a>, <a href="https://huggingface.co/docs/datasets/audio_dataset#loading-script"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio</span></a>, and <a href="https://huggingface.co/docs/datasets/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image</span></a> datasets.

</Tip>

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/upload_files.png"/>
</div>
31 changes: 0 additions & 31 deletions docs/source/use_with_pytorch.mdx
@@ -184,37 +184,6 @@ Reloading the dataset inside a worker doesn't fill up your RAM, since it simply memory-maps the dataset again from your disk.
>>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)
```

#### Use a BatchSampler (torch<=1.12.1)

For old versions of PyTorch, using a `BatchSampler` can speed up data loading.
Indeed, if you are using `torch<=1.12.1`, the PyTorch `DataLoader` loads batches of data from a dataset one item at a time, like this:

```py
batch = [dataset[idx] for idx in range(start, end)]
```

Unfortunately, this performs one read operation per example.
It is more efficient to query a batch of examples with a single list or slice:

```py
batch = dataset[start:end]
# or
batch = dataset[list_of_indices]
```

For the PyTorch `DataLoader` to query batches using a list, you can use a `BatchSampler`:

```py
>>> from torch.utils.data import DataLoader
>>> from torch.utils.data.sampler import BatchSampler, RandomSampler
>>> batch_sampler = BatchSampler(RandomSampler(ds), batch_size=32, drop_last=False)
>>> dataloader = DataLoader(ds, batch_sampler=batch_sampler)
```

Moreover, this is particularly useful if you used [`set_transform`] to apply a transform on-the-fly when examples are accessed.
A `BatchSampler` is required if you want the transform to receive full batches instead of being called `batch_size` times with a single element each.
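For example, here is a minimal sketch (not from the original docs) assuming a dataset with a hypothetical `text` column:

```py
>>> def uppercase(batch):
...     # with a BatchSampler, `batch` contains `batch_size` examples at once
...     batch["text"] = [text.upper() for text in batch["text"]]  # "text" is a hypothetical column
...     return batch
>>> ds.set_transform(uppercase)
```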

Recent versions of PyTorch query the dataset with a list of indices directly, so a `BatchSampler` is not needed to get the best speed, even if you used [`set_transform`].

### Stream data

Stream a dataset by loading it as an [`IterableDataset`]. This allows you to progressively iterate over a remote dataset without downloading it to disk, or over local data files.
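As an illustrative sketch (reusing the small `stevhliu/demo` dataset mentioned earlier in these docs):

```py
>>> from datasets import load_dataset
>>> ds = load_dataset("stevhliu/demo", split="train", streaming=True)  # returns an IterableDataset
>>> next(iter(ds))  # fetches the first example without downloading the full dataset
```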