diff --git a/docs/source/upload_dataset.mdx b/docs/source/upload_dataset.mdx
index 57d54a3b58a..af63a65f7e8 100644
--- a/docs/source/upload_dataset.mdx
+++ b/docs/source/upload_dataset.mdx
@@ -25,12 +25,6 @@ A repository hosts all your dataset files, including the revision history, makin
 Text file extensions are not tracked by Git LFS by default, and if they're greater than 10MB, they will not be committed and uploaded. Take a look at the `.gitattributes` file in your repository for a complete list of tracked file extensions. For this tutorial, you can use the following sample `.csv` files since they're small: train.csv, test.csv.
-
-
-For additional dataset configuration options, like defining multiple configurations or enabling streaming, you'll need to write a dataset loading script. Check out how to write a dataset loading script for text, audio, and image datasets.
-
-
-
diff --git a/docs/source/use_with_pytorch.mdx b/docs/source/use_with_pytorch.mdx
index ca29318ccea..eeff73ef864 100644
--- a/docs/source/use_with_pytorch.mdx
+++ b/docs/source/use_with_pytorch.mdx
@@ -184,37 +184,6 @@ Reloading the dataset inside a worker doesn't fill up your RAM, since it simply
 >>> dataloader = DataLoader(ds, batch_size=32, num_workers=4)
 ```
 
-#### Use a BatchSampler (torch<=1.12.1)
-
-For old versions of PyTorch, using a `BatchSampler` can speed up data loading.
-Indeed if you are using `torch<=1.12.1`, the PyTorch `DataLoader` load batches of data from a dataset one by one like this:
-
-```py
-batch = [dataset[idx] for idx in range(start, end)]
-```
-
-Unfortunately, this does numerous read operations on the dataset.
-It is more efficient to query batches of examples using a list:
-
-```py
-batch = dataset[start:end]
-# or
-batch = dataset[list_of_indices]
-```
-
-For the PyTorch `DataLoader` to query batches using a list, you can use a `BatchSampler`:
-
-```py
->>> from torch.utils.data.sampler import BatchSampler, RandomSampler
->>> batch_sampler = BatchSampler(RandomSampler(ds), batch_size=32, drop_last=False)
->>> dataloader = DataLoader(ds, batch_sampler=batch_sampler)
-```
-
-Moreover, this is particularly useful if you used [`set_transform`] to apply a transform on-the-fly when examples are accessed.
-You must use a `BatchSampler` if you want the transform to be given full batches instead of receiving `batch_size` times one single element.
-
-Recent versions of PyTorch use a list of indices, so a `BatchSampler` is not needed to get the best speed even if you used [`set_transform`].
-
 ### Stream data
 
 Stream a dataset by loading it as an [`IterableDataset`]. This allows you to progressively iterate over a remote dataset without downloading it on disk and or over local data files.