to_tf_dataset consumes too much memory #5855

Closed
massquantity opened this issue May 14, 2023 · 6 comments · Fixed by #5863

Comments

@massquantity

massquantity commented May 14, 2023

Describe the bug

Hi, I'm using to_tf_dataset to convert a large dataset to tf.data.Dataset. I observed that the data loading before training took a lot of time and memory, even with batch_size=1.

After some digging, I believe the reason lies in the shuffle behavior. The source code uses len(dataset) as the buffer_size, which may load all of the data into memory, and the tf.data docs also state that "While large buffer_sizes shuffle more thoroughly, they can take a lot of memory, and significant time to fill".

Steps to reproduce the bug

from datasets import Dataset

def gen():  # some large data
    for i in range(50000000):
        yield {"data": i}

ds = Dataset.from_generator(gen, cache_dir="./huggingface")

tf_ds = ds.to_tf_dataset(
    batch_size=64,
    shuffle=False,  # no shuffle
    drop_remainder=False,
    prefetch=True,
)

# fast and memory friendly 🤗
for batch in tf_ds: 
    ...

tf_ds_shuffle = ds.to_tf_dataset(
    batch_size=64,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
)

# slow and memory hungry for simple iteration 😱
for batch in tf_ds_shuffle: 
    ...

Expected behavior

Shuffling should not load all the data into the memory. Would adding a buffer_size parameter in the to_tf_dataset API alleviate the problem?
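For reference, a bounded shuffle buffer in plain tf.data would look roughly like the sketch below (the 10_000 buffer size is only an illustrative assumption, not a recommendation):

import tensorflow as tf

# Windowed shuffle with a bounded buffer: only `buffer_size` elements are held
# in memory at once, trading shuffle thoroughness for a much smaller footprint.
dataset = tf.data.Dataset.range(50_000_000)
dataset = dataset.shuffle(buffer_size=10_000)  # ~80 KB of int64 elements instead of ~400 MB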

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-5.17.1-051701-generic-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 1.4.3
@lhoestq
Member

lhoestq commented May 15, 2023

Cc @amyeroberts @Rocketknight1

Indeed, I think it's because it does something like this under the hood when there's no multiprocessing:

tf_dataset = tf_dataset.shuffle(len(dataset))

PS: with multiprocessing it appears to be different:

indices = np.arange(len(dataset))
if shuffle:
    np.random.shuffle(indices)

@Rocketknight1
Member

Hi @massquantity, the dataset being shuffled there is not the full dataset. If you look at the line above, the dataset is actually just a single indices array at that point, and that array is the only thing that gets fully loaded into memory and shuffled. We then load samples from the dataset by applying a transform function to the shuffled dataset, which fetches samples based on the indices it receives.

If your dataset is really gigantic, then this index tensor might be a memory issue, but since it's just an int64 tensor it will only use 1GB of memory per 125 million samples.
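In pseudo-code, the pattern looks roughly like this (a sketch, not the exact library internals; fetch_sample is a placeholder for the real transform):

import numpy as np
import tensorflow as tf

# Only the int64 index array passes through the shuffle buffer; samples are
# fetched afterwards by a transform applied to each shuffled index.
num_rows = 1_000_000
index_ds = tf.data.Dataset.from_tensor_slices(np.arange(num_rows, dtype=np.int64))
index_ds = index_ds.shuffle(num_rows)

def fetch_sample(idx):
    # placeholder: the real pipeline reads row `idx` from the underlying Arrow table
    return {"data": idx}

sample_ds = index_ds.map(fetch_sample)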

Still, if you're encountering memory issues, there might be another cause here - can you share some code to reproduce the error, or does it depend on some internal/proprietary dataset?

@massquantity
Author

massquantity commented May 15, 2023

Hi @Rocketknight1, you're right, and I also noticed that only the indices are shuffled. My data has shape (50000000, 10), but the problem isn't tied to a specific dataset. Simply running the following code costs me 10GB of memory.

from datasets import Dataset

def gen():
    for i in range(50000000):
        yield {"data": i}

ds = Dataset.from_generator(gen, cache_dir="./huggingface")

tf_ds = ds.to_tf_dataset(
    batch_size=1,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
)
tf_ds = iter(tf_ds)
next(tf_ds)
# {'data': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>}

I just realized it might be an issue with TensorFlow itself (I'm using tf 2.12). So I tried the following code, and it also used 10GB of memory.

import numpy as np
import tensorflow as tf

data_size = 50000000
tf_dataset = tf.data.Dataset.from_tensor_slices(np.arange(data_size))
tf_dataset = iter(tf_dataset.shuffle(data_size))
next(tf_dataset)
# <tf.Tensor: shape=(), dtype=int64, numpy=24774043>

By the way, as @lhoestq mentioned, multiprocessing uses numpy shuffling, and it uses less than 1 GB of memory:

tf_ds_mp = ds.to_tf_dataset(
    batch_size=1,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
    num_workers=2,
)

@Rocketknight1
Member

Thanks for that reproduction script - I've confirmed the same issue is occurring for me. Investigating it now!

@Rocketknight1
Member

Update: the memory usage occurs when creating the index tensor and the shuffle buffer. You can reproduce it very simply with:

import tensorflow as tf
indices = tf.range(50_000_000, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices(indices)
dataset = dataset.shuffle(len(dataset))
print(next(iter(dataset)))

When I wrote this code I thought tf.data had an optimization for shuffling an entire tensor that wouldn't create the entire shuffle buffer, but evidently it's just creating the enormous buffer in memory. I'll see if I can find a more efficient way to do this - we might end up moving everything to the numpy multiprocessing path to avoid it.
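The numpy path shuffles the index array up front instead of relying on a tf.data shuffle buffer, so nothing beyond the index array itself is held in memory. Roughly (a sketch of that approach, not the final fix):

import numpy as np
import tensorflow as tf

# Shuffle the indices once in numpy (~400 MB of int64 for 50M rows), then
# stream them through tf.data without allocating any shuffle buffer.
num_rows = 50_000_000
indices = np.random.permutation(num_rows)

index_ds = tf.data.Dataset.from_generator(
    lambda: iter(indices),
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64),
)
print(next(iter(index_ds)))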

@Rocketknight1
Member

I opened a PR to fix this - will continue the discussion there!
