to_tf_dataset consumes too much memory #5855

Closed
massquantity opened this issue May 14, 2023 · 6 comments · Fixed by #5863

Comments

@massquantity

massquantity commented May 14, 2023

Describe the bug

Hi, I'm using to_tf_dataset to convert a large dataset to tf.data.Dataset. I observed that the data loading before training took a lot of time and memory, even with batch_size=1.

After some digging, I believe the reason lies in the shuffle behavior. The source code uses len(dataset) as the buffer_size, which may load all of the data into memory, and the tf.data docs also state that "While large buffer_sizes shuffle more thoroughly, they can take a lot of memory, and significant time to fill".

Steps to reproduce the bug

from datasets import Dataset

def gen():  # some large data
    for i in range(50000000):
        yield {"data": i}

ds = Dataset.from_generator(gen, cache_dir="./huggingface")

tf_ds = ds.to_tf_dataset(
    batch_size=64,
    shuffle=False,  # no shuffle
    drop_remainder=False,
    prefetch=True,
)

# fast and memory friendly 🤗
for batch in tf_ds: 
    ...

tf_ds_shuffle = ds.to_tf_dataset(
    batch_size=64,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
)

# slow and memory hungry for simple iteration 😱
for batch in tf_ds_shuffle: 
    ...

Expected behavior

Shuffling should not load all the data into the memory. Would adding a buffer_size parameter in the to_tf_dataset API alleviate the problem?
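For reference, a bounded shuffle buffer in plain tf.data would look roughly like the sketch below (the 10_000 buffer size is only an illustrative assumption, not a recommendation):

import tensorflow as tf

# Windowed shuffle with a bounded buffer: only `buffer_size` elements are held
# in memory at once, trading shuffle thoroughness for a much smaller footprint.
dataset = tf.data.Dataset.range(50_000_000)
dataset = dataset.shuffle(buffer_size=10_000)  # ~80 KB of int64 elements instead of ~400 MB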

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-5.17.1-051701-generic-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.13.4
  • PyArrow version: 11.0.0
  • Pandas version: 1.4.3
@lhoestq
Member

lhoestq commented May 15, 2023

Cc @amyeroberts @Rocketknight1

Indeed, I think it's because it does something like this under the hood when there's no multiprocessing:

tf_dataset = tf_dataset.shuffle(len(dataset))

PS: with multiprocessing it appears to be different:

indices = np.arange(len(dataset))
if shuffle:
    np.random.shuffle(indices)

@Rocketknight1
Member

Hi @massquantity, the dataset being shuffled there is not the full dataset. If you look at the line above, the dataset is actually just a single indices array at that point, and that array is the only thing that gets fully loaded into memory and shuffled. We then load samples from the dataset by applying a transform function to the shuffled dataset, which fetches samples based on the indices it receives.

If your dataset is really gigantic, then this index tensor might be a memory issue, but since it's just an int64 tensor it will only use 1GB of memory per 125 million samples.
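In pseudo-code, the pattern looks roughly like this (a sketch, not the exact library internals; fetch_sample is a placeholder for the real transform):

import numpy as np
import tensorflow as tf

# Only the int64 index array passes through the shuffle buffer; samples are
# fetched afterwards by a transform applied to each shuffled index.
num_rows = 1_000_000
index_ds = tf.data.Dataset.from_tensor_slices(np.arange(num_rows, dtype=np.int64))
index_ds = index_ds.shuffle(num_rows)

def fetch_sample(idx):
    # placeholder: the real pipeline reads row `idx` from the underlying Arrow table
    return {"data": idx}

sample_ds = index_ds.map(fetch_sample)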

Still, if you're encountering memory issues, there might be another cause here - can you share some code to reproduce the error, or does it depend on some internal/proprietary dataset?

@massquantity
Author

massquantity commented May 15, 2023

Hi @Rocketknight1, you're right, and I also noticed that only the indices are shuffled. My data has shape (50000000, 10), but the problem isn't tied to a specific dataset. Simply running the following code costs me 10GB of memory.

from datasets import Dataset

def gen():
    for i in range(50000000):
        yield {"data": i}

ds = Dataset.from_generator(gen, cache_dir="./huggingface")

tf_ds = ds.to_tf_dataset(
    batch_size=1,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
)
tf_ds = iter(tf_ds)
next(tf_ds)
# {'data': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>}

I just realized it might be an issue with TensorFlow itself (I'm using tf 2.12). So I tried the following code, and it also used 10GB of memory.

import numpy as np
import tensorflow as tf

data_size = 50000000
tf_dataset = tf.data.Dataset.from_tensor_slices(np.arange(data_size))
tf_dataset = iter(tf_dataset.shuffle(data_size))
next(tf_dataset)
# <tf.Tensor: shape=(), dtype=int64, numpy=24774043>

By the way, as @lhoestq mentioned, multiprocessing uses numpy shuffling, and it uses less than 1 GB of memory:

tf_ds_mp = ds.to_tf_dataset(
    batch_size=1,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
    num_workers=2,
)

@Rocketknight1
Member

Thanks for that reproduction script - I've confirmed the same issue is occurring for me. Investigating it now!

@Rocketknight1
Member

Update: the memory usage occurs when creating the index tensor and the shuffle buffer. You can reproduce it very simply with:

import tensorflow as tf
indices = tf.range(50_000_000, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices(indices)
dataset = dataset.shuffle(len(dataset))
print(next(iter(dataset)))

When I wrote this code I thought tf.data had an optimization for shuffling an entire tensor that wouldn't create the entire shuffle buffer, but evidently it's just creating the enormous buffer in memory. I'll see if I can find a more efficient way to do this - we might end up moving everything to the numpy multiprocessing path to avoid it.
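The numpy path shuffles the index array up front instead of relying on a tf.data shuffle buffer, so nothing beyond the index array itself is held in memory. Roughly (a sketch of that approach, not the final fix):

import numpy as np
import tensorflow as tf

# Shuffle the indices once in numpy (~400 MB of int64 for 50M rows), then
# stream them through tf.data without allocating any shuffle buffer.
num_rows = 50_000_000
indices = np.random.permutation(num_rows)

index_ds = tf.data.Dataset.from_generator(
    lambda: iter(indices),
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int64),
)
print(next(iter(index_ds)))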

@Rocketknight1
Member

I opened a PR to fix this - will continue the discussion there!
