`to_tf_dataset` consumes too much memory #5855
Comments
Cc @amyeroberts @Rocketknight1. Indeed, I think it's because it does something like this under the hood when there's no multiprocessing:

```python
tf_dataset = tf_dataset.shuffle(len(dataset))
```

PS: with multiprocessing it appears to be different:

```python
indices = np.arange(len(dataset))
if shuffle:
    np.random.shuffle(indices)
```
Hi @massquantity, the dataset being shuffled there is not the full dataset. If you look at the line above, the dataset at that point is actually just a single indices array, and that array is the only thing that gets fully loaded into memory and shuffled. We then load samples from the dataset by applying a transform function to the shuffled indices, which fetches each sample by the index it receives. If your dataset is really gigantic, this index tensor might be a memory issue, but since it's just an int64 tensor it only uses about 1 GB of memory per 125 million samples. Still, if you're encountering memory issues, there might be another cause here - can you share some code to reproduce the error, or does it depend on some internal/proprietary dataset?
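For readers following along, here is a rough sketch of the lazy index-plus-transform pattern described above. It is illustrative only - `fetch_row` is a stand-in for the real transform, and this is not the actual `datasets` source:

```python
import numpy as np
import tensorflow as tf

# Only an index tensor lives in tf.data; real rows are fetched lazily.
# At 8 bytes per int64, 125M indices is roughly 1 GB, as noted above.
num_rows = 1_000_000
index_dataset = tf.data.Dataset.from_tensor_slices(
    np.arange(num_rows, dtype=np.int64)
)
index_dataset = index_dataset.shuffle(num_rows)  # buffers indices, not rows

def fetch_row(idx):
    # Stand-in for the transform that pulls a sample out of the
    # Arrow-backed dataset by index.
    return idx * 2

samples = index_dataset.map(
    lambda i: tf.py_function(fetch_row, [i], tf.int64)
)
```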
Hi @Rocketknight1, you're right, and I also noticed that only indices are used in shuffling. My data has shape (50000000, 10), but the problem doesn't actually relate to a specific dataset. Simply running the following code costs me 10 GB of memory:

```python
from datasets import Dataset

def gen():
    for i in range(50000000):
        yield {"data": i}

ds = Dataset.from_generator(gen, cache_dir="./huggingface")

tf_ds = ds.to_tf_dataset(
    batch_size=1,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
)

tf_ds = iter(tf_ds)
next(tf_ds)
# {'data': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>}
```

I just realized it might be an issue with TensorFlow itself (I'm using tf 2.12). So I tried the following code, and it used 10 GB of memory too:

```python
import numpy as np
import tensorflow as tf

data_size = 50000000
tf_dataset = tf.data.Dataset.from_tensor_slices(np.arange(data_size))
tf_dataset = iter(tf_dataset.shuffle(data_size))
next(tf_dataset)
# <tf.Tensor: shape=(), dtype=int64, numpy=24774043>
```

By the way, as @lhoestq mentioned, multiprocessing uses numpy shuffling, and it uses less than 1 GB of memory:

```python
tf_ds_mp = ds.to_tf_dataset(
    batch_size=1,
    shuffle=True,
    drop_remainder=False,
    prefetch=True,
    num_workers=2,
)
```
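A back-of-the-envelope calculation based on the numbers reported above suggests where the overhead comes from, on the assumption that tf.data's shuffle buffer holds each element as an individual tensor object rather than as one packed array:

```python
# Assumption: per-element tensor overhead in the shuffle buffer, not
# the raw data, dominates memory use.
data_size = 50_000_000
packed = data_size * 8             # ~400 MB if stored as one packed int64 array
observed = 10 * 1024**3            # ~10 GB reported above
per_element = observed / data_size
print(packed / 1024**2)            # ~381 MB packed size
print(per_element)                 # ~215 bytes per buffered element
```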
Thanks for that reproduction script - I've confirmed the same issue is occurring for me. Investigating it now!
Update: the memory usage is occurring in the creation of the index and shuffle buffer. You can reproduce it very simply with:

```python
import tensorflow as tf

indices = tf.range(50_000_000, dtype=tf.int64)
dataset = tf.data.Dataset.from_tensor_slices(indices)
dataset = dataset.shuffle(len(dataset))
print(next(iter(dataset)))
```

When I wrote this code I thought …
I opened a PR to fix this - will continue the discussion there!
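The PR itself isn't quoted in this thread, so the following is only a minimal sketch of the kind of fix the discussion implies - reusing the numpy-based shuffling already used by the multiprocessing path, so that tf.data never needs a shuffle buffer the size of the whole dataset:

```python
import numpy as np
import tensorflow as tf

# Sketch (assumption): shuffle a numpy index array up front, then build
# the dataset from the already-shuffled indices. Shuffling happens in
# numpy (~400 MB in-place for 50M int64s), so no tf.data shuffle buffer
# is needed at all.
data_size = 50_000_000
indices = np.arange(data_size)
np.random.shuffle(indices)
tf_dataset = tf.data.Dataset.from_tensor_slices(indices)
# Samples are then fetched per-index by the transform, as before.
```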
Describe the bug

Hi, I'm using `to_tf_dataset` to convert a large dataset to `tf.data.Dataset`. I observed that the data loading before training took a lot of time and memory, even with `batch_size=1`.

After some digging, I believe the reason lies in the shuffle behavior. The source code uses `len(dataset)` as the `buffer_size`, which may load all the data into memory, and the tf.data docs also state that "While large buffer_sizes shuffle more thoroughly, they can take a lot of memory, and significant time to fill".

Steps to reproduce the bug

(See the reproduction script in the comments above.)
Expected behavior

Shuffling should not load all the data into memory. Would adding a `buffer_size` parameter to the `to_tf_dataset` API alleviate the problem?
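To illustrate the suggestion, here is a hypothetical sketch - `buffer_size` here is the proposed parameter, not an existing `to_tf_dataset` argument - showing how a bounded buffer caps memory while still shuffling:

```python
import tensorflow as tf

# Hypothetical: cap the shuffle buffer instead of using len(dataset).
# A bounded buffer holds at most `buffer_size` elements in memory,
# trading shuffle thoroughness for a fixed memory footprint.
buffer_size = 10_000
indices = tf.data.Dataset.from_tensor_slices(
    tf.range(50_000_000, dtype=tf.int64)
)
shuffled = indices.shuffle(buffer_size)  # ~80 KB of buffered int64s, not ~10 GB
```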
Environment info

- `datasets` version: 2.11.0