tf.keras.layers.Embedding causes memory leak #30952
Comments
@pandrey-fr Hi Paul,
(Issue title changed from "tf.keras.layers.Embedding causes GPU memory leak" to "tf.keras.layers.Embedding causes memory leak".)
After additional testing, I found out the high memory usage is not exclusive to the GPU, and updated the initial post accordingly.
@rishabhsahrawat Hi, I unfortunately have no experience with GPU distribution strategies, and currently only have single-GPU machines at my disposal, hence I would not know how to help you... Sorry :/
I could reproduce the reported issue on Colab with TensorFlow 2.0.0-beta1. Please take a look at the Colab gist. Thanks!
Additional information: since the aforementioned post recommends taking the embedding lookup out of the training loop, I ran a modified version of the code where the embedding layers are declared outside of the instantiated Keras Model (which now takes pre-embedded Tensors as inputs) and are applied to the datasets at the time of their reshaping, as sketched below. This does not resolve the issue, and the GPU runs out of memory just as fast. However, loading the entire training set in (non-GPU) memory (e.g. as a list of Eager Tensors) only requires a marginal amount of RAM.
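A minimal sketch of that workaround (the embedding dimension and variable names are illustrative, not the actual gist code):

import tensorflow as tf

# Embedding layers declared outside of the Keras Model (dimensions illustrative).
inp_embed = tf.keras.layers.Embedding(input_dim=8443, output_dim=128)
tar_embed = tf.keras.layers.Embedding(input_dim=8356, output_dim=128)
# Build them up front so no variables get created inside the map function.
inp_embed.build(input_shape=(None, None))
tar_embed.build(input_shape=(None, None))

def embed_batch(inputs, target):
    """Pre-embed a batch while reshaping it into ((inputs, shifted target), labels)."""
    return (inp_embed(inputs), tar_embed(target[:, :-1])), target[:, 1:]

# `train` is the batched, padded integer dataset from the gist; the model then
# takes pre-embedded float Tensors as inputs.
# train = train.map(embed_batch)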
With some effort, I found a way to export and reload the datasets after their creation (which requires Eager execution), so that I was able to run the fitting in a separate session with Eager execution disabled. Long story short, it turns out the issue does not show up when Eager is disabled (and the fitting goes slightly faster: on CPU, 250s / epoch instead of 280s; and obviously enabling GPU use makes for a great runtime gain, with less than 80s / epoch).

So, Eager execution messes things up badly... Why does that seem to be the endpoint of each and every issue I encounter these days? Anyway, I hope someone can find out where things go wrong with Eager enabled, and how to fix this (because disabling Eager is not exactly a fix, just a workaround for the time being, and an option I would personally like to keep in the future).

Code to reproduce (not including the functions defined in the aforeshared gist):

First session (Eager execution is enabled):

# Use aforeshared code to define setup_dataset
import numpy as np
train, valid, inp_voc_size, tar_voc_size = setup_dataset()
np.save('train.npy', [(x.numpy(), y.numpy()) for x, y in train])
np.save('valid.npy', [(x.numpy(), y.numpy()) for x, y in valid])
# I also ran commands to get the constants and note them somewhere.
# In the second run, I therefore hard-code them for simplicity.
# input vocab size is 8443, target vocab size is 8356
# train set comprises 704 batches, validation set has 17

Second session:

import tensorflow as tf
import numpy as np
tf.compat.v1.disable_eager_execution()
def reload_dataset(path):
    """Reload a dumped dataset and finish formatting it."""
    data = np.load(path, allow_pickle=True).tolist()

    def generator():
        for inputs, target in data:
            yield ((inputs, target[:, :-1]), target[:, 1:])

    types = ((tf.int64, tf.int64), tf.int64)
    shape = (((None, None), (None, None)), (None, None))
    dataset = tf.data.Dataset.from_generator(generator, types, shape)
    return dataset
# use aforeshared code to define setup_model
def main():
    train = reload_dataset('train.npy')
    valid = reload_dataset('valid.npy')
    model = setup_model(8443, 8356)
    model.fit(
        epochs=10, x=train.repeat(), steps_per_epoch=704,
        validation_data=valid.repeat(), validation_steps=17,
    )

if __name__ == '__main__':
    main()
Oh, and for the sake of it: I tried fitting a model with Eager enabled after reloading the data from the .npy dumps, and the memory issue is still there (i.e. it is not caused by the use of dataset transformations in setup_dataset).
As I am still hoping that someone will pick up this issue, I conducted (yet) additional testing, namely replacing the built-in embedding lookup with a one-hot dot-product retrieval mechanism. The class I used in place of tf.keras.layers.Embedding:

class OneHotEmbedding(tf.keras.layers.Embedding):
    """Embedding layer with one-hot dot-product retrieval mechanism."""

    def call(self, inputs):
        """Embed some inputs."""
        one_hot = tf.one_hot(inputs, depth=self.input_dim, dtype=tf.float32)
        return tf.keras.backend.dot(one_hot, self.embeddings)

Again, disabling Eager has everything run as smoothly as I want...
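A hypothetical drop-in usage of that class, assuming the model-building code from the gist (the layer sizes here are illustrative):

import tensorflow as tf

# Only the layer class changes; the rest of the functional-API model is unchanged.
token_ids = tf.keras.Input(shape=(None,), dtype=tf.int64)
embedded = OneHotEmbedding(input_dim=8443, output_dim=128)(token_ids)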
I am happy to see some activity popping up (@robieta, you seem to be quite the expert on this kind of issue!), however I see the …
Hi,

Thanks for the report and the reproduction script. I am not able to reproduce the memory leak with the beta1 release on CPU, with the script provided. I will try GPU next. The tutorial script itself does not seem to feature a memory leak either on CPU or GPU.

Please try your reproduction script with your local configuration against the TF 2.0 nightly build: https://pypi.org/project/tf-nightly-2.0-preview/
So, I was actually able to reproduce the problem on GPU (but not CPU). Will investigate further.
It seems that updating the TF version from beta1 to the latest nightly fixes the issue for me. Could you check if the update works for you as well?
Hey, I just want to say that I experienced the same problems with memory leaks (very high memory usage during the first epoch), and installing the nightly version instead of beta1 fixed it for me.
Hi,

As suggested, I installed a GPU-enabled 2.0 nightly build (from binary, using pip). When using the GPU, I still run into an error after the first 38 training steps.
When running solely on CPU, the training runs but the RAM usage goes up as before, i.e. sharply during the first steps, then more slowly (but still upward) as further steps are run; at the end of the first epoch, I had reached nearly 12 GB of RAM usage. The amount of RAM used remained stable during the second epoch (with small up-and-down fluctuations seemingly related to the loading and discarding of data batches, which is normal).
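For reference, a minimal sketch of how such per-step RAM figures can be tracked (this assumes the psutil package; the MemoryLogger callback is illustrative and not part of the original scripts):

import os

import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    """Print the resident memory of the process every `every` training batches."""

    def __init__(self, every=50):
        super().__init__()
        self.every = every
        self.process = psutil.Process(os.getpid())

    def on_train_batch_end(self, batch, logs=None):
        if batch % self.every == 0:
            rss_gb = self.process.memory_info().rss / 1024 ** 3
            print('batch %s: %.2f GB resident' % (batch, rss_gb))

# Usage: model.fit(..., callbacks=[MemoryLogger()])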
I also re-ran the tests adding the …
@pandrey-fr This was resolved in 2.0 rc0.
Awesome! I will test on my custom model tomorrow, but as for the example case I initially shared, it is indeed running just fine with it.
As a conclusive note: I ran my actual model with 2.0 rc0, comparing performance with and without disabling Eager execution. Most importantly, I am happy to report that leaving Eager enabled no longer causes memory issues. Regarding fitting runtimes, disabling Eager still yields a slight gain (122 seconds per epoch, versus 145 for the first and 135 for the following ones when Eager is left enabled - so, around 10 % runtime difference), but this is a relatively small gap compared to what I encountered in 2.0b1. Overall, Eager now seems much more stable than a couple of months ago - impressive progress which must have taken a lot of hard work from all the people involved, so many thanks and congrats for that!
System information
Describe the current behavior
A GPU (edit: CPU as well, see addendum below) memory leak (rapidly) emerges from using (high-dimensional) tf.keras.layers.Embedding layers.

To be more precise, I am working on Transformer networks, and found out that when I try to fit one, e.g. on the Portuguese-to-English translation task presented in this official tutorial, a GPU memory leak emerges after a few iterations. Based on this StackOverflow post, I rapidly came to suspect that the issue comes from the (learnable) embedding layers at the base of both the encoder and decoder parts of the network.
To further assess the issue and its source, I implemented a pseudo-Transformer network (see code linked below) that is stripped of most of the technical components of the actual model (e.g. I removed positional encoding, residual connections, masking mechanisms, etc.) - the rationale being to provide more condensed (and faster-running) code to document this issue, but also to confirm that the leak does not come from custom layers or any "complex" data processing mechanism.
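For illustration, here is a rough sketch of what such a stripped-down model can look like (this is not the actual gist code; the names, dimensions and the Attention stand-in are purely illustrative):

import tensorflow as tf

def build_pseudo_transformer(inp_voc_size, tar_voc_size, dim=128):
    """Build a stripped-down encoder-decoder keeping only the embedding layers."""
    enc_inputs = tf.keras.Input(shape=(None,), dtype=tf.int64, name='inputs')
    dec_inputs = tf.keras.Input(shape=(None,), dtype=tf.int64, name='target')
    # The two high-dimensional, learnable embedding layers under suspicion.
    encoded = tf.keras.layers.Embedding(inp_voc_size, dim)(enc_inputs)
    decoded = tf.keras.layers.Embedding(tar_voc_size, dim)(dec_inputs)
    # Crude stand-in for the encoder-decoder interaction (no masking,
    # positional encoding or residual connections).
    decoded = tf.keras.layers.Attention()([decoded, encoded])
    outputs = tf.keras.layers.Dense(tar_voc_size, activation='softmax')(decoded)
    model = tf.keras.Model(inputs=[enc_inputs, dec_inputs], outputs=outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model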
The provided code includes a data pre-processing pipeline entirely based on the aforementioned tutorial, a model-construction function that makes use of the Keras functional API, and a main function to call the former and start the fitting process. On my computer, everything runs fine and I can see the first few fitting iterations pass, until an ugly stack of allocation error messages shows up (see full log linked below), whose informative part seems to be:
W tensorflow/core/framework/op_kernel.cc:1546] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor
Addendum: I re-ran the provided code disabling access to the GPU, and it turns out there is also high memory usage when running on CPU. During the first epoch (and mostly during its first half), memory usage goes up by multiple GB (in my case, from 2 to 10 GB, with an increase from 2 to 7 within the first 60 train steps out of 704), and keeps slowly increasing throughout the following epochs (with minor decreases between increases, thus displaying local plateaux which I would guess are related to the loading / discarding of data batches). Although it is a bit less of a problem than with the GPU, since it is relatively common to have quite some RAM available (plus some swap space, on Linux), it still does not feel right that fitting the fake model on a dataset which can be fully loaded in memory (creating a list of Eager Tensors from the tf.data.Dataset object containing the batched, padded training set results in a marginal usage of around 100 MB of RAM) would end up using 16 GB of RAM. I would also like to note that calling gc.collect after training does not empty the used RAM, which is only freed (instantly) when ending the Python process.

Describe the expected behavior
The fitting process should go on fine, and the memory should not get saturated (I would expect some tensors to be de-allocated as iterations pass).
Code to reproduce the issue
The script I wrote to illustrate the issue is publicly accessible as a gist here.
Other info / logs
The full error stack (with GPU enabled) is publicly accessible as a gist here.