
tf.keras.layers.Embedding causes memory leak #30952

Closed
pandrey-fr opened this issue Jul 23, 2019 · 20 comments
Labels: comp:keras, TF 2.0, type:bug, type:performance

Comments

@pandrey-fr
Contributor

pandrey-fr commented Jul 23, 2019

System information

  • Have I written custom code: yes
  • OS Platform and Distribution: Linux Mint 19.1
  • TensorFlow installed from: binary (using pip)
  • TensorFlow version: 2.0.0-beta1 (v2.0.0-beta0-16-g1d91213fe7)
  • Python version: 3.6.8
  • CUDA/cuDNN version: 10.0 / 7.5
  • GPU model and memory: Nvidia Quadro P1000 - 4 GB GDDR5

Describe the current behavior

A GPU (edit: CPU as well, see addendum below) memory leak (rapidly) emerges from using (high-dimensional) tf.keras.layers.Embedding layers.

To be more precise, I am working on Transformer networks, and found out that when I try to fit one, e.g. on the Portuguese-to-English translation task presented in this official tutorial, a GPU memory leak emerges after a few iterations. Based on this StackOverflow post, I rapidly came to suspect that the issue comes from the (learnable) embedding layers at the base of both the encoder and decoder parts of the network.

To further assess the issue and its source, I implemented a pseudo-Transformer network (see code linked below) that is stripped of most of the technical components the actual model includes (e.g. I removed positional encoding, residual connections, masking mechanisms, etc.). The rationale was to provide more condensed (and faster-running) code to document this issue, but also to confirm that the leak does not come from custom layers or any "complex" data-processing mechanism.

The provided code includes a data pre-processing pipeline entirely based on the aforementioned tutorial, a model-construction function that makes use of the Keras functional API, and a main function to call the former and start the fitting process. On my computer, everything runs fine and I can see the first few fitting iterations pass, until an ugly stack of allocation error messages shows up (see full log linked below), whose informative part seems to be W tensorflow/core/framework/op_kernel.cc:1546] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor
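For orientation, the model-construction function is roughly of the following shape (a minimal sketch in the spirit of the gist, not its exact code; layer types and dimensions are illustrative):

import tensorflow as tf

def setup_pseudo_model(inp_voc_size, tar_voc_size, dim=256):
    """Build a stripped-down encoder-decoder around learnable embeddings."""
    enc_inputs = tf.keras.Input(shape=(None,), dtype=tf.int64, name='enc_tokens')
    dec_inputs = tf.keras.Input(shape=(None,), dtype=tf.int64, name='dec_tokens')
    # The (high-dimensional) learnable embedding layers suspected of leaking.
    enc_embed = tf.keras.layers.Embedding(inp_voc_size, dim)(enc_inputs)
    dec_embed = tf.keras.layers.Embedding(tar_voc_size, dim)(dec_inputs)
    # Crude stand-in for the encoder-decoder interaction (no positional
    # encoding, masking or residual connections).
    attended = tf.keras.layers.Attention()([dec_embed, enc_embed])
    outputs = tf.keras.layers.Dense(tar_voc_size, activation='softmax')(attended)
    model = tf.keras.Model(inputs=[enc_inputs, dec_inputs], outputs=outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model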

Addendum: I re-ran the provided code with access to the GPU disabled, and it turns out there is also high memory usage when running on CPU. During the first epoch (and mostly during its first half), memory usage goes up by multiple GB (in my case, from 2 to 10 GB, with an increase from 2 to 7 within the first 60 train steps out of 704), and keeps slowly increasing throughout the following epochs (with minor decreases between increases, thus displaying local plateaux which I would guess are related to the loading / discarding of data batches).

Although this is a bit less of a problem than with GPU, since it is relatively common to have quite some RAM available (plus some swap space, on Linux), it still does not feel right that fitting the fake model on a dataset which can be fully loaded in memory (creating a list of Eager Tensors from the tf.data.Dataset object containing the batched, padded training set results in a marginal usage of around 100 MB of RAM) would end up using 16 GB of RAM. I would also like to note that calling gc.collect after training does not empty the used RAM, which is only freed (instantly) when ending the Python process.
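(For the record, the footprint checks mentioned above were along these lines - a rough sketch, where trainset stands for the batched, padded training tf.data.Dataset built in the gist:)

import gc

# Materializing the whole batched, padded training set as a list of Eager
# Tensors only adds on the order of 100 MB of RAM, i.e. the data is small.
data = list(iter(trainset))

# ... whereas after model.fit(...), this does not give back the several GB
# accumulated during training; the memory is only freed when the Python
# process exits.
gc.collect()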

Describe the expected behavior

The fitting process should go on fine, and the memory should not get saturated (I would expect some tensors to be de-allocated as iterations pass).

Code to reproduce the issue

The script I wrote to illustrate the issue is publicly accessible as a gist here.

Other info / logs

The full error stack (with GPU enabled) is publicly accessible as a gist here.

@rishabhsahrawat

@pandrey-fr Hi Paul,
this is not related to your problem; however, I am stuck one step before you, and since you have helped me earlier I thought you might have some suggestions.
I am training my tf.keras model on multiple GPUs (2, exactly), following TF 2.0's example for doing that. I am adding these two lines
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
before the model's definition and compilation, as also shown in the link. Now I am receiving errors. My question is whether there is anything else that has to be added in order to train on multiple GPUs. I have my own tf.data.Dataset, separated into train and test sets after padding and batching.
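For reference, the overall pattern I am following is roughly this (a rough sketch; build_model, train_dataset and test_dataset stand for my own code):

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Model creation and compile() both happen inside the strategy scope.
    model = build_model()  # placeholder for my own model definition
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# fit() is then called outside the scope, on my batched tf.data.Dataset.
model.fit(train_dataset, epochs=10, validation_data=test_dataset)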
Thank you in advance for your help. I already opened an issue on TF but haven't gotten any response.

@pandrey-fr pandrey-fr changed the title tf.keras.layers.Embedding causes GPU memory leak tf.keras.layers.Embedding causes memory leak Jul 23, 2019
@pandrey-fr
Contributor Author

After additional testing, I found out that the high memory usage is not exclusive to the GPU, and updated the initial post accordingly.

@pandrey-fr
Contributor Author

@rishabhsahrawat Hi, I unfortunately have no experience with GPU distribution strategies, and only have 1-GPU machines at my disposal at the moment, hence I would not know how to help you... Sorry :/

@gadagashwini-zz
Contributor

I could reproduce the reported issue on Colab with TensorFlow version 2.0.0-beta1. Please take a look at the gist of the Colab. Thanks!

@pandrey-fr
Contributor Author

Additional information: since the aforementioned post recommends taking the embedding lookup out of the training loop, I ran a modified version of the code where the embedding layers are declared outside of the instantiated Keras Model (which now takes pre-embedded Tensors as inputs) and applied to the datasets at the time of their reshaping (in the reformat_inputs function), as sketched below.
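The modification was roughly of this shape (a sketch, not the exact code; the embedding dimension is illustrative, the vocabulary sizes are those of the dataset):

import tensorflow as tf

# Embedding layers declared outside of the keras Model, which now takes
# pre-embedded (float) Tensors as inputs.
inp_embed = tf.keras.layers.Embedding(8443, 256)
tar_embed = tf.keras.layers.Embedding(8356, 256)

def reformat_inputs(inputs, target):
    """Reshape a data batch and pre-embed the token Tensors."""
    return (inp_embed(inputs), tar_embed(target[:, :-1])), target[:, 1:]

trainset = trainset.map(reformat_inputs)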

This does not resolve the issue, and the GPU runs out of memory just as fast. However, if I try to load the entire training set in (non-GPU) memory (e.g. data = list(iter(trainset))), it works, but the memory used (which is greater than when loading the non-embedded data, which makes sense) is not freed upon deletion. That is, it appears that every time the data is loaded, memory is allocated that cannot be de-allocated; and apparently more data loading occurs when fitting the model than I would expect, since the increase (with GPU disabled) is far greater when running model.fit than when listing the contents of the trainset object.

@pandrey-fr
Contributor Author

pandrey-fr commented Jul 24, 2019

With some effort, I found a way to export and reload the datasets after their creation (which requires Eager execution), so that I was able to run setup_dataset, dump the results, restart Python, call tf.compat.v1.disable_eager_execution, reload the datasets, run setup_model and fit the model without Eager.

Long story short, it turns out the issue does not show up when Eager is disabled (and the fitting goes slightly faster - on CPU, 250 s/epoch instead of 280 s; and obviously enabling GPU use makes for a great runtime gain, with less than 80 s/epoch).

So, Eager execution messes things up badly... Why does that seem to be the endpoint of each and every issue I encounter these days? Anyway, I hope someone can find out where things go wrong with Eager enabled, and how to fix this (because disabling Eager is not exactly a fix, just a workaround for the time being - and an option I would personally like to keep in the future, outside of the compat sub-module, but that is another question).

Code to reproduce (not including the functions defined in the gist shared above)

First session - Eager execution is enabled.

# Use the code from the gist shared above to define setup_dataset
import numpy as np

train, valid, inp_voc_size, tar_voc_size = setup_dataset()
np.save('train.npy', [(x.numpy(), y.numpy()) for x, y in train])
np.save('valid.npy', [(x.numpy(), y.numpy()) for x, y in valid])

# I also ran commands to get the constants and note them somewhere.
# In the second run, I therefore hard-code them for simplicity.
# input vocab size is 8443, target vocab size is 8356
# train set comprises 704 batches, validation set has 17

Second session

import tensorflow as tf
import numpy as np

tf.compat.v1.disable_eager_execution()

def reload_dataset(path):
    """Reload a dumped dataset and finish formatting it."""
    data = np.load(path, allow_pickle=True).tolist() 
    def generator(): 
        for inputs, target in data: 
            yield ((inputs, target[:, :-1]), target[:, 1:]) 
    types = ((tf.int64, tf.int64), tf.int64) 
    shape = (((None, None), (None, None)), (None, None)) 
    dataset = tf.data.Dataset.from_generator(generator, types, shape) 
    return dataset 


# Use the code from the gist shared above to define setup_model


def main():
    train = reload_dataset('train.npy')
    valid = reload_dataset('valid.npy')
    model = setup_model(8443, 8356)
    model.fit(
        epochs=10, x=train.repeat(), steps_per_epoch=704,
        validation_data=valid.repeat(), validation_steps=17,
    )

if __name__ == '__main__':
    main()

@pandrey-fr
Contributor Author

Oh, and for the sake of it: I tried fitting a model with Eager enabled after reloading the data from the .npy dumps, and the memory issue is still there (i.e. it is not caused by the use of dataset transformations in setup_dataset).

@pandrey-fr
Contributor Author

As I am still hoping that someone will pick up this issue, I conducted (yet more) additional testing, namely replacing tf.keras.layers.Embedding with a subclass that overrides the call method in order to use one-hot encoding and a dot-product to retrieve embeddings, instead of tf.nn.embedding_lookup (see code below). This does not fix the issue, which therefore seems to be a general issue of input tensors not being discarded in Eager execution (regardless of how they are created).

The class I used to replace tf.keras.layers.Embedding inside the setup_pseudo_model function:

class OneHotEmbedding(tf.keras.layers.Embedding):
    "Embedding layer with one-hot dot-product retrieval mechanism."""

    def call(self, inputs):
        """Embed some inputs."""
        one_hot = tf.one_hot(inputs, depth=self.input_dim, dtype=tf.float32)
        return tf.keras.backend.dot(one_hot, self.embeddings)
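
Usage-wise, it is a drop-in replacement for the original layer inside setup_pseudo_model, e.g. (names and dimension illustrative):

embedded = OneHotEmbedding(input_dim=inp_voc_size, output_dim=256)(token_inputs)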

Again, disabling Eager has everything run as smoothly as I want...

@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower type:performance Performance Issue and removed type:bug Bug labels Jul 30, 2019
@pandrey-fr
Contributor Author

I am happy to see some activity popping up (@robieta, you seem to be quite the expert on this kind of issue!). However, I see the type:bug tag was removed, which in my humble opinion is incorrect; this is not just a performance issue (as is the case, e.g., in #30561), this is a bug, in the sense that training is not just slower, it is made altogether impossible on some configurations (namely when a GPU is visible, or when the amount of RAM is under 16 GB - which should be plenty given the size of the model and dataset).

@fchollet
Contributor

Hi,

Thanks for the report and the reproduction script.

I am not able to reproduce the memory leak with the beta1 release on CPU, with the script provided. I will try GPU next.

The tutorial script itself does not seem to feature a memory leak either on CPU or GPU.

Please try your reproduction script with your local configuration with the TF2.0 nightly build: https://pypi.org/project/tf-nightly-2.0-preview/

@fchollet
Contributor

fchollet commented Aug 13, 2019

So, I was actually able to reproduce the problem on GPU (but not CPU). Will investigate further.

@fchollet
Contributor

It seems that updating the TF version from beta1 to the latest nightly fixes the issue for me. Could you check if the update works for you as well?

pip install tf-nightly-gpu-2.0-preview

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Aug 14, 2019
@BeWe11

BeWe11 commented Aug 14, 2019

Hey, I just want to say that I experienced the same problems with memory leaks (very high memory usage during the first epoch), and installing the nightly version instead of beta1 fixed it for me.

@pandrey-fr
Contributor Author

Hi,
Thank you for the follow-up on this. I am currently on the move and will only have access to the machine I ran those tests on next week, but hopefully the nightly build does come with the right fixes. I will post the results (and hopefully close this issue) next week.

@pandrey-fr
Contributor Author

pandrey-fr commented Aug 19, 2019

@fchollet

As suggested, I installed a GPU-enabled 2.0 nightly build (from binary, using pip; version 2.0.0-dev20190731 / git version v1.12.1-7529-g3e0ad8a004) and ran the test script again (without the line disabling Eager execution), both with and without GPU; unfortunately, I still encounter the same issue.

When using the GPU, I get the following error after the first 38 training steps:

ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[64,37,8356] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node loss/dense_2_loss/clip_by_value/Minimum (defined at /home/pandrey/Documents/tfnightly/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1686) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[training/Adam/gradients/gradients/embedding_1/embedding_lookup_grad/Reshape/_34]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[64,37,8356] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node loss/dense_2_loss/clip_by_value/Minimum (defined at /home/pandrey/Documents/tfnightly/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1686) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_keras_scratch_graph_1551]

Function call stack:
keras_scratch_graph -> keras_scratch_graph

When running solely on CPU, the training runs but the RAM usage goes up as before, i.e. a lot during the first steps, then more slowly but still upward as further steps are run (at the end of the first epoch, I reached nearly 12 GB of RAM usage). The amount of RAM used remained stable during the second epoch (with small up-and-down fluctuations seemingly related to the loading and discarding of data batches, which is normal).

@pandrey-fr
Contributor Author

I also re-ran the tests adding the tf.compat.v1.disable_eager_execution() line, which again avoids triggering the issue. On CPU, RAM usage fluctuates between 2 and 2.4 GB, and on GPU the available 4 GB of dedicated memory are not exhausted (and training goes way faster than the first steps run with both GPU and Eager enabled).

@jvishnuvardhan
Contributor

jvishnuvardhan commented Aug 23, 2019

@pandrey-fr This was resolved in 2.0.0rc0 and was tested internally. Can you check and let us know whether it is resolved for you when you use !pip install tensorflow==2.0.0rc0? Thanks!

@pandrey-fr
Contributor Author

Awesome! I will test on my custom model tomorrow, but as for the example case I initially shared, it is indeed running just fine with 2.0.0rc0. Many thanks and congratulations on the work already achieved towards the 2.0 release :-)

@pandrey-fr
Contributor Author

As a conclusive note:

I ran my actual model with 2.0 rc0, comparing performances with and without disabling Eager execution. Most importantly, I am happy to report that leaving Eager enabled no longer causes memory issues. Regarding fitting runtimes, disabling Eager still yields a slight gain (122 seconds per epoch, versus 145 for the first and 135 for the following ones when Eager is left enabled - so, around 10 % runtime difference), but this is a relatively small gap compared to what I encountered in 2.0b1.

Overall, Eager now seems much more stable than it was a couple of months ago - impressive progress that must have taken a lot of hard work from everyone involved, so many thanks and congrats for that!

@lvenugopalan lvenugopalan added the TF 2.0 Issues relating to TensorFlow 2.0 label Apr 29, 2020