Can not use large dimension in Embedding layer on GPU(s). #31162
Hi @rishabhsahrawat, Could you be more precise as to when the issue occurs exactly, and what your data processing pipeline looks like (i.e. do you have the entire dataset loaded in memory, or do you iteratively load and discard pieces of it)? I would not be surprised if your issue was similar to the one I reported in #30952 (which has not been picked up as of now...), but maybe it is a simpler dataflow management issue. In the former case, you might want to try to run after disabling Eager execution (which is not a fix to the issue, but could be a work-around...). |
Hi @pandrey-fr , thank you for your questions. I am actually using
So, if I lower the embedding size to <=100, then everything works without any error. |
To be clear, does this include the fitting process of the model? If so, then it could just be that the data is indeed too large for the amount of RAM on your GPU. Could you describe your GPU's specs (name, RAM)? You could also try running your code without enabling GPU use (run `tf.config.experimental.set_visible_devices([], 'GPU')` at the beginning of your code) and monitor how much RAM is being used, to get a notion of how much memory your model and data require, and whether this is a stable amount throughout the training cycle or an increasing one (which would make it similar to my issue).
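For reference, a minimal sketch of that check (not from the thread; it assumes the `psutil` package is available for process-level RAM monitoring):

import os

import psutil  # assumed available; any process-RAM monitor would do
import tensorflow as tf

# Hide all GPUs so that the model and data are placed on the CPU.
tf.config.experimental.set_visible_devices([], 'GPU')

def report_ram(tag):
    # Print the resident memory of the current process, in GB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[{tag}] process RSS: {rss / 1e9:.2f} GB")

report_ram("before building the model")
# ... build and fit the model here, calling report_ram() between steps,
# e.g. from a custom tf.keras.callbacks.Callback at the end of each epoch.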
I had not noticed this part of the question, but this could be a strategy. I believe you then need to use the `tf.device` context manager, along these lines:

# list available GPUs, make sure you have at least two
gpus = tf.config.experimental.list_logical_devices('GPU')
assert len(gpus) >= 2
# place the embedding layer on the first GPU
with tf.device(gpus[0].name):
    embedded = tf.keras.layers.Embedding(input_dim, output_dim)(inputs)
# place the rest of the model on the second GPU
with tf.device(gpus[1].name):
    output = some_layers_stack(embedded) |
Yes, this happens after calling `model.fit()`. Earlier, I was trying to shuffle the dataset using `dataset.shuffle()`, which fills up the buffer and the CPU's memory before the first epoch runs. Eventually, this buffer-filling process uses all 16 GB of my CPU's RAM, and after that nothing runs. At the end you have mentioned sample code for placing model layers on different GPUs; I have a question regarding this. Do you think it will work similarly to `tf.distribute.MirroredStrategy()` from here? |
Okay, so basically your dataset (multiplied by the embedding dimension) is huge, and the embedding vectors representing it cannot all be loaded in memory (either CPU or GPU) at the same time. For your model to run, you therefore need to keep the amount of data loaded at any given time under 16 GB (which is the limit on both your CPU and GPU, notwithstanding the idea of using multiple parallel devices). If we make the (possibly strong) assumption that there is no actual TensorFlow issue, this should be achievable (with a large embedding dimension) by tweaking your `tf.data.Dataset` object - basically, you want to load a small part of the data (which makes up a few batches), shuffle and padded-batch it, feed it to your model, discard it and go on doing the same with the next bit. Are you still using the dataset from the tutorial? If so, I will try to write you a bit of code later tonight. Now, once this is dealt with, we should watch for any memory increase during training (despite the dataflow), which would indicate an issue similar to mine, but hopefully your problem only comes from loading too much data at once.
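A minimal sketch of such a pipeline (illustrative only; the file names, `encode_line` helper, buffer size and batch shapes are hypothetical placeholders, not from the thread, and only the input side is shown):

import tensorflow as tf

# Hypothetical: text files containing the pre-encoded sentences, one per line.
file_paths = ["train_shard_0.txt", "train_shard_1.txt"]

def encode_line(line):
    # Hypothetical encoding step: split the line into integer token ids.
    return tf.strings.to_number(tf.strings.split(line), out_type=tf.int64)

dataset = (
    tf.data.TextLineDataset(file_paths)
    .map(encode_line, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Keep only a bounded number of examples in memory for shuffling,
    # instead of buffering the whole 16M-sentence dataset.
    .shuffle(buffer_size=50_000)
    # Pad each batch to the length of its longest sequence.
    .padded_batch(64, padded_shapes=[None])
    .prefetch(tf.data.experimental.AUTOTUNE)
)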
From what I understand, `tf.distribute.MirroredStrategy()` creates a copy of your model on each available GPU, so as to run parallel fitting on various batches. If that is indeed the case, the amount of memory used on the first GPU should not decrease; instead, this strategy boosts training runtime by distributing the processing of training samples across parallel copies of the model (while still somehow gathering a unified weights update - I guess each copy treats part of the batch, then gradients are computed based on an aggregate of local losses, or maybe there is some rougher scheme for aggregating locally-computed updates; I have not looked into the details). What I was suggesting (and I have absolutely no idea whether that would work) was to put distinct parts of the architecture on the various devices. It might not be possible, or it might induce data transmission overheads that greatly slow computations (and possibly do not yield the memory gain I had in mind) - so, if you try experimenting in that direction, I would be glad to hear about the results! |
Yes, I am still using the same way of loading the dataset, so padded_batch etc., except shuffling, since it doesn't work for me as I said. I will surely try out the way of dividing the layers of the model and hope it will work. I will keep you updated about that. Thank you again for your suggestions and help. -Rishabh |
Hi Rishabh, Looking further at the tutorial (whose dataset is way smaller than yours, if I understand well) and doing a few tests on my own, I must say I am surprised by the issues you are encountering... Normally, setting a reasonable buffer_size argument when using shuffle should solve memory issues associated with that part of the data pipeline. As for the embedding matrix, it is indeed rather big, but when I tried allocating a similarly-shaped one on my system (using either the GPU's 4 GB dedicated memory or the 16 GB of RAM I have at my disposal) it fitted with just a slight warning about its size (basically it takes a bit less than 2 GB of memory space, which is a lot for a single weights matrix but should be tractable given your config). Is there any chance you could share your dataset and code with me (optionally via a private channel) so that I can have a look? I suspect there might be something wrong in the dataflow that would explain your running out of memory, but I could be overly optimistic. At any rate, an alternative way of shrinking the embedding matrix would be to decrease the number of tokens in your vocabulary, e.g. using a WordPiece tokenizer to break uncommon tokens down into known ones (including phonetic tokens if needed). This way you might end up with a matrix with (way) fewer rows, which might also be a good thing for your modeling (depending on the number of rarely-used tokens in the dataset). Best, |
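As an illustration of that vocabulary-reduction idea, here is a sketch using the tfds subword encoder from the TF 2.0-era text tutorials (the corpus file and target vocabulary size are assumptions; in recent tfds versions this class lives under tfds.deprecated.text):

import tensorflow_datasets as tfds

# Hypothetical generator yielding the raw sentences of the corpus.
def corpus_generator():
    with open("sentences.txt", encoding="utf-8") as file:
        for line in file:
            yield line.strip()

# Build a subword vocabulary of roughly 32k tokens instead of 366k word-level
# ones; rare words get broken down into smaller, known subword units.
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator(), target_vocab_size=2**15
)
print(encoder.vocab_size)  # roughly 32k rows in the embedding matrix
print(encoder.encode("An example sentence with uncommon tokens."))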
Hi Paul, thank you for your response. My dataset has 16455928 elements/sentences, so in order to make sure that shuffling is perfect I must use a buffer_size equal to or greater than the size of my full dataset (mentioned here); otherwise some elements might never be seen by the model, or be seen more than once, during an epoch. Regards, |
@rishabhsahrawat Will it be possible to create minimal reproducible code and share it with us, so we can move faster? Thanks! |
That is correct; however, I guess an imperfect shuffling might be better than no shuffling at all if the dataset has some inherent order - but it might already be shuffled "by nature", or could probably be shuffled on disk outside of TensorFlow. I leave it up to you to see what best suits your use case!
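A sketch of the "shuffle on disk outside of TensorFlow" option (the file names are hypothetical; this only needs the raw text in memory, not the embedded representations):

import random

# One-off shuffle of the raw sentence file, done before training; it removes
# the need for a huge in-memory shuffle buffer inside the tf.data pipeline.
with open("sentences.txt", encoding="utf-8") as file:
    lines = file.readlines()

random.seed(0)
random.shuffle(lines)

with open("sentences_shuffled.txt", "w", encoding="utf-8") as file:
    file.writelines(lines)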
I just tested again using your exact code; I agree on the number of parameters, but it still takes "only" 1.4 ~ 1.5 GB of RAM, at least after instantiation and after feeding it a Tensor (to ensure the weights are indeed built)... But maybe there is something somewhere in your code that triggers the creation of multiple copies of the weights?
No problem, that was to be expected.
That is indeed partially similar, but there is also the breaking down of uncommon tokens into phonetic units which, in my short experience, can greatly decrease vocabulary size - but again, this is merely an abstract suggestion and it is up to you to see whether it suits your needs given your data and application context :) To conclude, it seems that you are encountering a memory issue, but whether it stems from the mere size of your model or from a bug somewhere is still unclear... Edit: after writing the code below I tried running it and I did run out of GPU memory during training (the first batch went well, the second triggered an allocation error). @gadagashwini I leave it up to Rishabh to provide more details about the model he is using, notably as to the output dimensionality, but I guess a first approximation would be:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=366856, output_dim=1000, input_length=4),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(1000)),
    tf.keras.layers.Dense(500, activation='relu'),
    tf.keras.layers.Dense(64, activation='softmax')
])
model.compile('adam', 'sparse_categorical_crossentropy', ['sparse_categorical_accuracy'])
mock_inputs = tf.random.uniform((64, 4), 1, 366856, tf.int64)
mock_target = tf.random.uniform((64, 1), 0, 64, tf.int64)
model.fit(mock_inputs, mock_target, batch_size=32, epochs=5) |
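For a sense of why training (as opposed to merely building) such a model can strain a 16 GB GPU, here is a rough back-of-the-envelope estimate (my own, not from the thread), assuming float32 weights and the two slot variables that Adam keeps per parameter:

# Embedding matrix only: one float32 weight per (token, dimension) pair.
embedding_params = 366_856 * 1_000            # ~367M parameters
bytes_per_param = 4                           # float32

weights = embedding_params * bytes_per_param  # ~1.5 GB of weights
gradients = weights                           # a same-sized gradient tensor
adam_slots = 2 * weights                      # Adam's 'm' and 'v' accumulators

total = weights + gradients + adam_slots
print(f"~{total / 1e9:.1f} GB before activations and temporary buffers")  # ~5.9 GB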
Hi @gadagashwini , I cannot share the dataset unfortunately, but I can share the model architecture example. These example layers, also from here, throw the same errors about memory being full, the GPU being exhausted, etc. The updated layers look like this -
If I use output_dim <= 150 in the Embedding layer, then training starts without any problems. This was all on 2 GPUs. |
Hi @pandrey-fr , I just tried your dummy model. As you mentioned, training on a single GPU also triggered the memory allocation problem, once during epoch 2 (like yours) and once during epoch 5. On the CPU it takes about 2-3 GB of RAM, but as soon as the model's fit function is executed, the RAM usage increases and settles near 14-15 GB.
Thank you! |
Hi @rishabhsahrawat, Thank you for the details and feedback; let us hope the issue gets picked up by someone who can clarify what is going on, and especially disentangle expected behaviours from potential bugs. Just a side note, but as your model's output dimension is the same as the input embedding one (I guess you are doing some kind of language modelling task, e.g. predicting masked or next tokens in a sentence), you might want to re-use the embedding matrix as a kernel on your output layer (this is notably what is done in Transformer models - well, not in the Google tutorial, but in the paper and reference implementations); this greatly diminishes the number of trainable weights :) (note that you have to tweak a little to do that, either with a custom layer or the functional API, but it really is not much of a difficulty) |
I am not sure if I understood it correctly, but I think you mean creating something like a 'word2vec' model? If not, can you also share some helpful link to understand it better? |
UPDATE |
What I mean is doing something like this (with a possibly more-general implementation):

import tensorflow as tf

class SharedKernelSoftmax(tf.keras.layers.Layer):
    # Softmax output layer whose kernel is a shared, externally-owned matrix
    # (e.g. an embedding matrix), so that those weights are not duplicated.

    def __init__(self, kernel, bias_initializer='zeros', **kwargs):
        super().__init__(**kwargs)
        self.kernel = kernel
        self.bias = self.add_weight(
            name='bias', shape=(kernel.shape[1],), dtype=kernel.dtype,
            initializer=bias_initializer
        )

    def call(self, inputs, **kwargs):
        output = tf.keras.backend.dot(inputs, self.kernel) + self.bias
        return tf.nn.softmax(output, axis=-1)

And using such a layer (instantiated by being passed a shared embedding matrix of the output vocabulary) instead of the final softmax in your model. That being said, I now see that your input and output layers have different dimensions (different vocabularies, I guess?), so this is actually not relevant here (it would be if your input and output vocabularies were the same, or if you used a sequence-to-sequence model where tokens in the output vocabulary are also embedded before being passed to a decoder model). Sorry! |
Interesting... Could you monitor RAM usage and share some indications as to the amount of memory used when training on CPU? |
Thank you for clarifying my confusion. As you noticed, the input and output layers do not have the same dimensions, so it will not be useful; however, since I am tokenizing the data, I am saving the encodings to a token file which I can load later, without tokenizing the whole dataset on each run. I am now training the model with output_dim 1000 on CPU, and I can see that most of the memory is taken, but the amount of unused memory keeps changing - sometimes 3 MB, sometimes 600 MB, and sometimes even 3 GB. It is still running the first epoch right now. The %CPU usage even goes up to 1000. I am using |
@ymodak @gadagashwini I just found out that if I train on 2 GPUs it takes 1s per step, but when I use all 4 GPUs it takes 2s per step. I think it should train faster on more GPUs. |
If you use a distribution strategy, could you try |
Hi @yuefengz , thank you for your suggestion. I tried this by implementing it in the same way as
I read about it here too.
The error log is huge, so I am sharing only the last part of it. If you require it, I can share the full log. |
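The comment above does not show exactly which strategy was tried; as a generic illustration, this is how a tf.distribute strategy scope typically wraps model creation (the strategy choice and model are placeholders, with MirroredStrategy being the one explicitly mentioned earlier in this thread):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Model variables (including the embedding matrix) must be created
    # inside the scope so the strategy can place/replicate them.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=366856, output_dim=100),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(1000)),
        tf.keras.layers.Dense(64, activation='softmax'),
    ])
    model.compile('adam', 'sparse_categorical_crossentropy')

# model.fit(dataset, epochs=...) is then called outside the scope as usual.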
I have a question for all of you, @ymodak @gadagashwini @yuefengz : do you think that if I use TF 1.14 I can train the embedding layer with a bigger dimension on multiple GPUs efficiently, without the problems I am facing in TF 2.0? If yes, could you please share some helpful links for converting the code to TF 1.x? I am new to TF 1.x. Thank you! |
I have another question. I want to continue training after loading the last saved model, for which I am defining
Even if I choose a single GPU, it still ends up with the same error. If I do not use |
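For reference, a minimal sketch of resuming training from a saved Keras model under a distribution strategy (the checkpoint path and dataset are hypothetical; the thread does not show the exact setup being used):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Reload the previously saved model; its variables are then created
    # (and replicated) under the strategy, so training can resume.
    model = tf.keras.models.load_model("saved_model_dir")

# Hypothetical: `train_dataset` is the same tf.data pipeline used initially.
# model.fit(train_dataset, epochs=5)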
@rishabhsahrawat You might want to try
In my limited experience, if you are using the Keras API, you do not need to change that many things to run code compatible with TF 1.14. The main changes are Eager execution being disabled (which disables a few things and requires a stricter implementation of some others, notably as to type casting, which is somewhat more flexible with Eager enabled), and the necessity, when not using Keras, to manually set up placeholders, session objects and initialization instructions; but the latter points are automated through the Keras backend, so you probably will not have to go through them. Good luck, and if you run into specific problems, do ask for support :-) |
Hi @ymodak , any updates on the issue? I am still struggling with it. |
For the problem with |
With the latest nightly build, I am able to use a larger dimension (up to 1000) but not bigger. Also, after raising this issue I started building the model in TF 1.14, and it has the same issue: values up to 400 work, but not higher. |
Hi Rishabh, Generally, an embedding layer in Keras uses more RAM than you would expect. To get the RAM requirements, compute: no. of tokens * no. of dimensions * dtype size * approx. 15. So for 400k tokens, 1k dimensions and a 32-bit dtype you would need about 1.6 GB * 15, which is about 24 GB. This probably loads on the CPU because it is able to use your swap file. If the weights were simply in a NumPy array they would only take up about 1.6 GB, but Keras / TF is really greedy. I really wish it were less greedy, because it would be so much faster to start up a model =( |
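Spelling that rule of thumb out (the x15 factor is the previous commenter's empirical estimate, not an official figure):

# Raw size of the weights themselves, versus the commenter's rule of thumb
# for what Keras/TF ends up using in practice.
num_tokens = 400_000
embedding_dim = 1_000
bytes_per_value = 4  # 32-bit floats

raw_matrix = num_tokens * embedding_dim * bytes_per_value  # 1.6e9 bytes ~ 1.6 GB
estimated_usage = raw_matrix * 15                          # ~ 24 GB

print(f"raw matrix: {raw_matrix / 1e9:.1f} GB, estimated usage: {estimated_usage / 1e9:.0f} GB")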
@rishabhsahrawat Closing this, as it is not a bug but a feature request for supporting sharded embeddings in sync training. Please raise a feature request, as this is working as intended. Thanks! |
I have got the same error when using CentralStorageStrategy (2.4.1). It can be reproduced by changing the strategy in https://github.com/tensorflow/docs/blob/master/site/en/tutorials/distribute/parameter_server_training.ipynb. It only appears when I use multiple GPUs and an embedding in my model. For a single GPU it works, and for models that do not have an embedding it also works. Log:
@yuefengz has it been fixed? |
@liyinhgqw @yuefengz did you fix it? I got the same error when changing from MirroredStrategy to CentralStorageStrategy (because of OOM), on Python 3.7, TensorFlow 2.4, Linux. |
I am using the TF 2.0 latest nightly build and I am trying to train an LSTM model for text classification on a very large dataset of 16455928 sentences. For the embedding layer in the model, I have a vocabulary size of 366856 and I used 1000 as the embedding dimension, on which the 2 GPUs (Tesla T4, from Google) ran out of memory.
Since I cannot lower the size of the vocabulary (maybe there is a way), I used a lower value for the embedding dimension (100), with which the model starts training. Now my question is whether there is a way I can use a higher embedding dimension - maybe by putting sets of layers of my model on different GPUs, and if so, what is the way to do that in TF 2.0? Also, will using more GPUs help? Thank you!