Can not use large dimension in Embedding layer on GPU(s). #31162
Hi @rishabhsahrawat, Could you be more precise as to when the issue occurs exactly, and what your data processing pipeline looks like (i.e. do you have the entire dataset loaded in memory, or do you iteratively load and discard pieces of it)? I would not be surprised if your issue was similar to the one I reported in #30952 (which has not been picked up as of now...), but maybe it is a simpler dataflow management issue. In the former case, you might want to try to run after disabling Eager execution (which is not a fix to the issue, but could be a work-around...). |
Hi @pandrey-fr , thank you for your questions. I am actually using
So, if I lower the embedding size to <=100, then everything works without any error. |
To be clear, does this include the fitting process of the model? If so, then it could just be that the data is indeed too large for the amount of RAM on your GPU. Could you describe your GPU's specs (name, RAM)? You could also try running your code without enabling GPU use (run `tf.config.experimental.set_visible_devices([], 'GPU')` at the beginning of your code) and monitor how much RAM is being used, to get a notion of how much memory your model and data require, and whether this is a stable amount throughout the training cycle or an increasing one (which would make it similar to my issue).
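For reference, a minimal sketch of that check (not from the thread; it assumes the `psutil` package is available for process-level RAM monitoring):

import os

import psutil  # assumed available; any process-RAM monitor would do
import tensorflow as tf

# Hide all GPUs so that the model and data are placed on the CPU.
tf.config.experimental.set_visible_devices([], 'GPU')

def report_ram(tag):
    # Print the resident memory of the current process, in GB.
    rss = psutil.Process(os.getpid()).memory_info().rss
    print(f"[{tag}] process RSS: {rss / 1e9:.2f} GB")

report_ram("before building the model")
# ... build and fit the model here, calling report_ram() between steps,
# e.g. from a custom tf.keras.callbacks.Callback at the end of each epoch.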
I had not noticed this part of the question, but this could be a strategy. I believe you then need to use the `tf.device` context manager, along these lines:

# list available GPUs, make sure you have at least two
gpus = tf.config.experimental.list_logical_devices('GPU')
assert len(gpus) >= 2
# place the embedding layer on the first GPU
with tf.device(gpus[0].name):
    embedded = tf.keras.layers.Embedding(input_dim, output_dim)(inputs)
# place the rest of the model on the second GPU
with tf.device(gpus[1].name):
    output = some_layers_stack(embedded) |
Yes, this happens after calling `model.fit()`. Earlier, I was trying to shuffle the dataset using `dataset.shuffle()`, which fills up the buffer and the CPU's memory before the first epoch runs. Eventually, this buffer-filling process uses all 16 GB of my CPU's RAM, and after that nothing runs. At the end you have mentioned sample code for placing model layers on different GPUs; I have a question regarding this. Do you think it will work similarly to `tf.distribute.MirroredStrategy()` from here? |
Okay, so basically your dataset (multiplied by the embedding dimension) is huge, and the embedding vectors representing it cannot all be loaded in memory (either CPU or GPU) at the same time. For your model to run, you therefore need to keep the amount of data loaded at any given time under 16 GB (which is the limit on both your CPU and GPU, notwithstanding the idea of using multiple parallel devices). If we make the (possibly strong) assumption that there is no actual TensorFlow issue, this should be achievable (with a large embedding dimension) by tweaking your `tf.data.Dataset` object - basically, you want to load a small part of the data (which makes up a few batches), shuffle and padded-batch it, feed it to your model, discard it and go on doing the same with the next bit. Are you still using the dataset from the tutorial? If so, I will try to write you a bit of code later tonight. Now, once this is dealt with, we should watch for any memory increase during training (despite the dataflow), which would indicate an issue similar to mine, but hopefully your problem only comes from loading too much data at once.
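A minimal sketch of such a pipeline (illustrative only; the file names, `encode_line` helper, buffer size and batch shapes are hypothetical placeholders, not from the thread, and only the input side is shown):

import tensorflow as tf

# Hypothetical: text files containing the pre-encoded sentences, one per line.
file_paths = ["train_shard_0.txt", "train_shard_1.txt"]

def encode_line(line):
    # Hypothetical encoding step: split the line into integer token ids.
    return tf.strings.to_number(tf.strings.split(line), out_type=tf.int64)

dataset = (
    tf.data.TextLineDataset(file_paths)
    .map(encode_line, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Keep only a bounded number of examples in memory for shuffling,
    # instead of buffering the whole 16M-sentence dataset.
    .shuffle(buffer_size=50_000)
    # Pad each batch to the length of its longest sequence.
    .padded_batch(64, padded_shapes=[None])
    .prefetch(tf.data.experimental.AUTOTUNE)
)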
From what I understand, `tf.distribute.MirroredStrategy()` creates a copy of your model on each available GPU, so as to run parallel fitting on various batches. If that is indeed the case, the amount of memory used on the first GPU should not decrease; instead, this strategy boosts training runtime by distributing the processing of training samples across parallel copies of the model (while still somehow gathering a unified weights update - I guess each copy treats part of the batch, then gradients are computed based on an aggregate of local losses, or maybe there is some rougher scheme for aggregating locally-computed updates; I have not looked into the details). What I was suggesting (and I have absolutely no idea whether that would work) was to put distinct parts of the architecture on the various devices. It might not be possible, or it might induce data transmission overheads that greatly slow computations (and possibly do not yield the memory gain I had in mind) - so, if you try experimenting in that direction, I would be glad to hear about the results! |
Yes, I am still using the same way of loading the dataset, so padded_batch etc., except shuffling, since it doesn't work for me as I said. I will surely try out the way of dividing the layers of the model and hope it will work. I will keep you updated about that. Thank you again for your suggestions and help. -Rishabh |
Hi Rishabh, Looking further at the tutorial (whose dataset is way smaller than yours, if I understand well) and doing a few tests on my own, I must say I am surprised by the issues you are encountering... Normally, setting a reasonable buffer_size argument when using shuffle should solve memory issues associated with that part of the data pipeline. As for the embedding matrix, it is indeed rather big, but when I tried allocating a similarly-shaped one on my system (using either the GPU's 4 GB dedicated memory or the 16 GB of RAM I have at my disposal) it fitted with just a slight warning about its size (basically it takes a bit less than 2 GB of memory space, which is a lot for a single weights matrix but should be tractable given your config). Is there any chance you could share your dataset and code with me (optionally via a private channel) so that I can have a look? I suspect there might be something wrong in the dataflow that would explain your running out of memory, but I could be overly optimistic. At any rate, an alternative way of shrinking the embedding matrix would be to decrease the number of tokens in your vocabulary, e.g. using a WordPiece tokenizer to break uncommon tokens down into known ones (including phonetic tokens if needed). This way you might end up with a matrix with (way) fewer rows, which might also be a good thing for your modeling (depending on the number of rarely-used tokens in the dataset). Best, |
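As an illustration of that vocabulary-reduction idea, here is a sketch using the tfds subword encoder from the TF 2.0-era text tutorials (the corpus file and target vocabulary size are assumptions; in recent tfds versions this class lives under tfds.deprecated.text):

import tensorflow_datasets as tfds

# Hypothetical generator yielding the raw sentences of the corpus.
def corpus_generator():
    with open("sentences.txt", encoding="utf-8") as file:
        for line in file:
            yield line.strip()

# Build a subword vocabulary of roughly 32k tokens instead of 366k word-level
# ones; rare words get broken down into smaller, known subword units.
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator(), target_vocab_size=2**15
)
print(encoder.vocab_size)  # roughly 32k rows in the embedding matrix
print(encoder.encode("An example sentence with uncommon tokens."))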
Hi Paul, thank you for your response. My dataset has 16455928 elements/sentences, so in order to make sure that shuffling is perfect I must use a buffer_size equal to or greater than the size of my full dataset (mentioned here); otherwise some elements might never be seen by the model, or be seen more than once, during an epoch. Regards, |
@rishabhsahrawat Will it be possible to create minimal reproducible code and share it with us, so we can move faster? Thanks! |
That is correct; however, I guess an imperfect shuffling might be better than no shuffling at all if the dataset has some inherent order - but it might already be shuffled "by nature", or could probably be shuffled on disk outside of TensorFlow. I leave it up to you to see what best suits your use case!
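A sketch of the "shuffle on disk outside of TensorFlow" option (the file names are hypothetical; this only needs the raw text in memory, not the embedded representations):

import random

# One-off shuffle of the raw sentence file, done before training; it removes
# the need for a huge in-memory shuffle buffer inside the tf.data pipeline.
with open("sentences.txt", encoding="utf-8") as file:
    lines = file.readlines()

random.seed(0)
random.shuffle(lines)

with open("sentences_shuffled.txt", "w", encoding="utf-8") as file:
    file.writelines(lines)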
I just tested again using your exact code; I agree on the number of parameters, but it still takes "only" 1.4 ~ 1.5 GB of RAM, at least after instantiation and after feeding it a Tensor (to ensure the weights are indeed built)... But maybe there is something somewhere in your code that triggers the creation of multiple copies of the weights?
No problem, that was to be expected.
That is indeed partially similar, but there is also the breaking down of uncommon tokens into phonetic units which, in my short experience, can greatly decrease vocabulary size - but again, this is merely an abstract suggestion and it is up to you to see whether it suits your needs given your data and application context :) To conclude, it seems that you are encountering a memory issue, but whether it stems from the mere size of your model or from a bug somewhere is still unclear... Edit: after writing the code below I tried running it and I did run out of GPU memory during training (the first batch went well, the second triggered an allocation error). @gadagashwini I leave it up to Rishabh to provide more details about the model he is using, notably as to the output dimensionality, but I guess a first approximation would be:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=366856, output_dim=1000, input_length=4),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(1000)),
    tf.keras.layers.Dense(500, activation='relu'),
    tf.keras.layers.Dense(64, activation='softmax')
])
model.compile('adam', 'sparse_categorical_crossentropy', ['sparse_categorical_accuracy'])
mock_inputs = tf.random.uniform((64, 4), 1, 366856, tf.int64)
mock_target = tf.random.uniform((64, 1), 0, 64, tf.int64)
model.fit(mock_inputs, mock_target, batch_size=32, epochs=5) |
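For a sense of why training (as opposed to merely building) such a model can strain a 16 GB GPU, here is a rough back-of-the-envelope estimate (my own, not from the thread), assuming float32 weights and the two slot variables that Adam keeps per parameter:

# Embedding matrix only: one float32 weight per (token, dimension) pair.
embedding_params = 366_856 * 1_000            # ~367M parameters
bytes_per_param = 4                           # float32

weights = embedding_params * bytes_per_param  # ~1.5 GB of weights
gradients = weights                           # a same-sized gradient tensor
adam_slots = 2 * weights                      # Adam's 'm' and 'v' accumulators

total = weights + gradients + adam_slots
print(f"~{total / 1e9:.1f} GB before activations and temporary buffers")  # ~5.9 GB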
Hi @gadagashwini , I cannot share the dataset unfortunately, but I can share the model architecture example. These example layers, also from here, throw the same errors about memory being full, the GPU being exhausted, etc. The updated layers look like this -
If I use output_dim <= 150 in the Embedding layer, then training starts without any problems. This was all on 2 GPUs. |
Hi @pandrey-fr , I just tried your dummy model. As you mentioned, training on a single GPU also triggered the memory allocation problem, once during epoch 2 (like yours) and once during epoch 5. On the CPU it takes about 2-3 GB of RAM, but as soon as the model's fit function is executed, the RAM usage increases and settles near 14-15 GB.
Thank you! |
Hi @rishabhsahrawat, Thank you for the details and feedback; let us hope the issue gets picked up by someone who can clarify what is going on, and especially disentangle expected behaviours from potential bugs. Just a side note, but as your model's output dimension is the same as the input embedding one (I guess you are doing some kind of language modelling task, e.g. predicting masked or next tokens in a sentence), you might want to re-use the embedding matrix as a kernel on your output layer (this is notably what is done in Transformer models - well, not in the Google tutorial, but in the paper and reference implementations); this greatly diminishes the number of trainable weights :) (note that you have to tweak a little to do that, either with a custom layer or the functional API, but it really is not much of a difficulty) |
I am not sure if I understood it correctly, but I think you mean creating something like a 'word2vec' model? If not, can you also share some helpful link to understand it better? |
UPDATE |
What I mean is doing something like this (with a possibly more-general implementation):

import tensorflow as tf

class SharedKernelSoftmax(tf.keras.layers.Layer):
    # Softmax output layer whose kernel is a shared, externally-owned matrix
    # (e.g. an embedding matrix), so that those weights are not duplicated.

    def __init__(self, kernel, bias_initializer='zeros', **kwargs):
        super().__init__(**kwargs)
        self.kernel = kernel
        self.bias = self.add_weight(
            name='bias', shape=(kernel.shape[1],), dtype=kernel.dtype,
            initializer=bias_initializer
        )

    def call(self, inputs, **kwargs):
        output = tf.keras.backend.dot(inputs, self.kernel) + self.bias
        return tf.nn.softmax(output, axis=-1)

And using such a layer (instantiated by being passed a shared embedding matrix of the output vocabulary) instead of the final softmax in your model. That being said, I now see that your input and output layers have different dimensions (different vocabularies, I guess?), so this is actually not relevant here (it would be if your input and output vocabularies were the same, or if you used a sequence-to-sequence model where tokens in the output vocabulary are also embedded before being passed to a decoder model). Sorry! |
Interesting... Could you monitor RAM usage and share some indications as to the amount of memory used when training on CPU? |
Thank you for clarifying my confusion. As you noticed, the input and output layers do not have the same dimensions, so it will not be useful; however, since I am tokenizing the data, I am saving the encodings to a token file which I can load later, without tokenizing the whole dataset on each run. I am now training the model with output_dim 1000 on CPU, and I can see that most of the memory is taken, but the amount of unused memory keeps changing - sometimes 3 MB, sometimes 600 MB, and sometimes even 3 GB. It is still running the first epoch right now. The %CPU usage even goes up to 1000. I am using |
@ymodak @gadagashwini I just found out that if I train on 2 GPUs it takes 1s per step, but when I use all 4 GPUs it takes 2s per step. I think it should train faster on more GPUs. |
If you use a distribution strategy, could you try |
Hi @yuefengz , thank you for your suggestion. I tried this by implementing it in the same way as
I read about it here too.
The error log is huge, so I am sharing only the last part of it. If you require it, I can share the full log. |
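The comment above does not show exactly which strategy was tried; as a generic illustration, this is how a tf.distribute strategy scope typically wraps model creation (the strategy choice and model are placeholders, with MirroredStrategy being the one explicitly mentioned earlier in this thread):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Model variables (including the embedding matrix) must be created
    # inside the scope so the strategy can place/replicate them.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=366856, output_dim=100),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(1000)),
        tf.keras.layers.Dense(64, activation='softmax'),
    ])
    model.compile('adam', 'sparse_categorical_crossentropy')

# model.fit(dataset, epochs=...) is then called outside the scope as usual.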
I have a question for all of you, @ymodak @gadagashwini @yuefengz : do you think that if I use TF 1.14 I can train the embedding layer with a bigger dimension on multiple GPUs efficiently, without the problems I am facing in TF 2.0? If yes, could you please share some helpful links for converting the code to TF 1.x? I am new to TF 1.x. Thank you! |
I have another question. I want to continue training after loading the last saved model, for which I am defining
Even if I choose a single GPU, it still ends up with the same error. If I do not use |
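For reference, a minimal sketch of resuming training from a saved Keras model under a distribution strategy (the checkpoint path and dataset are hypothetical; the thread does not show the exact setup being used):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Reload the previously saved model; its variables are then created
    # (and replicated) under the strategy, so training can resume.
    model = tf.keras.models.load_model("saved_model_dir")

# Hypothetical: `train_dataset` is the same tf.data pipeline used initially.
# model.fit(train_dataset, epochs=5)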
@rishabhsahrawat You might want to try
In my limited experience, if you are using the Keras API, you do not need to change that many things to run code compatible with TF 1.14. The main changes are Eager execution being disabled (which disables a few things and requires a stricter implementation of some others, notably as to type casting, which is somewhat more flexible with Eager enabled), and the necessity, when not using Keras, to manually set up placeholders, session objects and initialization instructions; but the latter points are automated through the Keras backend, so you probably will not have to go through them. Good luck, and if you run into specific problems, do ask for support :-) |
Hi @ymodak , any updates on the issue? I am still struggling with it. |
For the problem with |
With the latest nightly build, I am able to use a larger dimension (up to 1000) but not bigger. Also, after raising this issue I started building the model in TF 1.14, and it has the same issue: values up to 400 work, but not higher. |
Hi Rishabh, Generally, an embedding layer in Keras uses more RAM than you would expect. To get the RAM requirements, compute: no. of tokens * no. of dimensions * dtype size * approx. 15. So for 400k tokens, 1k dimensions and a 32-bit dtype you would need about 1.6 GB * 15, which is about 24 GB. This probably loads on the CPU because it is able to use your swap file. If the weights were simply in a NumPy array they would only take up about 1.6 GB, but Keras / TF is really greedy. I really wish it were less greedy, because it would be so much faster to start up a model =( |
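Spelling that rule of thumb out (the x15 factor is the previous commenter's empirical estimate, not an official figure):

# Raw size of the weights themselves, versus the commenter's rule of thumb
# for what Keras/TF ends up using in practice.
num_tokens = 400_000
embedding_dim = 1_000
bytes_per_value = 4  # 32-bit floats

raw_matrix = num_tokens * embedding_dim * bytes_per_value  # 1.6e9 bytes ~ 1.6 GB
estimated_usage = raw_matrix * 15                          # ~ 24 GB

print(f"raw matrix: {raw_matrix / 1e9:.1f} GB, estimated usage: {estimated_usage / 1e9:.0f} GB")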
@rishabhsahrawat Closing this, as it is not a bug but a feature request for supporting sharded embeddings in sync training. Please raise a feature request, as this is working as intended. Thanks! |
I have got the same error when using CentralStorageStrategy (2.4.1). It can be reproduced by changing the strategy in https://github.com/tensorflow/docs/blob/master/site/en/tutorials/distribute/parameter_server_training.ipynb. It only appears when I use multiple GPUs and an embedding in my model. For a single GPU it works, and for models that do not have an embedding it also works. Log:
@yuefengz has it been fixed? |
@liyinhgqw @yuefengz did you fix it? I got the same error when changing from MirroredStrategy to CentralStorageStrategy (because of OOM), on Python 3.7, TensorFlow 2.4, Linux. |
I am using the TF 2.0 latest nightly build and I am trying to train an LSTM model for text classification on a very large dataset of 16455928 sentences. For the embedding layer in the model, I have a vocabulary size of 366856 and I used 1000 as the embedding dimension, on which the 2 GPUs (Tesla T4, from Google) ran out of memory.
Since I cannot lower the size of the vocabulary (maybe there is a way), I used a lower value for the embedding dimension (100), with which the model starts training. Now my question is whether there is a way I can use a higher embedding dimension - maybe by putting sets of layers of my model on different GPUs, and if so, what is the way to do that in TF 2.0? Also, will using more GPUs help? Thank you!