Error while training DALL-E on a single TPU (8 cores) #9
Comments
@mkhoshle Your code seems to fail loading the COCO dataset. Make sure everything runs okay with the --fake_data flag, and check --train_dir and --val_dir.
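As a quick sanity check before launching training, a snippet like the following (the paths are placeholders for whatever is passed to --train_dir and --val_dir) can confirm that the directories are actually visible from the runtime:

```python
# Quick sanity check: count image/caption files in the data directories from the
# same runtime that will run training. Paths are placeholders for --train_dir / --val_dir.
from pathlib import Path

for split, root in [("train", "/path/to/coco/train"), ("val", "/path/to/coco/val")]:
    root = Path(root)
    images = [p for ext in ("*.jpg", "*.jpeg", "*.png") for p in root.rglob(ext)]
    texts = list(root.rglob("*.txt"))
    print(f"{split}: {len(images)} images, {len(texts)} texts under {root}")
```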
@tgisaturday No, the issue is not with loading. I can see that one process sees the folder of 94629 images and texts while the rest see 0 images and texts, which is weird. The --fake_data flag is set to False and the directory paths are correct.
@mkhoshle I've tested with CC3M and COCO and there weren't any similar symptoms. Double-check all your settings and show me how to reproduce the issue.
@tgisaturday Here is my code. You can see the error in my Colab notebook:
@mkhoshle You have to use the PyTorch Lightning DataModule to run the code without problems. I can't debug every piece of custom code that doesn't follow the framework.
@tgisaturday I have followed your code examples to do this. What do you mean I need to use the DataModule?
@mkhoshle Not using TextImageDataModule here can cause problems (see dalle-lightning/pl_dalle/loader.py, line 116 at commit 987a581).
For example, using a plain torchvision Dataset class and feeding only a DataLoader to the Lightning Trainer causes OOM on large pods. If that is not the case, start debugging with only one TPU core; sometimes a hidden error gets revealed.
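For reference, a minimal sketch of the DataModule pattern (this is not the repo's actual TextImageDataModule, which also pairs captions with images; it is a generic image-folder version showing the structure the Lightning Trainer expects instead of a bare DataLoader):

```python
# Minimal sketch of a LightningDataModule, assuming an ImageFolder-style dataset.
# Hypothetical stand-in for TextImageDataModule; it only illustrates the structure.
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class ImageFolderDataModule(pl.LightningDataModule):
    def __init__(self, train_dir, val_dir, batch_size=8, num_workers=4):
        super().__init__()
        self.train_dir, self.val_dir = train_dir, val_dir
        self.batch_size, self.num_workers = batch_size, num_workers
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(256),
            transforms.ToTensor(),
        ])

    def setup(self, stage=None):
        # setup() runs on every process, so each TPU core builds its own dataset view.
        self.train_set = datasets.ImageFolder(self.train_dir, self.transform)
        self.val_set = datasets.ImageFolder(self.val_dir, self.transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          num_workers=self.num_workers, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size,
                          num_workers=self.num_workers)

# Debugging on a single TPU core first, as suggested above (PL 1.x Trainer argument):
# trainer = pl.Trainer(tpu_cores=1, max_epochs=1)
# trainer.fit(model, datamodule=ImageFolderDataModule("/path/to/train", "/path/to/val"))
```

Because the dataloaders are returned from the DataModule rather than constructed by hand, Lightning can handle per-core distribution itself, which is roughly what feeding a raw DataLoader to the Trainer skips.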
@tgisaturday I believe they're using the Colab notebook from the repository, if you're not aware, or a rendition of it.
The reason only one process sees the folder of 94629 images is that you have set num_workers to 1. num_workers is the number of processes that handle data loading in a multi-processing manner; it has nothing to do with the number of TPU cores. However, none of TPU cores 0-7 are being fed data. This could be a device allocation error, a dataloader error, or something else that is not visible in the current Colab notebook.
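To illustrate the distinction with a generic PyTorch snippet (not the repo's loader): num_workers only changes how many subprocesses feed a DataLoader, while the per-core processes on a TPU come from the Lightning/XLA launcher.

```python
# num_workers controls how many subprocesses feed a DataLoader; it is unrelated to
# how many TPU cores receive data. Generic PyTorch example, not the repo's loader.
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

    # Same dataset, different loading parallelism; the batch count is identical.
    one_worker = DataLoader(dataset, batch_size=8, num_workers=1)
    four_workers = DataLoader(dataset, batch_size=8, num_workers=4)

    print(sum(1 for _ in one_worker), sum(1 for _ in four_workers))  # 8 8
```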
Hi, I am trying to train DALL-E on the COCO dataset, and here are the parameters I use:
When running, I get the following error:
One process sees the folder of 94629 images and texts and the rest see 0 images and texts. I do not understand why this is happening. Could you please help me with this? Any ideas?