Error while training DALL-E on a single TPU (8 cores) #9
Comments
@mkhoshle Your code seems to fail loading the COCO dataset. Make sure everything runs okay with the --fake_data flag, and check --train_dir and --val_dir.
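As a quick sanity check before launching training, a snippet like the following (the paths are placeholders for whatever is passed to --train_dir and --val_dir) can confirm that the directories are actually visible from the runtime:

```python
# Quick sanity check: count image/caption files in the data directories from the
# same runtime that will run training. Paths are placeholders for --train_dir / --val_dir.
from pathlib import Path

for split, root in [("train", "/path/to/coco/train"), ("val", "/path/to/coco/val")]:
    root = Path(root)
    images = [p for ext in ("*.jpg", "*.jpeg", "*.png") for p in root.rglob(ext)]
    texts = list(root.rglob("*.txt"))
    print(f"{split}: {len(images)} images, {len(texts)} texts under {root}")
```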
@tgisaturday No, the issue is not with loading. I can see that one process sees the folder of 94629 images and texts while the rest see 0 images and texts, which is weird. The --fake_data flag is set to False and the directory paths are correct.
@mkhoshle I've tested with CC3M and COCO and there weren't any similar symptoms. Double-check all your settings and show me how to reproduce the issue.
@tgisaturday Here is my code. You can see the error in my Colab notebook:
@mkhoshle You have to use the PyTorch Lightning DataModule to run the code without problems. I can't debug every piece of custom code that doesn't follow the framework.
@tgisaturday I have followed your code examples to do this. What do you mean I need to use the DataModule?
@mkhoshle Not using TextImageDataModule here can cause problems (see dalle-lightning/pl_dalle/loader.py, line 116 at commit 987a581).
For example, using a plain torchvision Dataset class and feeding only a DataLoader to the Lightning Trainer causes OOM on large pods. If that is not the case, start debugging with only one TPU core; sometimes a hidden error gets revealed.
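For reference, a minimal sketch of the DataModule pattern (this is not the repo's actual TextImageDataModule, which also pairs captions with images; it is a generic image-folder version showing the structure the Lightning Trainer expects instead of a bare DataLoader):

```python
# Minimal sketch of a LightningDataModule, assuming an ImageFolder-style dataset.
# Hypothetical stand-in for TextImageDataModule; it only illustrates the structure.
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class ImageFolderDataModule(pl.LightningDataModule):
    def __init__(self, train_dir, val_dir, batch_size=8, num_workers=4):
        super().__init__()
        self.train_dir, self.val_dir = train_dir, val_dir
        self.batch_size, self.num_workers = batch_size, num_workers
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(256),
            transforms.ToTensor(),
        ])

    def setup(self, stage=None):
        # setup() runs on every process, so each TPU core builds its own dataset view.
        self.train_set = datasets.ImageFolder(self.train_dir, self.transform)
        self.val_set = datasets.ImageFolder(self.val_dir, self.transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          num_workers=self.num_workers, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size,
                          num_workers=self.num_workers)

# Debugging on a single TPU core first, as suggested above (PL 1.x Trainer argument):
# trainer = pl.Trainer(tpu_cores=1, max_epochs=1)
# trainer.fit(model, datamodule=ImageFolderDataModule("/path/to/train", "/path/to/val"))
```

Because the dataloaders are returned from the DataModule rather than constructed by hand, Lightning can handle per-core distribution itself, which is roughly what feeding a raw DataLoader to the Trainer skips.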
@tgisaturday I believe they're using the Colab notebook from the repository, if you're not aware, or a rendition of it.
The reason only one process sees the folder of 94629 images is that you have set num_workers to 1. num_workers is the number of processes that handle data loading in a multi-processing manner; it has nothing to do with the number of TPU cores. However, none of TPU cores 0-7 are being fed data. This could be a device allocation error, a dataloader error, or something else that is not visible in the current Colab notebook.
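To illustrate the distinction with a generic PyTorch snippet (not the repo's loader): num_workers only changes how many subprocesses feed a DataLoader, while the per-core processes on a TPU come from the Lightning/XLA launcher.

```python
# num_workers controls how many subprocesses feed a DataLoader; it is unrelated to
# how many TPU cores receive data. Generic PyTorch example, not the repo's loader.
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

    # Same dataset, different loading parallelism; the batch count is identical.
    one_worker = DataLoader(dataset, batch_size=8, num_workers=1)
    four_workers = DataLoader(dataset, batch_size=8, num_workers=4)

    print(sum(1 for _ in one_worker), sum(1 for _ in four_workers))  # 8 8
```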
Hi, I am trying to train DALL-E on the COCO dataset, and here are the parameters I use:
When running, I get the following error:
One process sees the folder of 94629 images and texts and the rest see 0 images and texts. I do not understand why this is happening. Could you please help me with this? Any ideas?