Enable distributed sample image generation on multi-GPU environment #1061
Modify to attempt to enable multi-GPU inference
additional VRAM checking, refactor check_vram_usage to return string for use with accelerator.print
remove sample image debug outputs
Thanks for the PR. However, do we really need to distribute the sample image generation across multiple GPUs? I don't think it would take that long. I am worried about increasing the complexity of the code.
Honestly? For most use cases, where there are only one or two sample images, it probably has little effect. However, in scenarios where we are training multiple concepts, or are OCD like me and have multiple sample prompts, the ability to spread the load to the idle GPUs does reduce the downtime during sample generation, especially considering that SDXL (and possibly future, more complex and heavier models) requires more steps to generate images. For context, samplers like k_dpm_2_a and k_dpm_2 automatically double the number of steps, and when some SDXL models recommend sampling for 30~60 steps, a single image can, depending on the GPU and image resolution, take around 5 minutes to render. Additionally, for users like me running on free resources like Colab and Kaggle, every minute we save is one more minute we can put into training LoRAs and models.
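As a rough illustration of the argument above, the sketch below estimates sampling downtime when prompts are spread evenly over GPUs. The 5-minute render time is the figure quoted in the comment. The count of 8 prompts and the helper name `sampling_downtime_minutes` are assumptions made up for this example, and it presumes every prompt takes the same time to render.

```python
import math


def sampling_downtime_minutes(num_prompts, minutes_per_image, num_gpus):
    """Wall-clock minutes if prompts are split evenly and GPUs run in parallel."""
    # The busiest GPU renders ceil(num_prompts / num_gpus) images back to back.
    images_on_busiest_gpu = math.ceil(num_prompts / num_gpus)
    return images_on_busiest_gpu * minutes_per_image


# Illustrative numbers only: 8 prompts (assumed), 5 minutes per image (from the comment).
one_gpu = sampling_downtime_minutes(8, 5, 1)
two_gpus = sampling_downtime_minutes(8, 5, 2)
print(one_gpu, two_gpus)
```

Under these assumptions the pause for sampling halves when the second GPU is used, which is the saving being argued for.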
simplify per process prompt
I am still skeptical about distributed sample generation. If we are doing large-scale training, we would probably have separate resources to evaluate the saved models, and with Colab and Kaggle, we would probably want to do more steps of training instead of sample output... However, I think the code is much simpler now. May I merge this PR? However, there are a few points of concern now, and I would rewrite the code after merging, even if it is redundant, for the sake of clarity. I would appreciate your understanding and would ask you to test the changes again.
Sure! Of course you are free to modify the code as you like! It's your code that I modified in the first place.
I merged this to the new branch. Thank you again for this PR!
Hey, I did a test run in the Kaggle environment on textual inversion, and on the first sample run I encountered a VRAM OOM when it went to the latents-to-image step. As mentioned in #1019, the workaround I came up with was to insert a call to torch.cuda.empty_cache() after the latents have been generated and before the latents are converted into images.
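A minimal sketch of where such a call could sit, assuming a two-stage sampler in which one callable produces latents and another decodes them. The names `release_cached_vram`, `sample_image`, `denoise`, and `decode` are hypothetical and are not the repository's actual functions; the import guard only exists so the sketch also runs on a machine without PyTorch.

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def release_cached_vram():
    """Free cached CUDA memory; returns True only if a CUDA cache was cleared."""
    gc.collect()  # drop unreachable Python objects that may still hold tensors
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand unused cached blocks back to the driver
        return True
    return False


def sample_image(denoise, decode):
    """Hypothetical two-stage sampler: `denoise` makes latents, `decode` makes the image."""
    latents = denoise()
    release_cached_vram()  # the workaround: clear the cache between the two stages
    return decode(latents)
```

Note that `torch.cuda.empty_cache()` only releases memory that PyTorch has cached but is no longer using, so it helps the decode step only if the tensors from the denoising loop have already gone out of scope.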
Testing on Kaggle when training LoRAs works fine, though.
Thank you for testing! I've added cuda.empty_cache at that position.
Yesterday I helped one of my Patreon supporters, who had dual RTX 4090s on Linux. With 1 GPU the training speed was 1.2 it/s; when 2 GPUs were used, the training speed dropped to 2 s/it. Cumulatively, it literally became slower than a single card.
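To make the comparison above concrete, here is a small back-of-the-envelope sketch. It assumes data-parallel training in which each iteration consumes one batch per GPU; that assumption and the helper name `batches_per_second` are not from the comment, only the 1.2 it/s and 2 s/it figures are.

```python
def batches_per_second(iters_per_second, num_gpus):
    # Assumes each iteration consumes one batch per GPU (data-parallel training).
    return iters_per_second * num_gpus


single = batches_per_second(1.2, 1)      # 1.2 it/s on one GPU
dual = batches_per_second(1 / 2.0, 2)    # 2 s/it on two GPUs is 0.5 it/s
print(f"single: {single:.2f} batches/s, dual: {dual:.2f} batches/s")
```

Even after counting both GPUs' batches, the reported dual-GPU run comes out below the single-GPU run under this assumption.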
Revert bitsandbytes-windows update
Modified the sample_images_common function to use Accelerate's PartialState to feed the list of sample image prompts to all available GPUs.
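In Accelerate, this kind of distribution is typically done with the `PartialState().split_between_processes(...)` context manager, which hands each process its own share of a list. The pure-Python sketch below only imitates the idea so that it runs without GPUs or Accelerate installed. `split_for_process` is a hypothetical helper, and the exact chunk boundaries Accelerate chooses may differ from the ones shown here.

```python
def split_for_process(items, num_processes, process_index):
    """Return the contiguous slice of `items` that one process would handle."""
    base, extra = divmod(len(items), num_processes)
    # The first `extra` processes take one additional item each.
    start = process_index * base + min(process_index, extra)
    end = start + base + (1 if process_index < extra else 0)
    return items[start:end]


prompts = ["prompt A", "prompt B", "prompt C", "prompt D", "prompt E"]
shards = [split_for_process(prompts, 2, rank) for rank in range(2)]
for rank, shard in enumerate(shards):
    print(f"process {rank} renders {shard}")
```

With a single process the helper returns the whole list, which matches the single-GPU case working unchanged.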
Tested and working fine in a single-GPU Google Colab environment and a dual-GPU Kaggle environment.
A possible side effect of using multiple GPUs to generate sample images is that the file creation times may not match the order of the prompts in the original prompt file. I attempted some mitigation by splitting the prompts passed to each GPU process in the order in which the GPU processes are called. However, if the sample image prompts use different samplers and/or numbers of steps, this would likely break the workaround, as generation times would be out of sync.
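One way to read the mitigation above is a round-robin assignment, where prompt i goes to process i modulo the number of processes. The sketch below shows that reading and adds a separate idea that is not part of this PR: putting the original prompt index into the file name, so ordering no longer depends on creation time. Both `assign_round_robin` and `sample_filename` are hypothetical names.

```python
def assign_round_robin(prompts, num_processes):
    """Give prompt i to process i % num_processes, keeping its original index."""
    per_process = [[] for _ in range(num_processes)]
    for index, prompt in enumerate(prompts):
        per_process[index % num_processes].append((index, prompt))
    return per_process


def sample_filename(index):
    # Hypothetical naming scheme: a zero-padded prompt index, so sorting by name
    # recovers prompt-file order regardless of when each file was written.
    return f"sample_{index:03d}.png"


prompts = ["cat", "dog", "bird", "fish", "frog"]
work = assign_round_robin(prompts, 2)
names = sorted(sample_filename(i) for shard in work for i, _ in shard)
print(work)
print(names)
```

Sorting by name rather than by timestamp keeps the output in prompt order even when samplers or step counts differ between prompts.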
It might be possible to artificially force synchronization by making each sample image process wait for all other processes to complete the image generation step before continuing to the next sample image, by using
accelerator.wait_for_everyone()
but I imagine efficient use of GPU time would be more important than perfectly sorted sample images based on image creation time.
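The lockstep idea can be sketched without GPUs by letting threads stand in for the per-GPU processes and `threading.Barrier` stand in for `accelerator.wait_for_everyone()`. This is a simulation under those assumptions, not the PR's code, and `run_lockstep` is a hypothetical name.

```python
import threading


def run_lockstep(shards):
    """Each worker handles one prompt per round, then waits for the others."""
    rounds = max(len(shard) for shard in shards)
    barrier = threading.Barrier(len(shards))
    finished = []  # (round, worker) pairs in completion order
    lock = threading.Lock()

    def worker(rank):
        for step in range(rounds):
            if step < len(shards[rank]):
                with lock:
                    finished.append((step, rank))  # stands in for saving an image
            # Every worker reaches the barrier each round, even with no prompt left.
            barrier.wait()  # plays the role of accelerator.wait_for_everyone()

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(shards))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished


log = run_lockstep([["cat", "bird", "frog"], ["dog", "fish"]])
round_order = [step for step, _ in log]
print(log)
```

Images from round k are always written before any image from round k + 1, but the order inside a round is still not fixed, and a fast worker sits idle at the barrier until the slowest one finishes. That idle time is the cost weighed above.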