Optimizing for low memory usage. #129
-
I can do around 6 GB using the right optimizations.
…On Mon, Jan 9, 2023, 2:22 PM Simo Ryu wrote:
2023/01/10: Currently, with gradient checkpointing, it seems like 12163MiB is enough. Can someone confirm this?
-
Yes.
…On Mon, Jan 9, 2023, 3:02 PM Simo Ryu wrote:
That is for training both Unet + CLIP right?
-
I am using bmaltais's kohya repo and I'm successfully running LoRA on my 1060 6GB, using the default settings from https://github.com/bmaltais/kohya_ss. It seems to be performing well, but I'm a noob, so I don't know how to use TensorBoard, show stats, etc.
-
Hi, thank you for the great work! In my experience, xformers and 8-bit Adam (bitsandbytes) are very effective at reducing memory usage. In addition, gradient checkpointing is also useful; however, it requires setting the model to training mode with model.train(). I found this in the Textual Inversion example in Diffusers. This finally reduces the amount of memory required for batch size 1 from 8 GB to 6 GB.
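(A minimal sketch of wiring up the three options above with diffusers and bitsandbytes; the checkpoint name and learning rate are placeholders, not values from this thread.)

```python
# Hypothetical setup, not the exact training script discussed here.
import bitsandbytes as bnb
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # placeholder checkpoint
).to("cuda")

# Memory-efficient attention via xformers.
unet.enable_xformers_memory_efficient_attention()

# Gradient checkpointing: the model must be in training mode (model.train()),
# otherwise the checkpointed segments are not recomputed with gradients.
unet.enable_gradient_checkpointing()
unet.train()

# 8-bit Adam from bitsandbytes in place of torch.optim.AdamW.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-4)
```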
-
Thank you for the help, @kohya-ss, @d8ahazard!!
-
hey @cloneofsimo, one thing that REALLY helps me (I'm using a Colab TPU, so basically limited to 8GB; even though it's 8x of those, they're replicated) is this thing they're calling "latent caching". Basically, you don't need the VAE during training, and thus you don't need the full images/pixels. So it's not only the ~300MB of VAE parameters you can leave out; it's also the savings on processing them (and maybe the grads around them), plus the 24X savings you get from using the latents instead of the pixels as your input data set... here's how I did it (which is a little different from the PR example in the HF lib):
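(A rough sketch of the latent-caching idea, not the exact snippet from this comment, assuming a diffusers AutoencoderKL; the checkpoint name and paths are placeholders.)

```python
# Hypothetical sketch: encode every training image to VAE latents once, cache
# them to disk, and train on the cached latents so the VAE never needs to be
# loaded (or run) during the training loop itself.
import torch
from diffusers import AutoencoderKL
from PIL import Image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # placeholder checkpoint
).to("cuda").eval()

preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1]
])

@torch.no_grad()
def cache_latent(image_path: str, out_path: str) -> None:
    pixels = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to("cuda")
    # 0.18215 is the usual SD 1.x latent scaling factor.
    latent = vae.encode(pixels).latent_dist.sample() * 0.18215
    torch.save(latent.squeeze(0).half().cpu(), out_path)

# During training, the dataset loads cached latents directly, e.g.:
# latents = torch.load("cache/img_0001.pt").float().unsqueeze(0).to("cuda")
```

For scale: a 512x512x3 uint8 image is ~768KB, while the corresponding 64x64x4 fp16 latent is ~32KB, which lines up with the 24X figure mentioned above.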
Makes everything run faster in general, and it was one of the things I implemented to get rid of OOM errors. The other thing was to completely split apart the params to train: the ones getting the LoRA treatment are the ones I pass to the optimizer, and the rest are effectively frozen, so I put those in completely separate variables and just unify everything on the fly during the train step. That way the system doesn't get mad about contiguous memory not being available (UNet params are big, so stashing the non-attention parts first/separately has helped me overcome the OOM issues).
And, like my suspicion in the other thread about extraction of fine-tuned models, I have a suspicion that we can compress the $#!+ out of model parameters using this same method. Basically every ND array could get SVD-low-ranked, right? In the UNet, the smallest native rank is, I think, 320 for SD1.5... so there's a lot of room to compress there... which means the full model could be run on smaller/leaner devices. I mean, I know some of the weights need to keep their full dims because they're basically positionally-aligned convolutions, but there's probably a lot that can be compressed... I think we'd need some smart algo to check the S values, to dynamically find the rank for each layer that holds "enough" info, such that the final model output is still high quality. I plan to experiment with this at some point... but like that other thread, folks smarter than me getting to it first is probably better for everyone hehe
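(Not from this thread: a small sketch of the "check the S values" idea, truncating a weight matrix to the smallest rank that keeps a chosen fraction of the singular-value energy; the threshold and shapes are illustrative.)

```python
# Hypothetical sketch: pick a per-layer rank from the singular values, keeping
# the smallest rank whose cumulative "energy" (sum of S^2) reaches a threshold.
import torch

def low_rank_factors(weight: torch.Tensor, energy: float = 0.99):
    # weight: a 2D tensor; conv kernels would need reshaping to 2D first.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
    rank = int((cum < energy).sum().item()) + 1
    A = U[:, :rank] * S[:rank]   # (out, rank)
    B = Vh[:rank, :]             # (rank, in)
    return A, B, rank            # A @ B approximates the original weight

W = torch.randn(768, 320)        # placeholder weight matrix
A, B, r = low_rank_factors(W)
rel_err = (torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)).item()
print(f"kept rank {r} of {min(W.shape)}, relative error {rel_err:.3f}")
```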
-
So this is the part I didn't really care about, but I think I need to optimize for memory performance. However, the machine I'm using has a 3090, so I can't verify any claims about GPU usage on my own.
I know for a fact that using LoRA reduces memory usage, but I am not familiar with further techniques other than theoretical ones.
@d8ahazard's repo has huge discussions related to reducing memory usage. It seems like xformers + bnb is a huge deal.
@kohya-ss's training pipeline can be a help as well. Just throwing out ideas.
https://github.com/kohya-ss/sd-scripts
https://github.com/d8ahazard/sd_dreambooth_extension/discussions?discussions_q=memory
2023/01/10
Currently, with gradient checkpointing, it seems like 12163MiB is enough. Can someone confirm this?
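(For anyone trying to verify figures like this, a small sketch of reading peak GPU memory from PyTorch; nvidia-smi will report somewhat more, since it also counts the CUDA context and the allocator cache.)

```python
# Sketch: measure peak GPU memory around a few training steps.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a handful of training steps here ...

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**20:.0f} MiB")
```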