Optimizing for low memory usage. #129
-
I can do around 6 GB using the right optimizations.
…On Mon, Jan 9, 2023, 2:22 PM Simo Ryu wrote:
2023/01/10: Currently, with gradient checkpointing, it seems like 12163MiB is enough. Can someone confirm this?
-
Yes.
…On Mon, Jan 9, 2023, 3:02 PM Simo Ryu wrote:
That is for training both Unet + CLIP right?
-
I am using bmaltais's kohya repo and I'm successfully running LoRA on my 1060 6GB, using the default settings from https://github.com/bmaltais/kohya_ss. It seems to be performing well, but I'm a noob, so I don't know how to use TensorBoard, show stats, etc.
-
Hi, thank you for the great work! In my experience, xformers and 8-bit Adam (bitsandbytes) are very effective at reducing memory usage. In addition, gradient checkpointing is also useful; however, it requires setting the model to training mode with model.train(). I found this in the Textual Inversion example in Diffusers. This finally reduces the amount of memory required for batch size 1 from 8 GB to 6 GB.
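(A minimal sketch of wiring up the three options above with diffusers and bitsandbytes; the checkpoint name and learning rate are placeholders, not values from this thread.)

```python
# Hypothetical setup, not the exact training script discussed here.
import bitsandbytes as bnb
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"  # placeholder checkpoint
).to("cuda")

# Memory-efficient attention via xformers.
unet.enable_xformers_memory_efficient_attention()

# Gradient checkpointing: the model must be in training mode (model.train()),
# otherwise the checkpointed segments are not recomputed with gradients.
unet.enable_gradient_checkpointing()
unet.train()

# 8-bit Adam from bitsandbytes in place of torch.optim.AdamW.
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=1e-4)
```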
-
Thank you for the help, @kohya-ss, @d8ahazard!!
-
hey @cloneofsimo, one thing that REALLY helps me (I'm using a Colab TPU, so basically limited to 8GB; even though it's 8x of those, they're replicated) is this thing they're calling "latent caching". Basically, you don't need the VAE during training, and thus you don't need the full images/pixels. So it's not only the ~300MB of VAE parameters you can leave out; it's also the savings on processing them (and maybe the grads around them), plus the 24X savings you get from using the latents instead of the pixels as your input data set... here's how I did it (which is a little different from the PR example in the HF lib):
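(A rough sketch of the latent-caching idea, not the exact snippet from this comment, assuming a diffusers AutoencoderKL; the checkpoint name and paths are placeholders.)

```python
# Hypothetical sketch: encode every training image to VAE latents once, cache
# them to disk, and train on the cached latents so the VAE never needs to be
# loaded (or run) during the training loop itself.
import torch
from diffusers import AutoencoderKL
from PIL import Image
from torchvision import transforms

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"  # placeholder checkpoint
).to("cuda").eval()

preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # map pixels to [-1, 1]
])

@torch.no_grad()
def cache_latent(image_path: str, out_path: str) -> None:
    pixels = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to("cuda")
    # 0.18215 is the usual SD 1.x latent scaling factor.
    latent = vae.encode(pixels).latent_dist.sample() * 0.18215
    torch.save(latent.squeeze(0).half().cpu(), out_path)

# During training, the dataset loads cached latents directly, e.g.:
# latents = torch.load("cache/img_0001.pt").float().unsqueeze(0).to("cuda")
```

For scale: a 512x512x3 uint8 image is ~768KB, while the corresponding 64x64x4 fp16 latent is ~32KB, which lines up with the 24X figure mentioned above.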
Makes everything run faster in general, and it was one of the things I implemented to get rid of OOM errors. The other thing was to completely split apart the params to train: the ones getting the LoRA treatment are the ones I pass to the optimizer, and the rest are effectively frozen, so I put those in completely separate variables and just unify everything on the fly during the train step. That way the system doesn't get mad about contiguous memory not being available (UNet params are big, so stashing the non-attention parts first/separately has helped me overcome the OOM issues).
And, like my suspicion in the other thread about extraction of fine-tuned models, I have a suspicion that we can compress the $#!+ out of model parameters using this same method. Basically every ND array could get SVD-low-ranked, right? In the UNet, the smallest native rank is, I think, 320 for SD1.5... so there's a lot of room to compress there... which means the full model could be run on smaller/leaner devices. I mean, I know some of the weights need to keep their full dims because they're basically positionally-aligned convolutions, but there's probably a lot that can be compressed... I think we'd need some smart algo to check the S values, to dynamically find the rank for each layer that holds "enough" info, such that the final model output is still high quality. I plan to experiment with this at some point... but like that other thread, folks smarter than me getting to it first is probably better for everyone hehe
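(Not from this thread: a small sketch of the "check the S values" idea, truncating a weight matrix to the smallest rank that keeps a chosen fraction of the singular-value energy; the threshold and shapes are illustrative.)

```python
# Hypothetical sketch: pick a per-layer rank from the singular values, keeping
# the smallest rank whose cumulative "energy" (sum of S^2) reaches a threshold.
import torch

def low_rank_factors(weight: torch.Tensor, energy: float = 0.99):
    # weight: a 2D tensor; conv kernels would need reshaping to 2D first.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    cum = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
    rank = int((cum < energy).sum().item()) + 1
    A = U[:, :rank] * S[:rank]   # (out, rank)
    B = Vh[:rank, :]             # (rank, in)
    return A, B, rank            # A @ B approximates the original weight

W = torch.randn(768, 320)        # placeholder weight matrix
A, B, r = low_rank_factors(W)
rel_err = (torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)).item()
print(f"kept rank {r} of {min(W.shape)}, relative error {rel_err:.3f}")
```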
-
So this is the part I didn't really care about, but I think I need to optimize for memory performance. However, the machine I'm using has a 3090, so I can't verify any claims about GPU usage on my own.
I know for a fact that using LoRA reduces memory usage, but I am not familiar with further techniques other than theoretical ones.
@d8ahazard's repo has huge discussions related to reducing memory usage. It seems like xformers + bnb is a huge deal.
@kohya-ss's training pipeline can be a help as well. Just throwing out ideas.
https://github.com/kohya-ss/sd-scripts
https://github.com/d8ahazard/sd_dreambooth_extension/discussions?discussions_q=memory
2023/01/10
Currently, with gradient checkpointing, it seems like 12163MiB is enough. Can someone confirm this?
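(For anyone trying to verify figures like this, a small sketch of reading peak GPU memory from PyTorch; nvidia-smi will report somewhat more, since it also counts the CUDA context and the allocator cache.)

```python
# Sketch: measure peak GPU memory around a few training steps.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a handful of training steps here ...

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**20:.0f} MiB")
```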