Dropout and Max Norm Regularization for LoRA training #545
Conversation
Thank you for your pull request! It is quite interesting. I seldom employ constraints for training, but I think your proposed Max Norm Regularization seems effective. Typically, constraints are applied to individual layers, but in your proposed code the constraint is applied to the computed product of LoRA's up and down weights, which is interesting. I think this will work.

About the dropout, I am working on implementing it as well, but with a different method. In your proposal, it seems you have specified the dropout argument for the attention in xformers, and I have a small concern about this part. If we apply dropout inside the attention mechanism, the dropout is applied to calculation results that include the original Q/K/V weights after LoRA is added. As a result, won't LoRA also attempt to learn the weights of the original Q/K/V linear layers that were dropped out? This could make the effect of LoRA more pronounced during inference. Personally, I think it might be more common to apply dropout to the LoRA path itself, i.e. after lora_down.

Looking at the xformers source code, the GPUs that can apply dropout seem to be limited, as you've written. Therefore, if we're going to apply dropout in this way, it might be better to explicitly state that the conditions under which it can be used are limited. On my end, the implementation I'm working on applies dropout after lora_down, as mentioned above.

I would appreciate it if you could confirm the points about dropout above.
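For concreteness, a minimal sketch of the "dropout after lora_down" idea; the class and argument names here are illustrative, not the repository's actual LoRA module:

```python
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Hypothetical LoRA wrapper: dropout touches only the LoRA branch."""
    def __init__(self, org_module: nn.Linear, rank: int = 4, alpha: float = 1.0, dropout: float = 0.0):
        super().__init__()
        self.org_module = org_module  # frozen original linear layer
        self.lora_down = nn.Linear(org_module.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, org_module.out_features, bias=False)
        self.dropout = nn.Dropout(dropout)  # applied between lora_down and lora_up
        self.scale = alpha / rank

    def forward(self, x):
        # original output + LoRA delta; the frozen Q/K/V weights are never dropped
        return self.org_module(x) + self.lora_up(self.dropout(self.lora_down(x))) * self.scale
```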
You may be correct on the issue of dropout. The main thrust of this PR was originally Max Norm, as I hadn't yet figured out that the Cutlass kernel supported other GPUs. I will look into your suggestions there.

I'm fairly confident that Max Norm will work as expected, as I've been experimenting with post-processing LoRAs and fully merged models, and have found that key-wise rescaling can help reduce the effects of overtraining on both. This publicly available LoRA originally had a max key norm of about 6; when rescaled to 1 (center columns) the effect was definitely reduced, but its interaction with other LoRAs (not shown) was improved greatly. When applied during training, a norm constraint doesn't seem as drastic, as other weights are encouraged to be explored.

As a side note, in full models, Add Difference merges seem to blow up norms, leading to burning at lower and lower CFG, which is another motivation for keeping a LoRA layer from being overpowered.
Max Norm looks great! I have been annoyed by over-application when applying multiple LoRAs, so I would be happy if it improves that, even if only slightly. Also, thank you very much for the examination of the dropout! I had not verified it in detail myself, so it is very helpful. I will check and merge as soon as I have time!
Do you have any code available for rescaling max key norms on a pre-trained LoRA? It sounds pretty interesting.
I remember that when applying a LoRA, the weights will be scaled with `alpha / dim`.
I think this makes sense. @AI-Casanova I would like to add this modification; please let me know if you have any suggestions.
I'd like to modify it like this:

```python
down = state_dict[downkeys[i]].to(device)
up = state_dict[upkeys[i]].to(device)
alpha = state_dict[alphakeys[i]].to(device)
dim = down.shape[0]
scale = alpha / dim

# compute the effective up @ down weight for this key
if up.shape[2:] == (1, 1) and down.shape[2:] == (1, 1):
    # 1x1 conv LoRA: treat as a matmul, then restore the conv dims
    updown = (up.squeeze(2).squeeze(2) @ down.squeeze(2).squeeze(2)).unsqueeze(2).unsqueeze(3)
elif up.shape[2:] == (3, 3) or down.shape[2:] == (3, 3):
    # 3x3 conv LoRA: compose the two convolutions
    updown = torch.nn.functional.conv2d(down.permute(1, 0, 2, 3), up).permute(1, 0, 2, 3)
else:
    # plain linear LoRA
    updown = up @ down

updown *= scale
```
Ah, scale the updowns before taking the norm; looks good to me.
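As a hypothetical continuation of the snippet above, and a partial answer to the earlier question about rescaling a pre-trained LoRA, the norm step might look like this (folding the correction into lora_up only):

```python
# hypothetical continuation: clamp this key's norm in a saved LoRA state_dict
max_norm = 1.0
norm = updown.norm()  # L2 norm of the scaled up @ down product
if norm > max_norm:
    ratio = max_norm / norm
    # scaling one factor by the full ratio scales the whole product by that ratio
    state_dict[upkeys[i]] = (up * ratio).to(state_dict[upkeys[i]].device,
                                            dtype=state_dict[upkeys[i]].dtype)
```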
Thank you! I'm working on max norm and dropout on the dev branch.
I see rank and module dropout; very nice! Looking forward to exploring the new hyperparameter space this creates.
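Roughly, and only as an illustration under assumed names rather than the dev-branch code, the two kinds of dropout might look like this:

```python
import torch

def lora_delta(x, lora_down, lora_up, scale,
               rank_dropout=0.0, module_dropout=0.0, training=True):
    # module dropout: with some probability, skip this LoRA module entirely for the step
    if training and module_dropout > 0 and torch.rand(1).item() < module_dropout:
        return 0.0  # the caller adds this delta to the frozen module's output
    h = lora_down(x)
    # rank dropout: zero out individual rank channels and rescale the survivors
    if training and rank_dropout > 0:
        mask = (torch.rand(h.shape[-1], device=h.device) > rank_dropout).to(h.dtype)
        h = h * mask / (1.0 - rank_dropout)
    return lora_up(h) * scale
```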
Is it correct to say that max norm is similar to the alpha setting, but at a per-key level? If so, how does it interact with the alpha setting?
> Dropout requires the Cutlass kernel for GPUs with Capability <8 (A100, 4090 etc) this has been tested on a Colab T4 with xformers 0.0.19, no idea minimum requirements.

Does this mean that if I am using an RTX 3090, I cannot use dropout? That is depressing for me...
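As an aside, one way to check which side of that limitation a given GPU falls on is to query its compute capability (a hypothetical check, not something the script does for you):

```python
import torch

major, minor = torch.cuda.get_device_capability()  # e.g. (8, 6) for an RTX 3090, (7, 5) for a T4
print(f"compute capability: {major}.{minor}")
```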
Where does this recommendation come from? The linked paper suggests a typical range of 3-4 and Keras has a default of 2.
@feffy380 Limited empirical testing on SD 1.5. I would call it more of a starting value for grid search than a suggested value: too low and all of your keys are constrained, too high and it has no effect.
In my experience 1 is the best setting, but most importantly I can tell the LoRA is overtrained when it starts scaling keys! So I then adjust the alpha in such a way that max_key_norm hits the ceiling (which is 1 in this case) but doesn't push further and cause key scaling. I use cosine with restarts scheduling, and it takes a few attempts to choose the right alpha so that max_key_norm stops growing roughly when the learning rate drops low. But then the LoRA comes out absolutely perfect; I've never had such great results before! Very flexible and yet it perfectly captures the subject.

If I keep it running with multiple keys getting scaled, the results are not as burned as without max norm regularization, but it's still pretty obvious the model is overtrained: it captures the noise from the original images, doesn't follow the prompt as well, and has other defects. Still better than without this regularization, but it's even better if it hits the sweet spot.

I now think this metric is far more useful than loss, which is useless for LoRAs anyway (too statistically insignificant; it basically shows the ability to predict one noise step on one random image from the dataset). But max_key_norm in TensorBoard directly shows how well trained the model is. Maybe I'm overselling it, but I've really only trained on two datasets with it, and I've never had better results in the months I've been doing this.
@rkfg If you're adjusting settings so it never scales keys, max norm regularization isn't doing anything, since scaling keys is its whole thing.
@feffy380 He's using it as a gauge, which can also be done by setting a higher norm, and I believe TensorBoard should be collecting max norm data, but it's been a while since I set it up. (edit: I reread his message and yes, max norm is recorded if any max norm value is set) Being aware of the max norm is just as valid as forcing it down via my regularization, perhaps even more so.
@rkfg There's some interesting discussion about loss over here if you're interested in the nitty gritty: #294 (reply in thread)
@AI-Casanova Yes, I use it exactly as a progress indicator and it works wonderfully! Thanks for the link, I'll give it a read.
I was talking only about |
It's cosine_with_restarts training with AdamW8bit, and here the scaling always went up. Should I find another method?
I don't think the scheduler matters much; it's about your learning rate and Network Alpha, which multiplies the LR by alpha/rank (so if your rank is 128 and alpha is 1, your LR would be 128 times less than specified). Again, my observation doesn't apply to all subjects; some come out undertrained and I need to raise the scaling ceiling to 2 or even 3.
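For a concrete sense of the numbers in that comment (a back-of-the-envelope illustration, not script code):

```python
rank, alpha, lr = 128, 1.0, 1e-4
scale = alpha / rank       # 0.0078125: the LoRA delta is scaled down 128x
print(scale, lr * scale)   # so, by the reasoning above, a 1e-4 LR acts more like ~7.8e-7
```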
Try below 1 then; alpha is fractional. This all isn't a rule or a guide, just my observation. Since we don't have any meaningful indicators of training progress, this one turned out to work best. Loss is meaningless for LoRAs: it shows almost the same values no matter what parameters you set, and if you change the seed those values also change unpredictably, so it's impossible to tell on which epoch you under- or overtrained. So this idea could be a total coincidence on my part, but this parameter at least depends on the overall learning progress: it keeps growing even after the scheduler restarts, and it reflects the important values (LR and alpha at least).
Yesterday I looked at normalization, but it seems to work fine for me only with the cosine scheduler; with the others it generates huge losses for some reason. With cosine, though, it gives nice results under the AdamW, Prodigy, or Adafactor optimizers.
I have trained several LoRAs in the last few weeks and have noticed that a stable, good-quality, and yet flexible LoRA comes from Prodigy, the cosine scheduler, and keeping max_norm/keys_scaled from going above 2 (sometimes even 2 is strong) when training a person. For styles I usually leave it stronger, up to 6. If max_norm/max_key_norm doesn't reach 1, it will be under-trained. With Prodigy, this is best controlled by changing the d_coef value.
I'd like to just interject for a moment: why is it suggested to modify alpha to reduce the learning rate instead of directly reducing the rate itself? Are there more factors at play? In fact, I don't understand the point of this alpha parameter at all if all it does is linearly reduce the learning rate.
@AI-Casanova do you know if these dropout regularization techniques would work well on Flux?
This PR adds Dropout and Max Norm Regularization [Paper] to `train_network.py`.
Dropout randomly removes some weights/neurons from the calculation on both the forward and backward passes, effectively training many neural nets in succession. This encourages the LoRA to diversify its training instead of only picking a few weights to continuously update, hopefully reducing overtraining.
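For reference, the approach in this PR passes a dropout probability into xformers' memory-efficient attention; a minimal sketch of that call (shapes and values assumed, requires a CUDA GPU with xformers installed):

```python
import torch
import xformers.ops

# (batch, seq_len, heads, head_dim) half-precision tensors on GPU
q = torch.randn(1, 77, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = xformers.ops.memory_efficient_attention(q, k, v, p=0.1)  # p = attention dropout probability
```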
Max Norm Regularization calculates the L2 norm of the weights at each key and, if it exceeds the cutoff, scales the entire key by a factor to bring it back in line (mentioned in section 5.1 of the paper). This works because the relationships between weights in a layer seem to matter more than their total magnitude.
When enabled, this adds logging for TensorBoard and reports the average norm value and the number of keys scaled each step in the progress bar.
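A rough sketch of that per-key constraint (linear case only, with made-up function and argument names; not the merged implementation):

```python
import torch

@torch.no_grad()
def apply_max_norm_sketch(lora_up_w, lora_down_w, scale, max_norm=1.0):
    # effective weight contributed by this key
    updown = (lora_up_w @ lora_down_w) * scale
    norm = updown.norm()                      # L2 norm over the whole key
    ratio = (max_norm / norm).clamp(max=1.0)
    scaled = bool(ratio < 1.0)
    if scaled:
        # split the correction so the up @ down product is scaled by `ratio`
        lora_up_w.mul_(ratio ** 0.5)
        lora_down_w.mul_(ratio ** 0.5)
    return norm.item(), scaled
```

The mean of these norms and the count of keys scaled each step would be the values surfaced in the progress bar and TensorBoard.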
Either option can be used independently:
Example of training with dropout [0.5, 0.25, 0.1, 0.05, 0] and Max Norm 1, with all other settings held deterministic.
Notes for @kohya-ss
Dropout requires xformers, and I didn't know how you wanted to handle the assertion for that.
Dropout requires the Cutlass kernel for GPUs with compute capability <8 (the A100, 4090, etc. are at 8 or above); this has been tested on a Colab T4 with xformers 0.0.19, no idea of the minimum requirements.
I believe I passed dropout in a way that won't interfere with the other trainers (`dropout=None` in the function call), but this should be checked.
Also, I'm currently scaling lora_up and lora_down by `ratio**0.5`, since a scalar commutes through a matrix multiplication (i.e. `matmul(r*A, B) = matmul(A, r*B) = matmul(sqrt(r)*A, sqrt(r)*B)`); I will do further testing to confirm whether to keep it this way or only multiply up or down by the full ratio.
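A quick numerical check of that identity, assuming plain 2-D tensors (just an illustration, not part of the PR):

```python
import torch

A, B, r = torch.randn(4, 8), torch.randn(8, 4), 0.3
ref = r * (A @ B)
assert torch.allclose((r * A) @ B, ref, atol=1e-6)                  # scale the left factor
assert torch.allclose(A @ (r * B), ref, atol=1e-6)                  # scale the right factor
assert torch.allclose((r**0.5 * A) @ (r**0.5 * B), ref, atol=1e-6)  # split the scale across both
```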