Dropout and Max Norm Regularization for LoRA training #545

Merged
merged 14 commits into from
Jun 1, 2023

AI-Casanova
Contributor

This PR adds Dropout and Max Norm Regularization [Paper] to train_network.py

Dropout randomly removes some weights/neurons from calculation on both the forward and backward passes, effectively training many neural nets successively like so:
[image]
This encourages the LoRA to diversify its training, instead of only picking a few weights to continuously update, hopefully reducing overtraining.

Max Norm Regularization calculates the L2 norm of the weights at each key and, if it exceeds the cutoff, scales the entire key by a factor that brings it back in line (mentioned in section 5.1 of the paper). This works because the relationships between weights in a layer seem to be more important than their total magnitude.
When enabled, it adds logging for TensorBoard, and adds the average norm value and the number of keys scaled at each step to the progress bar.
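
A minimal sketch of the key-wise rescaling described above, written in plain Python so it runs without torch. It assumes a linear (2-D) key, takes the norm of the up @ down product, and splits the correction evenly between the two matrices. The helper names (`matmul`, `frobenius_norm`, `apply_max_norm`) and the toy values are illustrative, not the code in this PR.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frobenius_norm(m):
    """L2 (Frobenius) norm over all entries of a matrix."""
    return math.sqrt(sum(v * v for row in m for v in row))

def apply_max_norm(up, down, max_norm):
    """Rescale one LoRA key so that the norm of up @ down is at most max_norm.

    Returns (up, down, norm_after, was_scaled).
    """
    norm = frobenius_norm(matmul(up, down))
    if norm <= max_norm:
        return up, down, norm, False
    ratio = max_norm / norm
    root = math.sqrt(ratio)  # split the factor evenly between up and down
    up = [[v * root for v in row] for row in up]
    down = [[v * root for v in row] for row in down]
    return up, down, frobenius_norm(matmul(up, down)), True

# Hypothetical rank-1 key: up is 2x1, down is 1x2.
up = [[3.0], [4.0]]
down = [[1.0, 0.0]]
new_up, new_down, norm_after, was_scaled = apply_max_norm(up, down, max_norm=1.0)
print(was_scaled, round(norm_after, 6))
```

Every entry of the key is multiplied by the same factor, so the ratios between weights are preserved and only the overall magnitude changes.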

Either option can be used independently:

  • Dropout suggested setting >0.3
  • Max Norm suggested setting = 1 (You can also set it high enough to never trigger, e.g. 10, then watch TensorBoard to see where a good cutoff might be)

Example of training with dropout [0.5, 0.25, 0.10, 0.05, 0] and Max Norm 1, with all other settings deterministic:
[image: dropout]

Notes for @kohya-ss
Dropout requires xformers, and I didn't know how you wanted to do the assertion for that.
Dropout requires the Cutlass kernel on GPUs with compute capability below 8 (i.e. older than the A100, 4090, etc.). This has been tested on a Colab T4 with xformers 0.0.19; I don't know the minimum requirements.
I believe I passed dropout in a way that won't interfere with the other trainers (dropout=None in the function call), but this should be checked.
Also, I'm currently scaling both lora_up and lora_down by ratio**0.5, since a scalar factor commutes with matrix multiplication (i.e. matmul(r*A, B) = matmul(A, r*B) = matmul(sqrt(r)*A, sqrt(r)*B)). I will do further testing to confirm whether to keep it this way or to multiply only up or only down by the full ratio.
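
The scaling identity in the last note can be checked numerically. The sketch below uses plain Python lists; `A` and `B` are made-up stand-ins for lora_up and lora_down, and `r` is an arbitrary ratio.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scale(m, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[v * s for v in row] for row in m]

def close(m1, m2, tol=1e-9):
    """True when two equally shaped matrices agree entry by entry within tol."""
    return all(abs(x - y) <= tol for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # stand-in for lora_up (3x2)
B = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]   # stand-in for lora_down (2x3)
r = 0.36

target = scale(matmul(A, B), r)                          # r * (A @ B)
full_on_up = matmul(scale(A, r), B)                      # (r*A) @ B
full_on_down = matmul(A, scale(B, r))                    # A @ (r*B)
split = matmul(scale(A, math.sqrt(r)), scale(B, math.sqrt(r)))

print(close(full_on_up, target), close(full_on_down, target), close(split, target))
```

All three variants give the same product, so the choice only affects how the correction is distributed between the two stored matrices.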

@kohya-ss
Owner

Thank you for your pull request! It is quite interesting.

I seldom employ constraints for training, but I think your proposed Max Norm Regularization seems effective. Typically, constraints are applied to individual layers, but in your proposed code, the constraints are applied to the calculated values of the up and down weights of LoRA, which is interesting. I think this will work.

About the dropout, I am working on implementing it as well. However, I am using a different method.

In your proposal, it seems like you have specified the dropout argument for the attention in xformers. I have a little concern about this part.

Because if we are applying dropout in the attention mechanism, I think the dropout is being applied to the calculation results that include the original weights of Q/K/V after LoRA is applied. As a result, if we perform dropout at this point, won't LoRA attempt to learn the weights of the original Q/K/V linear layers that were dropped out as well? This means that the effects of LoRA may be more pronounced during inference.

Personally, I think it might be more common to apply dropout in the manner of x = lora_up(dropout(lora_down(x))).

Looking at the xformers source code, GPUs that can apply dropout seem to be limited as you've written. Therefore, if we're going to apply dropout in this way, it might be better to explicitly state that the conditions under which this can be used are limited.

On my end, the implementation I'm working on involves applying dropout after lora_down, as mentioned before.

I would appreciate it if you could confirm the dropout behavior described above.

@AI-Casanova
Contributor Author

You may be correct on the issue of dropout. The main thrust of this PR was originally Max Norm, as I hadn't yet figured out that Cutlass kernel supported other GPUs. I will look into your suggestions there.

I'm fairly confident that Max Norm will work as expected, as I've been experimenting in post-processing LoRA and full merged models, and have found that key-wise rescaling can help reduce the effects of overtraining on both.

This publicly available LoRA originally had a max key norm of about 6. When rescaled to 1 (center columns), the effect was definitely reduced, but its interaction with other LoRAs (not shown) improved greatly. When applied during training, a norm constraint doesn't seem as drastic, as other weights are encouraged to be explored.
[image: norm]

As a side note, in full models, Add Difference merges seem to blow up norms, leading to burning at lower and lower CFG, another motivation for keeping a LoRA layer from being overpowered.

@AI-Casanova
Contributor Author

You are absolutely correct

return self.org_forward(x) + self.lora_up(torch.nn.functional.dropout(self.lora_down(x),p=0.50)) * self.multiplier * self.scale

is absolutely the way to go.
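
For readers without torch at hand, here is a plain-Python sketch of where the dropout sits in that forward pass. In training mode, `torch.nn.functional.dropout` zeroes each element with probability p and scales the survivors by 1/(1-p); the `dropout` helper below imitates that. All names and toy weights are hypothetical, and the `training` flag is an addition of this sketch: the one-liner above relies on the function's default of `training=True`, so a real module would likely pass its own training state.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each element with probability p, rescale survivors."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

def lora_forward(x, org_weight, down, up, p, multiplier=1.0, scale=1.0,
                 rng=None, training=True):
    """org(x) + up(dropout(down(x))) * multiplier * scale, for one input vector."""
    rng = rng or random.Random()
    base = matvec(org_weight, x)                          # frozen original layer, never dropped
    hidden = dropout(matvec(down, x), p, rng, training)   # dropout on the rank-sized activations
    delta = matvec(up, hidden)
    return [b + d * multiplier * scale for b, d in zip(base, delta)]

x = [1.0, 2.0]
org_weight = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 1.0]]      # rank 1: 1x2
up = [[0.5], [-0.5]]     # 2x1

print(lora_forward(x, org_weight, down, up, p=0.5, training=False))
print(lora_forward(x, org_weight, down, up, p=0.5, rng=random.Random(0)))
```

Because only the LoRA branch passes through dropout, the output of the original layer is untouched, which addresses the concern about LoRA learning to compensate for dropped original Q/K/V weights.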

[image: dropout2]

p=[0.50,0.25,0.10,0.05]
alt is your method
Same settings/seed for everything else.

@kohya-ss
Owner

Max Norm looks great! I have been annoyed by over-application when applying multiple LoRAs, so I would be happy if it improves that, even if only slightly.

Also, thank you very much for the examination of the dropout! I had not verified it in detail myself, so it is very helpful.
I am considering how to apply dropout to the network carefully (I want to take into account the impact on LyCORIS, etc.), and I am considering specifying it in --network_args.
So I may change the dropout part after the merging.

I will check and merge as soon as I have time!

@IdiotSandwichTheThird

This Publicly available LoRA originally had a max key norm of about 6, when rescaled to 1 (center columns) the effect was definitely reduced, but its interaction with other LoRA (not shown) was improved greatly.

Do you have any code available for rescaling max key norms on pre-trained lora? It sounds pretty interesting.

@XZiar

XZiar commented May 31, 2023

I remember that when applying LoRA, the weights are scaled by alpha/dim, so maybe the max norm should also take this into account?

@kohya-ss kohya-ss changed the base branch from main to dev June 1, 2023 03:42
@kohya-ss kohya-ss merged commit 9c72371 into kohya-ss:dev Jun 1, 2023
@kohya-ss
Owner

kohya-ss commented Jun 1, 2023

I remember that when applying LoRA, the weights are scaled by alpha/dim, so maybe the max norm should also take this into account?

I think this makes sense.

@AI-Casanova I would like to add this modification, please let me know if you have any suggestions.

@kohya-ss
Owner

kohya-ss commented Jun 1, 2023

I'd like to modify like this:

        down = state_dict[downkeys[i]].to(device)
        up = state_dict[upkeys[i]].to(device)
        alpha = state_dict[alphakeys[i]].to(device)
        dim = down.shape[0]
        scale = alpha / dim

        if up.shape[2:] == (1, 1) and down.shape[2:] == (1, 1):
            updown = (up.squeeze(2).squeeze(2) @ down.squeeze(2).squeeze(2)).unsqueeze(2).unsqueeze(3)
        elif up.shape[2:] == (3, 3) or down.shape[2:] == (3, 3):
            updown = torch.nn.functional.conv2d(down.permute(1, 0, 2, 3), up).permute(1, 0, 2, 3)
        else:
            updown = up @ down

        updown *= scale
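
To illustrate why the order matters, here is a small plain-Python sketch (linear keys only, helper names are illustrative) that computes the norm of a key after the alpha/dim scale has been applied. The same raw weights can sit above or below a cutoff depending on alpha.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def frobenius_norm(m):
    """L2 (Frobenius) norm over all entries of a matrix."""
    return math.sqrt(sum(v * v for row in m for v in row))

def effective_key_norm(up, down, alpha):
    """Norm of (up @ down) * alpha / dim, i.e. the delta that is actually applied."""
    dim = len(down)              # rank: number of rows of down, like down.shape[0]
    scale = alpha / dim
    updown = matmul(up, down)
    return frobenius_norm([[v * scale for v in row] for row in updown])

# Toy rank-2 key whose unscaled product has norm 5.
up = [[3.0, 0.0], [0.0, 4.0]]
down = [[1.0, 0.0], [0.0, 1.0]]

for alpha in (1.0, 2.0):
    print(alpha, effective_key_norm(up, down, alpha))
```

With a cutoff of 3, this toy key would be left alone at alpha=1 and scaled down at alpha=2, even though the stored up and down matrices are identical.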

@AI-Casanova
Contributor Author

Ah, scale the updowns before the norm. Looks good to me.

@kohya-ss
Owner

kohya-ss commented Jun 1, 2023

Thank you! I'm working on max norm and dropout on dev branch.

@AI-Casanova
Contributor Author

I see rank and module dropout, very nice! Looking forward to exploring the new hyperparameter space this creates.

@TingTingin
Contributor

TingTingin commented Jun 2, 2023

Is it correct to say that max norm is similar to the alpha setting, but on a per-key level? If so, how does it interact with the alpha setting?

@AI-Casanova AI-Casanova deleted the max_norm branch June 21, 2023 02:04
@DuroCuri

DuroCuri commented Aug 1, 2023

Dropout requires the Cutlass kernel for GPUs with Capability <8 (A100, 4090 etc) this has been tested on a Colab T4 with xformers 0.0.19, no idea minimum requirements.

Does this mean that if I am using an RTX 3090, I cannot use dropout? That would be depressing for me...

@feffy380
Contributor

  • Max Norm suggested setting = 1

Where does this recommendation come from? The linked paper suggests a typical range of 3-4 and Keras has a default of 2.

@AI-Casanova
Contributor Author

@feffy380 limited empirical testing on SD 1.5.

I would call it more of a starting value for grid search than a suggested value.

Too low and all of your keys are constrained, too high and it has no effect.

@rkfg

rkfg commented Sep 28, 2023

In my experience 1 is the best setting, but most importantly I can tell the lora is overtrained when it starts scaling keys! So I then adjust the alpha in such a way that max_key_norm hits the ceiling (which is 1 in this case) but doesn't push further causing key scaling. I use cosine with restarts scheduling and it takes a few attempts to choose the right alpha so that max_key_norm stops growing roughly when the learning rate drops low. But then the lora comes out absolutely perfect, I've never had such great results before! Very flexible and yet perfectly captures the subject.

If I keep it running with multiple keys getting scaled, the results are not as burned as without max norm regularization, but it's still pretty obvious that the model is overtrained. It captures the noise from the original images, doesn't follow the prompt as well, and has other defects. It's definitely better than without this regularization, but it's even better if it hits the sweet spot.

I now think this metric is way better than loss which is useless for loras anyway (too statistically insignificant, basically shows the ability to predict one noise step on one random image from the dataset, it's nothing). But max_key_norm in tensorboard directly shows how well trained the model is. Maybe I'm overselling it but really I just trained on two datasets and I've never had better results before while I was doing it for months.

@feffy380
Contributor

@rkfg If you're adjusting settings so it never scales keys, max norm regularization isn't doing anything since scaling keys is its whole thing

@AI-Casanova
Contributor Author

AI-Casanova commented Sep 28, 2023

@feffy380 he's using it as a gauge, which also can be done by setting a higher norm, and I believe tensorboard should be collecting max norm data, but it's been a while since I set it up. (edit: I reread his message and yes, max norm is recorded if any max norm value is set)

Being aware of the max norm is just as valid, if not more so, than forcing it down via my regularization.
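
A rough sketch of the gauge idea in plain Python: given per-key norms, report the three values discussed in this thread for a chosen cutoff. The tag names are taken from the TensorBoard labels mentioned in later comments; the function itself, the input norms, and the assumption that norms are reported after clipping are illustrative, not the trainer's code.

```python
def key_norm_stats(key_norms, max_norm):
    """Summarize per-key norms as they would look after clipping at max_norm."""
    keys_scaled = sum(1 for n in key_norms if n > max_norm)
    clipped = [min(n, max_norm) for n in key_norms]   # a scaled key ends up at the cutoff
    return {
        "max_norm/keys_scaled": keys_scaled,
        "max_norm/average_key_norm": sum(clipped) / len(clipped),
        "max_norm/max_key_norm": max(clipped),
    }

norms = [0.4, 0.9, 1.7, 6.0]   # hypothetical per-key norms

print(key_norm_stats(norms, max_norm=1.0))    # regularizing: some keys are scaled
print(key_norm_stats(norms, max_norm=10.0))   # gauge only: cutoff too high to trigger
```

With the high cutoff nothing is scaled, so the reported maximum is the true largest key norm and can be watched without altering training.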

@AI-Casanova
Contributor Author

@rkfg there's some interesting discussion about loss over here if you're interested in the nitty gritty #294 (reply in thread)

@rkfg

rkfg commented Sep 28, 2023

@AI-Casanova yes, I use it exactly as a progress indicator and it works wonderfully! Thanks for the link, I'll give it a read.

@mykeehu

mykeehu commented Oct 22, 2023

In my experience 1 is the best setting, but most importantly I can tell the lora is overtrained when it starts scaling keys! So I then adjust the alpha in such a way that max_key_norm hits the ceiling (which is 1 in this case) but doesn't push further causing key scaling. I use cosine with restarts scheduling and it takes a few attempts to choose the right alpha so that max_key_norm stops growing roughly when the learning rate drops low. But then the lora comes out absolutely perfect, I've never had such great results before! Very flexible and yet perfectly captures the subject.

If I understand correctly, max_norm/average_key_norm is the value to look at in TensorBoard, so if it reaches 1, is the LoRA good?
[image]

@rkfg

rkfg commented Oct 22, 2023

I was talking only about max_key_norm, not average. In your case it hits 1 immediately and this feature starts scaling a lot of keys down, I try to keep those at minimum, around 1-3 tops (you get 100 and more). So while I only talked about my experience which isn't a dogma, try to lower the alpha value. If you hit 100 scaled keys this fast, divide the alpha by 100 for example or even more, so max_key_norm hits 1 at the very end of training.

@mykeehu

mykeehu commented Oct 23, 2023

It's cosine_with_restarts training with AdamW8bit, and here the scaling always went up. Should I find another method?

@rkfg

rkfg commented Oct 23, 2023

I don't think the scheduler matters much; it's about your learning rate and Network Alpha, which multiplies the LR by alpha/rank (so if your rank is 128 and alpha is 1, your LR would be 128 times less than specified). Again, my observation doesn't apply to all subjects; some come out undertrained and I need to raise the scaling ceiling to 2 or even 3.
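
The alpha/rank arithmetic from the comment above, as a tiny sketch. The factor is the same alpha/dim scale used earlier in this thread, which multiplies the LoRA branch; the comment treats its practical effect as a reduced learning rate. The function name and the listed combinations are only examples.

```python
def lora_scale(alpha, rank):
    """Scale factor applied to the LoRA branch: alpha / rank."""
    return alpha / rank

# Example combinations, including a fractional alpha.
for alpha, rank in [(128, 128), (1, 128), (1, 64), (0.5, 64)]:
    print(f"alpha={alpha}, rank={rank}: scale={lora_scale(alpha, rank)}")
```

Lowering alpha at a fixed rank shrinks this factor proportionally, which is why reducing alpha slows the growth of max_key_norm.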

@mykeehu

mykeehu commented Oct 23, 2023

At 64/1 I get this scaling and it still looks a bit strong.
[image]

@rkfg

rkfg commented Oct 23, 2023

Try below 1 then, alpha is fractional.

This all isn't a rule or a guide, just my observation. Since we don't have any meaningful indicators of the training progress, this one turned out to work best. Loss is meaningless for loras, it shows almost the same values no matter what parameters you set, and if you change the seed these values also change unpredictably and it's impossible to tell on which epoch you under/overtrained it. So this idea could be a total coincidence for me but this parameter at least depends on the total learning progress, it grows even after the scheduler restarts and it reflects the important values (LR, Alpha at least)

@mykeehu

mykeehu commented Oct 24, 2023

Yesterday I looked at normalization, but it seems to work fine for me only with cosine scheduler, with the others it generates huge losses for some reason. But with cosine it gives nice results under AdamW, Prodigy or Adafactor optimizer.

@mykeehu

mykeehu commented Nov 29, 2023

I have trained several Lora in the last weeks and I have noticed that a stable, good quality and yet flexible Lora is made with Prodigy, cosine scheduler and if max_norm/keys_scaled does not go above 2 (sometimes even 2 is strong) when training a person. For styles I usually leave it stronger, up to 6. If the max_norm/max_key_norm doesn't reach 1, it will be under-trained. For Prodigy, this is best done by changing the d_coef value.
So this is a big help in training a good Lora! Thank you!

@Seedmanc

Seedmanc commented Sep 16, 2024

I'd like to just interject for a moment, why is it suggested to modify alpha to reduce learning rate instead of directly reducing the rate itself? Are there more factors at play? In fact I don't understand the point of this alpha parameter at all if all it does is linearly reducing the learning rate.

(edit)
So I tested this approach and it doesn't work for text encoder overfitting. I intentionally made the TE LR too high, it's clearly noticeable on resulting images, but the amount of keys scaled is the lowest of all. When the same is done for Unet, the scaled keys are off the charts but the image isn't that damaged at all.

@WetOnTheWater

@AI-Casanova do you know if these dropout regularization techniques would work well on Flux?
