Dropout and Max Norm Regularization for LoRA training #545
Conversation
Thank you for your pull request! It is quite interesting. I seldom employ constraints for training, but I think your proposed Max Norm Regularization seems effective. Typically, constraints are applied to individual layers, but in your proposed code the constraint is applied to the computed product of LoRA's up and down weights, which is interesting. I think this will work.

About the dropout, I am working on implementing it as well, but with a different method. In your proposal, it seems you have specified the dropout argument for the attention in xformers, and I have a small concern about this part. If we apply dropout inside the attention mechanism, the dropout is applied to calculation results that include the original Q/K/V weights after LoRA is added. As a result, won't LoRA also attempt to learn the weights of the original Q/K/V linear layers that were dropped out? This could make the effect of LoRA more pronounced during inference. Personally, I think it might be more common to apply dropout to the LoRA path itself, i.e. after lora_down.

Looking at the xformers source code, the GPUs that can apply dropout seem to be limited, as you've written. Therefore, if we're going to apply dropout in this way, it might be better to explicitly state that the conditions under which it can be used are limited. On my end, the implementation I'm working on applies dropout after lora_down, as mentioned above.

I would appreciate it if you could confirm the points about dropout above.
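For concreteness, a minimal sketch of the "dropout after lora_down" idea; the class and argument names here are illustrative, not the repository's actual LoRA module:

```python
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    """Hypothetical LoRA wrapper: dropout touches only the LoRA branch."""
    def __init__(self, org_module: nn.Linear, rank: int = 4, alpha: float = 1.0, dropout: float = 0.0):
        super().__init__()
        self.org_module = org_module  # frozen original linear layer
        self.lora_down = nn.Linear(org_module.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, org_module.out_features, bias=False)
        self.dropout = nn.Dropout(dropout)  # applied between lora_down and lora_up
        self.scale = alpha / rank

    def forward(self, x):
        # original output + LoRA delta; the frozen Q/K/V weights are never dropped
        return self.org_module(x) + self.lora_up(self.dropout(self.lora_down(x))) * self.scale
```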
You may be correct on the issue of dropout. The main thrust of this PR was originally Max Norm, as I hadn't yet figured out that the Cutlass kernel supported other GPUs. I will look into your suggestions there.

I'm fairly confident that Max Norm will work as expected, as I've been experimenting with post-processing LoRAs and fully merged models, and have found that key-wise rescaling can help reduce the effects of overtraining on both. This publicly available LoRA originally had a max key norm of about 6; when rescaled to 1 (center columns) the effect was definitely reduced, but its interaction with other LoRAs (not shown) was improved greatly. When applied during training, a norm constraint doesn't seem as drastic, as other weights are encouraged to be explored.

As a side note, in full models, Add Difference merges seem to blow up norms, leading to burning at lower and lower CFG, which is another motivation for keeping a LoRA layer from being overpowered.
Max Norm looks great! I have been annoyed by over-application when applying multiple LoRAs, so I would be happy if it improves that, even if only slightly. Also, thank you very much for the examination of the dropout! I had not verified it in detail myself, so it is very helpful. I will check and merge as soon as I have time!
Do you have any code available for rescaling max key norms on a pre-trained LoRA? It sounds pretty interesting.
I remember that when applying a LoRA, the weights will be scaled with `alpha / dim`.
I think this makes sense. @AI-Casanova I would like to add this modification; please let me know if you have any suggestions.
I'd like to modify it like this:

```python
down = state_dict[downkeys[i]].to(device)
up = state_dict[upkeys[i]].to(device)
alpha = state_dict[alphakeys[i]].to(device)
dim = down.shape[0]
scale = alpha / dim

# compute the effective up @ down weight for this key
if up.shape[2:] == (1, 1) and down.shape[2:] == (1, 1):
    # 1x1 conv LoRA: treat as a matmul, then restore the conv dims
    updown = (up.squeeze(2).squeeze(2) @ down.squeeze(2).squeeze(2)).unsqueeze(2).unsqueeze(3)
elif up.shape[2:] == (3, 3) or down.shape[2:] == (3, 3):
    # 3x3 conv LoRA: compose the two convolutions
    updown = torch.nn.functional.conv2d(down.permute(1, 0, 2, 3), up).permute(1, 0, 2, 3)
else:
    # plain linear LoRA
    updown = up @ down

updown *= scale
```
Ah, scale the updowns before taking the norm; looks good to me.
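As a hypothetical continuation of the snippet above, and a partial answer to the earlier question about rescaling a pre-trained LoRA, the norm step might look like this (folding the correction into lora_up only):

```python
# hypothetical continuation: clamp this key's norm in a saved LoRA state_dict
max_norm = 1.0
norm = updown.norm()  # L2 norm of the scaled up @ down product
if norm > max_norm:
    ratio = max_norm / norm
    # scaling one factor by the full ratio scales the whole product by that ratio
    state_dict[upkeys[i]] = (up * ratio).to(state_dict[upkeys[i]].device,
                                            dtype=state_dict[upkeys[i]].dtype)
```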
Thank you! I'm working on max norm and dropout on the dev branch.
I see rank and module dropout; very nice! Looking forward to exploring the new hyperparameter space this creates.
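Roughly, and only as an illustration under assumed names rather than the dev-branch code, the two kinds of dropout might look like this:

```python
import torch

def lora_delta(x, lora_down, lora_up, scale,
               rank_dropout=0.0, module_dropout=0.0, training=True):
    # module dropout: with some probability, skip this LoRA module entirely for the step
    if training and module_dropout > 0 and torch.rand(1).item() < module_dropout:
        return 0.0  # the caller adds this delta to the frozen module's output
    h = lora_down(x)
    # rank dropout: zero out individual rank channels and rescale the survivors
    if training and rank_dropout > 0:
        mask = (torch.rand(h.shape[-1], device=h.device) > rank_dropout).to(h.dtype)
        h = h * mask / (1.0 - rank_dropout)
    return lora_up(h) * scale
```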
Is it correct to say that max norm is similar to the alpha setting, but at a per-key level? If so, how does it interact with the alpha setting?
> Dropout requires the Cutlass kernel for GPUs with Capability <8 (A100, 4090 etc) this has been tested on a Colab T4 with xformers 0.0.19, no idea minimum requirements.

Does this mean that if I am using an RTX 3090, I cannot use dropout? That is depressing for me...
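As an aside, one way to check which side of that limitation a given GPU falls on is to query its compute capability (a hypothetical check, not something the script does for you):

```python
import torch

major, minor = torch.cuda.get_device_capability()  # e.g. (8, 6) for an RTX 3090, (7, 5) for a T4
print(f"compute capability: {major}.{minor}")
```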
Where does this recommendation come from? The linked paper suggests a typical range of 3-4 and Keras has a default of 2.
@feffy380 Limited empirical testing on SD 1.5. I would call it more of a starting value for grid search than a suggested value: too low and all of your keys are constrained, too high and it has no effect.
In my experience 1 is the best setting, but most importantly I can tell the LoRA is overtrained when it starts scaling keys! So I then adjust the alpha in such a way that max_key_norm hits the ceiling (which is 1 in this case) but doesn't push further and cause key scaling. I use cosine with restarts scheduling, and it takes a few attempts to choose the right alpha so that max_key_norm stops growing roughly when the learning rate drops low. But then the LoRA comes out absolutely perfect; I've never had such great results before! Very flexible and yet it perfectly captures the subject.

If I keep it running with multiple keys getting scaled, the results are not as burned as without max norm regularization, but it's still pretty obvious the model is overtrained: it captures the noise from the original images, doesn't follow the prompt as well, and has other defects. Still better than without this regularization, but it's even better if it hits the sweet spot.

I now think this metric is far more useful than loss, which is useless for LoRAs anyway (too statistically insignificant; it basically shows the ability to predict one noise step on one random image from the dataset). But max_key_norm in TensorBoard directly shows how well trained the model is. Maybe I'm overselling it, but I've really only trained on two datasets with it, and I've never had better results in the months I've been doing this.
@rkfg If you're adjusting settings so it never scales keys, max norm regularization isn't doing anything, since scaling keys is its whole thing.
@feffy380 He's using it as a gauge, which can also be done by setting a higher norm, and I believe TensorBoard should be collecting max norm data, but it's been a while since I set it up. (edit: I reread his message and yes, max norm is recorded if any max norm value is set) Being aware of the max norm is just as valid as forcing it down via my regularization, perhaps even more so.
@rkfg There's some interesting discussion about loss over here if you're interested in the nitty gritty: #294 (reply in thread)
@AI-Casanova Yes, I use it exactly as a progress indicator and it works wonderfully! Thanks for the link, I'll give it a read.
I was talking only about |
It's cosine_with_restarts training with AdamW8bit, and here the scaling always went up. Should I find another method?
I don't think the scheduler matters much; it's about your learning rate and Network Alpha, which multiplies the LR by alpha/rank (so if your rank is 128 and alpha is 1, your LR would be 128 times less than specified). Again, my observation doesn't apply to all subjects; some come out undertrained and I need to raise the scaling ceiling to 2 or even 3.
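For a concrete sense of the numbers in that comment (a back-of-the-envelope illustration, not script code):

```python
rank, alpha, lr = 128, 1.0, 1e-4
scale = alpha / rank       # 0.0078125: the LoRA delta is scaled down 128x
print(scale, lr * scale)   # so, by the reasoning above, a 1e-4 LR acts more like ~7.8e-7
```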
Try below 1 then; alpha is fractional. This all isn't a rule or a guide, just my observation. Since we don't have any meaningful indicators of training progress, this one turned out to work best. Loss is meaningless for LoRAs: it shows almost the same values no matter what parameters you set, and if you change the seed those values also change unpredictably, so it's impossible to tell on which epoch you under- or overtrained. So this idea could be a total coincidence on my part, but this parameter at least depends on the overall learning progress: it keeps growing even after the scheduler restarts, and it reflects the important values (LR and alpha at least).
Yesterday I looked at normalization, but it seems to work fine for me only with the cosine scheduler; with the others it generates huge losses for some reason. With cosine, though, it gives nice results under the AdamW, Prodigy, or Adafactor optimizers.
I have trained several LoRAs in the last few weeks and have noticed that a stable, good-quality, and yet flexible LoRA comes from Prodigy, the cosine scheduler, and keeping max_norm/keys_scaled from going above 2 (sometimes even 2 is strong) when training a person. For styles I usually leave it stronger, up to 6. If max_norm/max_key_norm doesn't reach 1, it will be under-trained. With Prodigy, this is best controlled by changing the d_coef value.
I'd like to just interject for a moment: why is it suggested to modify alpha to reduce the learning rate instead of directly reducing the rate itself? Are there more factors at play? In fact, I don't understand the point of this alpha parameter at all if all it does is linearly reduce the learning rate.
@AI-Casanova do you know if these dropout regularization techniques would work well on Flux?
This PR adds Dropout and Max Norm Regularization [Paper] to `train_network.py`.
Dropout randomly removes some weights/neurons from the calculation on both the forward and backward passes, effectively training many neural nets in succession. This encourages the LoRA to diversify its training instead of only picking a few weights to continuously update, hopefully reducing overtraining.
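For reference, the approach in this PR passes a dropout probability into xformers' memory-efficient attention; a minimal sketch of that call (shapes and values assumed, requires a CUDA GPU with xformers installed):

```python
import torch
import xformers.ops

# (batch, seq_len, heads, head_dim) half-precision tensors on GPU
q = torch.randn(1, 77, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = xformers.ops.memory_efficient_attention(q, k, v, p=0.1)  # p = attention dropout probability
```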
Max Norm Regularization calculates the L2 norm of the weights at each key and, if it exceeds the cutoff, scales the entire key by a factor to bring it back in line (mentioned in section 5.1 of the paper). This works because the relationships between weights in a layer seem to matter more than their total magnitude.
When enabled, this adds logging for TensorBoard and reports the average norm value and the number of keys scaled each step in the progress bar.
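A rough sketch of that per-key constraint (linear case only, with made-up function and argument names; not the merged implementation):

```python
import torch

@torch.no_grad()
def apply_max_norm_sketch(lora_up_w, lora_down_w, scale, max_norm=1.0):
    # effective weight contributed by this key
    updown = (lora_up_w @ lora_down_w) * scale
    norm = updown.norm()                      # L2 norm over the whole key
    ratio = (max_norm / norm).clamp(max=1.0)
    scaled = bool(ratio < 1.0)
    if scaled:
        # split the correction so the up @ down product is scaled by `ratio`
        lora_up_w.mul_(ratio ** 0.5)
        lora_down_w.mul_(ratio ** 0.5)
    return norm.item(), scaled
```

The mean of these norms and the count of keys scaled each step would be the values surfaced in the progress bar and TensorBoard.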
Either option can be used independently:
Example of training with dropout [0.5, 0.25, 0.1, 0.05, 0] and Max Norm 1, with all other settings held deterministic.
Notes for @kohya-ss
Dropout requires xformers, and I didn't know how you wanted to handle the assertion for that.
Dropout requires the Cutlass kernel for GPUs with compute capability <8 (the A100, 4090, etc. are at 8 or above); this has been tested on a Colab T4 with xformers 0.0.19, no idea of the minimum requirements.
I believe I passed dropout in a way that won't interfere with the other trainers (`dropout=None` in the function call), but this should be checked.
Also, I'm currently scaling lora_up and lora_down by `ratio**0.5`, since a scalar commutes through a matrix multiplication (i.e. `matmul(r*A, B) = matmul(A, r*B) = matmul(sqrt(r)*A, sqrt(r)*B)`); I will do further testing to confirm whether to keep it this way or only multiply up or down by the full ratio.
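A quick numerical check of that identity, assuming plain 2-D tensors (just an illustration, not part of the PR):

```python
import torch

A, B, r = torch.randn(4, 8), torch.randn(8, 4), 0.3
ref = r * (A @ B)
assert torch.allclose((r * A) @ B, ref, atol=1e-6)                  # scale the left factor
assert torch.allclose(A @ (r * B), ref, atol=1e-6)                  # scale the right factor
assert torch.allclose((r**0.5 * A) @ (r**0.5 * B), ref, atol=1e-6)  # split the scale across both
```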