Add an option 'ALL' to include all linear layers as target modules #1295
Conversation
Thanks a lot for this addition! Great tests!

One thing here is that this API won't generalize to all models, as some architectures use a different name for the `lm_head`. I've proposed a suggestion below: copy over the method `get_keys_to_not_convert` into PEFT and use that method here.

GPTNeoX uses `embed_out` for the name of the LM head: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py#L964 - you can use this model for testing: https://huggingface.co/hf-internal-testing/tiny-random-GPTNeoXForCausalLM
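(For illustration, a minimal sketch of how the head module's name could be looked up programmatically rather than hard-coded as "lm_head"; this uses `get_output_embeddings()` and is not the `get_keys_to_not_convert` helper itself.)

```python
# Sketch: find the name of the output (LM head) module instead of assuming
# it is called "lm_head". For GPTNeoX this is "embed_out".
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-GPTNeoXForCausalLM"
)
output_embedding = model.get_output_embeddings()
head_names = [name for name, module in model.named_modules() if module is output_embedding]
print(head_names)  # ['embed_out']
```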
@younesbelkada Do take another look! I've pushed the updates.
Very nice work, thanks very much for working on this @SumanthRH !
Thanks for this PR. I haven't done a detailed review, but have some comments:

- How useful do you think the feature is? Do you have use cases where all linear layers have to be replaced? Note that we could already do something like `target_modules = [name for name, module in model.named_modules() if isinstance(module, linear_classes)]`.
- As a user, I could be confused by the `"ALL"` option, since it does not target all possible layers, only all linear layers. I assume you didn't want to call the option `"Linear"` because that could be ambiguous if there is a module with the name `"Linear"`. But we could use something like `"<Linear>"` or so, since that would not be a legal name for a module, so it's unambiguous. WDYT?
- IIUC, this method works by modifying the peft config and overriding the value of `target_modules`. I wonder if that is the best way. As a user, if I pass a certain config, I would generally assume that it stays as it is. Instead of overriding the config, would it be better to leave it alone and then, when we check whether we should add the LoRA adapters to a certain layer, change the logic to match on the type instead of the name?
@BenjaminBossan thanks for the review!
I think this is a popular use case, because QLoRA-style training is popular and effective. This configuration, as of 2023 at least, is the best one for LoRA models to match full finetuning: the QLoRA paper found that adding LoRA to all the linear layers is what closes the gap with full finetuning. As such, there are still some experiments to be made to see if we can push this further (for example, adding LoRA to the last output layer as well), but I feel PEFT should support this since it's the best LoRA setting right now. In terms of modifications, it isn't as simple as the one-liner you provided, because 1) the user needs to provide all linear classes (including the quantized bitsandbytes variants) and 2) the output layer has to be excluded.

Right as always :) I was only focused on language models with Linear layers, in which case "ALL" is a good name. I'm not a fan of mixed case or having an uncommon symbol like `<`, so I've changed it to "all_linear" now, hope that works!

Hmm, this is a good point. If the user passed in "all_linear", it would get secretly rewritten with a new list of target modules, and, for example, when they save the config/model, the saved `target_modules` would not be what they originally passed. (Also, this might end up being a behavior for others to replicate later on: leaving the config untouched and matching on layer type instead.)
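(For context, a hedged sketch of what "provide all linear classes" would mean in practice; the exact class list is an assumption, covering plain and quantized linear layers plus transformers' Conv1D, and assumes bitsandbytes is installed.)

```python
# Assumed set of "linear" classes a user would otherwise have to enumerate by hand.
import torch
import bitsandbytes as bnb
from transformers.pytorch_utils import Conv1D  # GPT-2 style linear layer

linear_classes = (torch.nn.Linear, Conv1D, bnb.nn.Linear8bitLt, bnb.nn.Linear4bit)
```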
> I think this is a popular use case, because QLoRA-style training is popular and effective. This configuration, as of 2023 at least, is the best one for LoRA models to match full finetuning.

Makes sense.
> In terms of modifications, it isn't as simple as the one-liner you provided

Yes, true, the snippet would rather have to be:

```
linear_classes = ...
target_modules = [
    name for name, module in model.named_modules()
    if isinstance(module, linear_classes) and name != <NAME-HEAD>
]
```
Still, I'm fine with adding this convenience function.
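(For concreteness, a runnable version of the sketch above; the tiny checkpoint, the class tuple, and the head name `embed_out` are illustrative assumptions rather than anything PEFT prescribes.)

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-GPTNeoXForCausalLM"
)

linear_classes = (torch.nn.Linear,)  # extend with quantized classes if needed
head_name = "embed_out"              # GPTNeoX's output layer; "lm_head" elsewhere

target_modules = [
    name
    for name, module in model.named_modules()
    if isinstance(module, linear_classes) and name != head_name
]
print(target_modules)  # e.g. ['gpt_neox.layers.0.attention.query_key_value', ...]
```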
> I'm not a fan of mixed case or having an uncommon symbol like `<`, so I've changed it to "all_linear" now, hope that works!

Not a fan of special chars either, but the issue is still that if a model has a module with the name "all_linear", this would be ambiguous; only special chars can prevent this. How about "all-linear"? Off the top of my head, I can't think of any potential conflicts here.
Regarding whether to modify the config or not:

> this change mirrors what `_prepare_adapter_config` does, which updates the peft_config directly if `target_modules` is None

Yes, there is precedent, but I don't like that precedent either :)
Let's think this design through. If we modify the config as proposed in this PR, these points come to mind:

- There is a bit of a surprise because the saved config is not the same as the one indicated by the user.
- The saved config transparently shows what layers were adapted.
- If we fix something in this logic (say, there is a bug with some layers matching incorrectly), a user who loads the config (with the explicit `target_modules`) will get different results from a user who uses a "fresh" config (with implicit `target_modules`). This can be bad (bug is not fixed) or good (reproducibility).
- It works directly for other tuner methods like IA³ because we don't need to touch `_create_and_replace`, which is different for each tuner method.
In sum, there are pros and cons for both. I'm fine with the current approach (but let's document it).
I agree with your statements on testing and checking for PreTrainedModel. The PR looks pretty good. I found some minor issues still, but hopefully those can be fixed quickly. Thanks for your continued work.
PS: Optionally, having an entry in the docs (for instance here) could make this feature easier to discover.
@BenjaminBossan thanks for the detailed reviews! Hopefully this can be merged now!
Following discussions in #1318, I've simplified the logic for bnb layers! Code is much cleaner now. The tests seem like overkill now, although I think the bnb tests can stay just to be sure.
Thanks so much @SumanthRH, the PR looks pretty good, nothing to add from my side. Before merging, let's wait for @pacman100 (he's OOO this week), because of the considerable overlap with his PR.
> The tests seem like overkill now, although I think the bnb tests can stay just to be sure.

They should stay for sure. This could help us in the future (e.g. if they decide that bnb layers no longer inherit from Linear, the tests would catch it).
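(A quick illustration of what such a test guards against: assuming bitsandbytes is available, the inheritance that isinstance-based discovery relies on can be checked directly.)

```python
import torch
import bitsandbytes as bnb

# Both quantized linear layers currently subclass torch.nn.Linear; if that ever
# changed, isinstance-based discovery of linear layers would silently miss them.
assert issubclass(bnb.nn.Linear8bitLt, torch.nn.Linear)
assert issubclass(bnb.nn.Linear4bit, torch.nn.Linear)
```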
Thank you @SumanthRH for adding the all-linear option to target the linear layers following the QLoRA recipe, very useful! Also, comprehensive tests! 🚀
Description

When the option to specify target_modules="all-linear" was introduced in PEFT (#1295), a restriction was added to only allow it for instances of PreTrainedModel. This was because we want to exclude the output layer from being targeted, even if it is a linear layer, and we can't determine this layer well except by convention. This PR lifts the restriction to PreTrainedModels. Thus, users can now target other models like diffusers models or custom models. The caveat is to use this "at your own risk", since all linear layers will be targeted, whether they are output layers or not.

Bugfix

While working on this, I found a potential bug. The logic for updating target_modules was that only the last part of the linear module's name was used. So e.g. if the module was named "foo.bar.baz", then "baz" was added to target_modules. This will lead to problems if there is another "baz" module that is not a linear layer. This bug was fixed by adding the full name ("foo.bar.baz" in this example) to the updated target_modules. This can potentially lead to a big target_modules list with a lot of near repetitions, but it's worth it to avoid targeting the wrong module. It is not clear to me why only the last part was added. The PR that added this to PEFT copied that part from here: https://github.com/artidoro/qlora/blob/7f4e95a68dc076bea9b3a413d2b512eca6d004e5/qlora.py#L248 but it's not clear why that repo did it that way. Maybe it was just to keep the set size smaller. The bug was uncovered by the unet test that is already present. Still, I extended this test, as well as another one, to better cover this potential issue, by ensuring that the number of target layers is as expected.

Backwards compatibility

Technically, this change breaks backwards compatibility. To go back to the previous example, let's say we have a module that is called "conv.baz" and that is a Conv2d layer. With the old behavior, since "baz" was added to the target_modules, we would also target this Conv2d layer, which is supported by LoRA. After merging this PR, the Conv2d layer would no longer be targeted. I'd argue this is the correct behavior and thus worth changing. Also, note that since we override target_modules, this is reflected in the adapter_config.json. Therefore, if a user loads an adapter that had this "baz" target, it will still work as it did previously.
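(To make the matching issue concrete, a small sketch with hypothetical module names; the suffix matching shown is a simplification of PEFT's actual name-matching logic.)

```python
import torch


class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # A linear layer whose full name is "foo.bar.baz" ...
        self.foo = torch.nn.ModuleDict(
            {"bar": torch.nn.ModuleDict({"baz": torch.nn.Linear(4, 4)})}
        )
        # ... and an unrelated non-linear module sharing the suffix "baz".
        self.conv = torch.nn.ModuleDict({"baz": torch.nn.Conv2d(1, 1, 1)})


model = Toy()

# Old behavior: only the suffix is stored, so "conv.baz" matches too.
suffix_targets = {"baz"}
print([n for n, _ in model.named_modules() if n.split(".")[-1] in suffix_targets])
# ['foo.bar.baz', 'conv.baz']  <- the Conv2d is wrongly targeted

# Fixed behavior: the full name is stored, so only the linear layer matches.
full_targets = {"foo.bar.baz"}
print([n for n, _ in model.named_modules() if n in full_targets])
# ['foo.bar.baz']
```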
What does this PR do?
Adds a shorthand 'ALL' to include all linear/Conv1D layers as target modules. Code adapted from the QLoRA repository.
Why is this needed?
Currently, when `target_modules` is None, PEFT will choose specific layers for different models based on the original paper for that adapter. For example, this is the query and value layers in the case of LoRA. However, this is far from the best configuration for LoRA. More recently, in QLoRA, it was found that adding LoRA to ALL the linear layers is important to match full fine-tuning performance. QLoRA-style training is very popular, so I believe this feature will be helpful.

Changes made

I've added a helper function in `tuners_utils.py` which will preprocess the `target_modules` argument specified by the user. I've also added some tests in `test_tuners_utils.py`.
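(For reference, a minimal usage sketch with the option name the discussion settled on, "all-linear"; the tiny checkpoint is just an example.)

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-GPTNeoXForCausalLM"
)

# Target every linear layer (except the output head) instead of listing names by hand.
config = LoraConfig(target_modules="all-linear")
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```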