Question: Regarding potential bottleneck during training of Classifiers #36
Hi @JP-SystemsX, sorry, this part could use some more detailed comments. I am happy someone questioned this, though, because the more people who do, the better the quality of the implementations will be.

1. Is this intended? Yes, it is! This is used to set different learning rates and gradient decay multipliers for different parameter groups (where one group is a "layer" in this specific small-text example). This reportedly helps against catastrophic forgetting, and moreover, it is part of the ULMFiT method. I first learned about this from "How to Fine-Tune BERT for Text Classification?", but if I remember correctly, I found such a separation into parameter groups in one of the first BERT implementations as well. As soon as I remember where, I will add this here.

2. What happens here? And why? The function `_get_layer_params()` assembles the per-layer parameter groups; from these, you can then see how the optimizer is constructed for the classifier.

3. Where is the classifier head? The actual model is referenced by the `classifier.model` attribute. For BERT, this is a `BertForSequenceClassification` object, in which you can in turn find the classifier head.

4. If so, can the training of the encoder be disabled to speed up the training process? Definitely possible. I am not sure how convenient it is with the current interfaces. Whether that is reasonable depends on your goal. Do you just want to minimize the training time? There are currently a lot of methods for parameter-efficient fine-tuning around; they might be what you are looking for. Admittedly, I haven't had the time yet to use them for active learning, but this would be interesting to me as well. There is, however, an excellent paper on this topic by @josipjukic. I am definitely interested in hearing more about your progress here. Also, if you find any interfaces that are difficult to extend (and there are some, I already have a few on my radar), please let me know.
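For context, this is roughly how discriminative (per-layer) learning rates are typically wired up in plain PyTorch. This is only a hedged sketch, not small-text's implementation; the model choice, `base_lr`, and `decay_factor` values are illustrative, and the embeddings/pooler are omitted for brevity.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative values; not small-text's actual defaults.
base_lr = 2e-5
decay_factor = 0.95  # lower layers get smaller learning rates

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One parameter group per encoder layer, with a learning rate that decays
# the further a layer is from the classification head.
encoder_layers = model.bert.encoder.layer
num_layers = len(encoder_layers)
param_groups = []
for i, layer in enumerate(encoder_layers):
    param_groups.append({
        "params": list(layer.parameters()),
        "lr": base_lr * decay_factor ** (num_layers - 1 - i),
    })

# The classifier head trains with the full base learning rate.
param_groups.append({"params": list(model.classifier.parameters()), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups, lr=base_lr)
```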
Hi @chschroeder, if I am right with that assumption, then I can see how all parameters are provided in the else clause of the code you marked. But if the condition is met, i.e. fine-tuning arguments are provided, it seems to include only the parameters of the encoder and to completely neglect the classifier, as can be seen here.
You are completely right. I think this might be a bug. It only occurs if fine-tuning arguments are set, as you correctly stated. I think the way the parameter dict is currently assembled is not optimal, and if I just added the missing parameters now, it would remain prone to similar bugs (for other models). Maybe we should collect all parameters first, and then try to override the "layer-specific" parameters. What do you think?
Sure, that sounds like a good idea.
Yes, I agree on this. What I have so far is a mechanism that checks all trainable parameters against our constructed set of parameters. If they are not the same size, we can raise an error and at least prevent such a situation from happening again (e.g., with incompatible models).
With the above mechanism, I think one possibility is to just collect all parameters and override those which we recognize as belonging to a specific layer. Alternatively, we can collect the layer-specific parameters first, and then add all others. Moreover, it has to be compatible with the existing gradual unfreezing mechanism, which has to be changed to work with this.
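A minimal sketch of such a check, assuming parameter groups are plain dicts with a "params" list as expected by PyTorch optimizers (the function names here are hypothetical, not small-text's API):

```python
from torch import nn


def trainable_tensor_count(model: nn.Module) -> int:
    # Every tensor the optimizer could legitimately update.
    return sum(1 for p in model.parameters() if p.requires_grad)


def check_param_groups(model: nn.Module, param_groups: list) -> None:
    # Raise if the constructed groups do not cover every trainable tensor,
    # e.g. because a classifier head was silently left out.
    grouped = sum(len(group["params"]) for group in param_groups)
    trainable = trainable_tensor_count(model)
    if grouped != trainable:
        raise ValueError(
            f"Parameter groups cover {grouped} tensors, "
            f"but the model has {trainable} trainable tensors."
        )
```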
Sounds like a promising approach. As I am fairly new to PyTorch, I would definitely be interested in seeing how this works out in code. The only objection I could come up with is the following: if I understand your approach correctly, it would treat all of the remaining modules the same. While inspecting the RoBERTa model, I found that this would include the token embedding, the classifier, and maybe a pooler. I am not sure whether that will pose a problem, but I imagine it might be at least a bit weird to freeze the lower encoder layers and keep the embedding unfrozen. Alternatively, one could address the classifier head directly by name.
This would have the nice side effect that one could freeze the entire encoder and train only the classifier, which speeds up training and decreases response time if it is used in combination with a real user. One could do the same for the embedding and the pooler.
To be fair, this is probably not the most elegant way to do it, and it is based on the assumption that the naming conventions are consistent across the Hugging Face classifier models, which doesn't have to be the case. But I would definitely include your ideas as well, not only because I want to see how this is done (even though I do), but because it might be a good way to avoid unexpected errors, for example when an unexpected layer appears.
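For illustration, a hedged sketch of the name-based variant described above, using the attribute names of a Hugging Face `RobertaForSequenceClassification` (which is exactly the model-specific assumption being discussed):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Freeze the whole encoder (including embeddings) by attribute name
# and keep only the classifier head trainable.
for param in model.roberta.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True
```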
You can obtain all trainable parameters via `model.parameters()` (or `model.named_parameters()`), filtering for those with `requires_grad` set.
Treating everything the same is the only thing I can think of that could support a wider range of models, exactly because you don't know how they named their embedding module, or if there are any other modules beyond this.
Freezing a layer is best done before the optimizer is constructed.
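A hedged, model-agnostic sketch of both points; `freeze` and `build_optimizer` are hypothetical helpers, not small-text functions:

```python
import torch
from torch import nn


def trainable_parameters(model: nn.Module):
    # Model-agnostic: every parameter autograd would still update.
    return [param for param in model.parameters() if param.requires_grad]


def freeze(module: nn.Module) -> None:
    # Exclude a module's parameters from gradient updates; doing this
    # before the optimizer is constructed keeps frozen parameters out of it.
    for param in module.parameters():
        param.requires_grad = False


def build_optimizer(model: nn.Module, lr: float = 2e-5):
    return torch.optim.AdamW(trainable_parameters(model), lr=lr)
```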
Sure, this also works, but it is similar to the way it is done now. The weakness here is that if my classifier head has a different variable name than 'classifier', it will be missed again. Either way, with the new check, this would at least raise an error. But if the more general approach works, I would be in favor of that.
How urgent is this for you? I have already started a fix, but I will not have much time during the next few days, so I cannot promise anything right now. It might be done within the week, otherwise likely within 1-2 weeks. If you are interested in contributing, you could of course try to provide a PR yourself. Afterwards, I will likely make a bugfix release (v1.3.1).
No, don't worry, it's not urgent at all for me.
_get_layer_params function didn't include all necessary Parameters and still didn't throw exception
I just made a PR, please check it before merging; as said, I am fairly new to PyTorch. While doing so, I noted a small bug in the calculation of the learning rate as well, i.e. a factor was sometimes raised to a negative power, which made the learning rate explode. I tried to fix that as well. All was done under the assumption that a more general approach would do either of two things.
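The learning-rate issue mentioned here is easy to reproduce with layer-wise decay when the exponent is computed incorrectly. Below is a hedged sketch of the failure mode and a clamped variant, not the library's actual code; the numbers are illustrative.

```python
BASE_LR = 2e-5
DECAY_FACTOR = 0.95  # < 1, so lower layers get smaller learning rates
NUM_LAYERS = 12


def layer_lr_buggy(layer_index: int) -> float:
    # If layer_index ever exceeds NUM_LAYERS (e.g. because extra modules
    # such as a pooler or head are counted), the exponent turns negative
    # and, with DECAY_FACTOR < 1, the learning rate explodes instead of shrinking.
    return BASE_LR * DECAY_FACTOR ** (NUM_LAYERS - layer_index)


def layer_lr_clamped(layer_index: int) -> float:
    # Clamping the exponent at zero keeps the learning rate bounded by BASE_LR.
    return BASE_LR * DECAY_FACTOR ** max(0, NUM_LAYERS - layer_index)
```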
Thanks for the new PR 👍. I am currently in the process of reviewing and have already learned and realized a lot.
I wonder whether this feature should have been included at all, but for now I will stick with the original idea: that we are able to use this for discriminative learning rates/gradual unfreezing on the most common transformer models.
Done. This is trickier than I thought, I am sorry. This is a difficult first PR ;). I have suggested a slightly different strategy than before: it basically assumes a particular structure of the transformer class and then flattens its layers into an ordered list (see the sketch below).
I think this is the most flexible approach we can take. Let me know what you think.
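For illustration, a hedged sketch of the structure such a strategy might assume for common Hugging Face classifiers, and one way the modules could be flattened into an ordered list of "layers" (the attribute names and model choice are assumptions, not necessarily what the PR does):

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Assumed structure: a base model (embeddings + encoder layers, possibly a pooler)
# plus a task-specific head (here: `classifier`).
base = model.base_model  # resolves to e.g. `model.roberta` or `model.bert`

# Flattened from "lowest" to "highest" layer, head last.
layers = [base.embeddings]
layers.extend(base.encoder.layer)
if getattr(base, "pooler", None) is not None:
    layers.append(base.pooler)
layers.append(model.classifier)
```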
Ok, I thought a bit more about it and found these 4 alternatives:
Please tell me which approach you prefer.
Do you mean the version before or after your PR? I found the version from your PR quite good in the end. Did you see my latest comment in the PR (*)? I would prefer this for now. This fixes the functionality "as is", and then we can think about deprecating or replacing it.
I thought about this as well, but even then: what is a layer? If we go by the autograd graph, we might be flexible, but still not get the results the user intends. Or am I wrong here? I think the most correct option would be a configuration mechanism (pass it in from "outside"), but then again, this is disgusting to set up ;). Edit: Do you agree here? If so, I will merge the PR. (*): My favored solution would be to merge your PR, i.e. the first of your presented options, but with your fixes.
So sorry, yesterday was a busy day, so I forgot to check. So yeah, I agree: the hook would only be a complicated fix for the rare problem that might arise when using the named_children method, and one would still need to "manually" open the encoder via its name. Touché: I forgot the option of passing it in from the "outside" ;). But yeah, I am fine with merging; no further changes were made since the PR. I would have made some if another approach had been desired. But if you are alright with the current version, I have no objection to the merge.
No worries! Thanks for the update and for your opinion on this. Unfortunately, I'm busy today, but I think I'll get the rest done within the next two days, and then we will have a bugfix release soon :). I will keep you updated.
Now, while adding the tests, I am wondering whether we should be stricter about asserting the existence of the expected modules, and which of them we should require at a minimum. Edit: By the way, do you want to be listed in the list of contributors?
… fine-tuning arguments bugfix (#36) Signed-off-by: Christopher Schröder <chschroeder@users.noreply.github.com>
Done. I cleaned up a little, extracted the function to a util, wrote tests, and added documentation and a changelog entry. The logic is otherwise exactly as you built it. Thank you for the great work, pleasant communication, and helpful discussions!
I am currently trying to optimize an AL pipeline using small-text; therefore, I am looking into the source code and came across something slightly bizarre-looking. In the integrations/transformers/classifiers/classification.py file, in the _get_layer_params() function, the code that collects the parameters to train gets executed. If I understand correctly, it implies that instead of training the head of the (in my case) RobertaForSequenceClassification (RSC) model, the whole encoder is trained and the head is neglected, which would be far more computationally expensive and quite a unique approach to the problem.
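To make the concern concrete, here is a hedged sketch of the kind of parameter-group construction being described (not small-text's actual code; `build_param_groups` and the module attributes are illustrative): when fine-tuning arguments are given, only encoder layers end up in the groups and the classification head is silently left out.

```python
def build_param_groups(model, base_lr, fine_tuning_args=None):
    if fine_tuning_args is None:
        # Everything ends up in a single group, head included.
        return [{"params": list(model.parameters()), "lr": base_lr}]

    # Problematic branch: only encoder layers are collected, so the
    # classification head is never handed to the optimizer at all.
    return [
        {"params": list(layer.parameters()), "lr": base_lr}
        for layer in model.base_model.encoder.layer
    ]
```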
Therefore, the following questions arise for me:
1. Is this intended?
2. What happens here? And why?
3. Where is the classifier head?
4. If so, can the training of the encoder be disabled to speed up the training process?