Why was TreeGen not trained with the standard warm up scheduler for Transformers? #19
Comments
I am particularly interested in this point, especially because it is quite surprising that just plugging in Adafactor with default parameters and no further hyperparameter tuning was needed; in other Transformer work a warm-up schedule seems like an essential ingredient, so perhaps there is something important in this detail.
We found that TreeGen trained with or without the warm-up scheduler achieves very similar performance, so we do not use the warm-up scheduler. In our experiments, we only use Adafactor with default parameters. We think the core contribution of TreeGen is the components we proposed. However, just as you said, maybe there exists a better optimizer that could further improve the performance of TreeGen. Zeyu
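For readers unfamiliar with this setup, here is a minimal sketch of what "Adafactor with default parameters" can look like, using the Adafactor implementation from the Hugging Face transformers package in PyTorch. The tiny stand-in model and training loop are illustrative assumptions only, not the authors' actual TreeGen training code.

```python
# Minimal sketch (assumption: not the authors' actual training code).
# Shows Adafactor with its default settings, where the learning rate is
# derived internally from the step count (relative_step=True), so no
# external warm-up scheduler is attached.
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(256, 256)  # stand-in for the TreeGen model

optimizer = Adafactor(
    model.parameters(),
    lr=None,                # let Adafactor compute a relative step size
    scale_parameter=True,   # default: scale updates by parameter RMS
    relative_step=True,     # default: internal lr decays as ~1/sqrt(step)
    warmup_init=False,      # default: no warm-up ramp on the internal lr
)

for step in range(3):  # placeholder training loop
    loss = model(torch.randn(8, 256)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```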
Hi Zeyu, as always, thanks for your responses! What exactly do you mean by that? Does that mean you never trained TreeGen with standard Transformer optimizers, e.g. Adam + warm-up + decay scheduler, but only used Adafactor in your experiments? What I am most curious about right now is which optimizers you tested TreeGen with and the experiments you did with respect to optimizers and their settings. (By the way, I think I understand that your major contribution is that TreeGen has structural priors in the architecture, e.g. TreeConv, repeated depth embeddings, path embeddings, gating for rules/characters, etc., although I think an explicit "main contributions" bullet list is always nice to have.) Thanks in advance!
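For comparison, here is a sketch of the "standard" Transformer recipe referred to above: Adam with the inverse-square-root warm-up/decay schedule from the original Transformer paper. The constants (d_model, warmup_steps) and the stand-in model are illustrative assumptions, not values used for TreeGen.

```python
# Sketch of the standard Transformer schedule (assumed values, not from the paper):
# lr(step) = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
import torch

d_model, warmup_steps = 256, 4000  # illustrative constants
model = torch.nn.Linear(d_model, d_model)  # stand-in model

optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step: int) -> float:
    step = max(step, 1)  # avoid division by zero on the first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(3):  # placeholder training loop
    loss = model(torch.randn(8, d_model)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```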
Yes, we only use Adafactor to train TreeGen. Zeyu
Hi,
I was wondering: why was TreeGen not trained with the standard warm-up scheduler (or RAdam)? It seems to be an essential piece for training most NLP Transformers, so I was curious whether this was tried and, if not, what the process was, more or less, for selecting the optimizer (that seems like a crucial piece of the puzzle).
Thanks again! :)
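Regarding the RAdam alternative mentioned above, here is a minimal sketch using torch.optim.RAdam, which rectifies the adaptive learning rate in the early steps and is the usual argument for skipping an explicit warm-up phase. The model and learning rate are placeholders, not TreeGen settings.

```python
# Sketch only: RAdam as a warm-up-free alternative (placeholder model and lr).
import torch

model = torch.nn.Linear(256, 256)        # stand-in model, not TreeGen
optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)

for step in range(3):                    # placeholder training loop
    loss = model(torch.randn(8, 256)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```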