Why was TreeGen not trained with the standard warm up scheduler for Transformers? #19

Open
brando90 opened this issue Aug 4, 2021 · 5 comments

Comments


brando90 commented Aug 4, 2021

Hi,

I was wondering: why was TreeGen not trained with the standard warm-up scheduler for Transformers (or RAdam)? Warm-up seems to be an essential piece for training most NLP Transformers, so I was curious whether it was tried and, if not, roughly how the optimizer was selected (that seems like a crucial piece of the puzzle).
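
For concreteness, by "standard warm-up scheduler" I mean the inverse-square-root (Noam) schedule from the original Transformer paper, roughly like the sketch below. This is just my own illustration (PyTorch, with placeholder values for `d_model` and `warmup_steps`), not anything taken from the TreeGen code:

```python
# Rough sketch of the "Attention Is All You Need" warm-up schedule:
# linear warm-up followed by inverse-square-root decay.
# d_model=256 and warmup_steps=4000 are placeholders, not TreeGen settings.
import torch

def noam_lr(step, d_model=256, warmup_steps=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(256, 256)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(10):   # training loop skeleton
    optimizer.step()     # ... after backward() on a real loss
    scheduler.step()
```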

Thanks again! :)

brando90 (Author) commented Aug 4, 2021

(Especially because it is quite surprising that just plugging in Adafactor with default parameters, and no further hyperparameter tuning, was enough; in other Transformer work the schedule seems essential, so perhaps there is something important in this detail.)
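
To make sure we are talking about the same thing, this is the kind of "Adafactor with defaults" setup I have in mind. I am using the Hugging Face transformers implementation purely as an illustration, since I do not know which implementation you actually used:

```python
# Sketch only: Adafactor used "as-is", via the Hugging Face `transformers`
# implementation as an example; the TreeGen authors may have used a
# different implementation (e.g. the tensor2tensor one).
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(256, 256)  # stand-in for the real model

# Default-style usage: lr=None with relative_step=True lets Adafactor
# pick its own internal step-size schedule, with no external warm-up/decay.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    relative_step=True,
    scale_parameter=True,
    warmup_init=False,
)
```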

brando90 (Author) commented Aug 5, 2021

I am particularly interested in:

  1. the warm up (if used at all)
  2. the decay/annealing (if used at all); something like the sketch below is what I have in mind
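
By these two pieces I mean, for example, the common linear warm-up followed by linear decay setup. This is only a sketch with placeholder step counts, using the transformers helper for illustration:

```python
# Sketch of a common warm-up + decay (annealing) setup: linear warm-up
# for the first `num_warmup_steps`, then linear decay to zero.
# Step counts and lr are placeholders, not TreeGen hyperparameters.
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(256, 256)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=100_000,
)

for step in range(10):   # training loop skeleton
    optimizer.step()     # ... after backward() on a real loss
    scheduler.step()
```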

zysszy (Owner) commented Aug 9, 2021

We found that TreeGen trained with / without a warm-up scheduler achieves very similar performance. Thus, we do not use the warm-up scheduler.

In our experiments, we only use Adafactor with default parameters. We think the main contributor to TreeGen's performance is the components we proposed. However, just as you said, maybe there exists a better optimizer that can further improve the performance of TreeGen.

Zeyu

brando90 (Author) commented Aug 9, 2021

Hi Zeyu,

As always, thanks for your responses!

What do you mean by:

We found that TreeGen trained with / without a warm-up scheduler achieves very similar performance. Thus, we do not use the warm-up scheduler.

Does that mean you never trained TreeGen with the standard Transformer optimizer setup, e.g. Adam + warm-up + decay scheduler, but only used AdaFactor in your experiments? What I am most curious about right now is which optimizers you tested TreeGen with and what experiments you ran with respect to optimizers and their settings.

(By the way, I think I understand that your major contribution is that TreeGen builds structural priors into the architecture, e.g. TreeConv, repeated depth embeddings, path embeddings, gating for rules/characters, and so on; although I think an explicit "main contributions" bullet list is always nice to have.)

Thanks in advance!

zysszy (Owner) commented Aug 10, 2021

Does that mean you never trained TreeGen with the standard Transformer optimizer setup, e.g. Adam + warm-up + decay scheduler, but only used AdaFactor in your experiments?

Yes, we only use AdaFactor to train TreeGen.

Zeyu
