Why was TreeGen not trained with the standard warm up scheduler for Transformers? #19
Comments
I am particularly interested in this point, especially because it is quite surprising that just plugging in Adafactor with default parameters and no further hyperparameter tuning was needed; in other Transformer work a warm-up schedule seems like an essential ingredient, so perhaps there is something important in this detail.
We found that TreeGen trained with or without the warm-up scheduler achieves very similar performance, so we do not use the warm-up scheduler. In our experiments, we only use Adafactor with default parameters. We think the core contribution of TreeGen is the components we proposed. However, just as you said, maybe there exists a better optimizer that could further improve the performance of TreeGen. Zeyu
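For readers unfamiliar with this setup, here is a minimal sketch of what "Adafactor with default parameters" can look like, using the Adafactor implementation from the Hugging Face transformers package in PyTorch. The tiny stand-in model and training loop are illustrative assumptions only, not the authors' actual TreeGen training code.

```python
# Minimal sketch (assumption: not the authors' actual training code).
# Shows Adafactor with its default settings, where the learning rate is
# derived internally from the step count (relative_step=True), so no
# external warm-up scheduler is attached.
import torch
from transformers.optimization import Adafactor

model = torch.nn.Linear(256, 256)  # stand-in for the TreeGen model

optimizer = Adafactor(
    model.parameters(),
    lr=None,                # let Adafactor compute a relative step size
    scale_parameter=True,   # default: scale updates by parameter RMS
    relative_step=True,     # default: internal lr decays as ~1/sqrt(step)
    warmup_init=False,      # default: no warm-up ramp on the internal lr
)

for step in range(3):  # placeholder training loop
    loss = model(torch.randn(8, 256)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```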
Hi Zeyu, as always, thanks for your responses! What exactly do you mean by that? Does that mean you never trained TreeGen with standard Transformer optimizers, e.g. Adam + warm-up + decay scheduler, but only used Adafactor in your experiments? What I am most curious about right now is which optimizers you tested TreeGen with and the experiments you did with respect to optimizers and their settings. (By the way, I think I understand that your major contribution is that TreeGen has structural priors in the architecture, e.g. TreeConv, repeated depth embeddings, path embeddings, gating for rules/characters, etc., although I think an explicit "main contributions" bullet list is always nice to have.) Thanks in advance!
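For comparison, here is a sketch of the "standard" Transformer recipe referred to above: Adam with the inverse-square-root warm-up/decay schedule from the original Transformer paper. The constants (d_model, warmup_steps) and the stand-in model are illustrative assumptions, not values used for TreeGen.

```python
# Sketch of the standard Transformer schedule (assumed values, not from the paper):
# lr(step) = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
import torch

d_model, warmup_steps = 256, 4000  # illustrative constants
model = torch.nn.Linear(d_model, d_model)  # stand-in model

optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step: int) -> float:
    step = max(step, 1)  # avoid division by zero on the first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(3):  # placeholder training loop
    loss = model(torch.randn(8, d_model)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```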
Yes, we only use Adafactor to train TreeGen. Zeyu
Hi,
I was wondering: why was TreeGen not trained with the standard warm-up scheduler (or RAdam)? It seems to be an essential piece for training most NLP Transformers, so I was curious whether this was tried and, if not, what the process was, more or less, for selecting the optimizer (that seems like a crucial piece of the puzzle).
Thanks again! :)
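Regarding the RAdam alternative mentioned above, here is a minimal sketch using torch.optim.RAdam, which rectifies the adaptive learning rate in the early steps and is the usual argument for skipping an explicit warm-up phase. The model and learning rate are placeholders, not TreeGen settings.

```python
# Sketch only: RAdam as a warm-up-free alternative (placeholder model and lr).
import torch

model = torch.nn.Linear(256, 256)        # stand-in model, not TreeGen
optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)

for step in range(3):                    # placeholder training loop
    loss = model(torch.randn(8, 256)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```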