zysszy / TreeGen

A Tree-Based Transformer Architecture for Code Generation. (AAAI'20)

Why was TreeGen not trained with the standard warm up scheduler for Transformers? #19

Open brando90 opened 3 years ago

brando90 commented 3 years ago

Hi,

I was wondering: why was TreeGen not trained with the standard warm-up scheduler (or RAdam)? It seems to be an essential piece for training most NLP Transformers, so I was curious whether this was tried and, if not, what the process was, more or less, for selecting the optimizer (which seems like a crucial piece of the puzzle).

Thanks again! :)

brando90 commented 3 years ago

(Especially because it is quite surprising that just plugging in Adafactor with default parameters, with no further hyperparameter tuning, was enough; in other Transformer work the warm-up seems essential, so perhaps there is something important in this detail.)
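
For reference (purely illustrative, not the repository's actual training code), this is roughly what I mean by "Adafactor with default parameters": relative step sizes, parameter scaling, and no warm-up, shown here with the Hugging Face PyTorch implementation and a placeholder model.

```python
import torch
from transformers import Adafactor  # Hugging Face implementation, used only for illustration

model = torch.nn.Linear(512, 512)  # stand-in for the real model

# Default Adafactor setup: no explicit learning rate, relative step sizes,
# parameter-scale-aware updates, and no warm-up of the step size.
optimizer = Adafactor(
    model.parameters(),
    lr=None,
    scale_parameter=True,
    relative_step=True,
    warmup_init=False,
)
```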

brando90 commented 3 years ago

I am particularly interested in:

  1. the warm-up (if used at all)
  2. the decay/annealing (if used at all); see the sketch below
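
For concreteness, here is the schedule I am referring to: the "Noam" schedule from the original Transformer paper, i.e. linear warm-up followed by inverse-square-root decay. The `d_model=512` and `warmup_steps=4000` values are the paper's common defaults, used purely as placeholders, not TreeGen settings.

```python
# Standard Transformer learning-rate schedule (warm-up + inverse-sqrt decay).
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate grows linearly until step == warmup_steps, then decays as 1/sqrt(step).
```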
zysszy commented 3 years ago

We found that TreeGen trained with or without the warm-up scheduler achieves very similar performance. Thus, we do not use the warm-up scheduler.

In our experiments, we only use Adafactor with default parameters. We think the core contribution of TreeGen is the components we proposed. However, just as you said, maybe there exists a better optimizer that can further improve the performance of TreeGen.

Zeyu

brando90 commented 3 years ago

Hi Zeyu,

As always thanks for your responses!

What do you mean by:

We found that TreeGen trained with or without the warm-up scheduler achieves very similar performance. Thus, we do not use the warm-up scheduler.

Does that mean you never trained TreeGen with standard Transformer optimizers, e.g. Adam + warm-up + decay scheduler, and only used Adafactor in your experiments? What I am most curious about right now is which optimizers you tested TreeGen with and which experiments you ran with respect to optimizers and their settings.
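
(To be concrete, the "Adam + warm-up + decay" recipe I have in mind looks roughly like the sketch below; it is just an illustrative PyTorch snippet with placeholder values, not anything taken from the TreeGen code.)

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the real model

# Adam with the hyperparameters from the original Transformer paper;
# the base lr of 1.0 is scaled by the schedule below.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam_factor(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_factor)

# Per training step: optimizer.step() followed by scheduler.step().
```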

(BTW, I think I understand that your major contribution is that TreeGen builds structural priors into the architecture, e.g. TreeConv, repeated depth embeddings, path embeddings, gating for rules/characters, and the like; although I think an explicit "main contributions" bullet list is always nice to have.)

Thanks in advance!

zysszy commented 3 years ago

Does that mean you never trained TreeGen with standard Transformer optimizers, e.g. Adam + warm-up + decay scheduler, and only used Adafactor in your experiments?

Yes, we only use Adafactor to train TreeGen.

Zeyu