zihangdai / xlnet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

Experiment using state of the art activation functions #219

Open LifeIsStrange opened 5 years ago

LifeIsStrange commented 5 years ago

State-of-the-art activation functions are benchmarked in this paper and are shown to have a big impact on the precision and recall of a neural net. BERT moved from ReLU to GELU, but GELU is not the end of progress.

This paper introduces a new activation function (the first of its kind), with two variants named Swish and Swish-1. It would be a low-hanging fruit and an interesting experiment to try both variants on XLNet and see if they improve the state of the art.
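For reference, a minimal TensorFlow 1.x-style sketch of the two variants (the function names and the trainable-beta variable below are my own illustration, not something already in this repo). Swish-1 is simply Swish with beta fixed at 1:

```python
import tensorflow as tf

def swish1(x):
    """Swish-1: x * sigmoid(x), i.e. Swish with beta fixed at 1."""
    return x * tf.sigmoid(x)

def swish(x, name="swish_beta"):
    """Swish with a trainable scalar beta: x * sigmoid(beta * x).

    The variable name/scope here is illustrative only.
    """
    beta = tf.get_variable(name, shape=[], dtype=x.dtype,
                           initializer=tf.ones_initializer())
    return x * tf.sigmoid(beta * x)
```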

Other new activation functions could also be tried, as shown in this other paper.

LifeIsStrange commented 5 years ago

It follows the same line of thought as https://github.com/zihangdai/xlnet/issues/216

LifeIsStrange commented 5 years ago

Actually, Mish seems even more interesting to try first than Swish. From the paper: the experiments show that Mish tends to work better than both ReLU and Swish, along with other standard activation functions, in many deep networks across challenging datasets. For instance, in Squeeze Excite Net-18 for CIFAR-100 classification, the network with Mish had an increase in Top-1 test accuracy of 0.494% and 1.671% compared to the same network with Swish and ReLU respectively. The similarity to Swish, together with the boost in performance and its simplicity of implementation, makes it easy for researchers and developers to use Mish in their neural network models.

https://arxiv.org/abs/1908.08681v1
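If someone wants to try it, a minimal sketch of Mish in the TensorFlow 1.x style used by this repo could look like this (the function name is my own; it is not part of XLNet):

```python
import tensorflow as tf

def mish(x):
    """Mish activation: x * tanh(softplus(x)), from arXiv:1908.08681."""
    return x * tf.tanh(tf.nn.softplus(x))
```

In principle this could be exposed as an extra option wherever the code currently switches between 'relu' and 'gelu' (the ff_activation flag, if I read the flags correctly), so the experiment would stay a one-line config change.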