Open huaweiboy opened 4 years ago
I have a question: in the paper you use PReLU/Dice as the attention activation function, but in your code you use sigmoid. Why is that?
This implementation is a demo on the Amazon data. The activation function used for the attention weights is neither a contribution of this paper nor important.
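For context, here is a minimal sketch (not the repo's actual code; all names and shapes are hypothetical) of a DIN-style local activation unit, showing that sigmoid, PReLU, or Dice only changes the hidden-layer activation of the small attention MLP, while the rest of the structure stays the same:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prelu(x, alpha=0.25):
    return np.where(x > 0, x, alpha * x)

def dice(x, eps=1e-8):
    # Simplified Dice: p(x) * x, where p(x) is a sigmoid of the standardized input
    # (no learned alpha or running batch statistics, unlike the paper's version).
    p = sigmoid((x - x.mean()) / np.sqrt(x.var() + eps))
    return p * x

def attention_weight(query, key, w1, b1, w2, b2, hidden_act=sigmoid):
    """DIN-style local activation unit, forward pass only.

    query, key : (d,) candidate-ad and behavior embeddings
    hidden_act : hidden-layer activation; sigmoid, prelu, or dice
    Returns an unnormalized attention weight (scalar); DIN does not softmax-normalize.
    """
    # Feature interaction: concatenate query, key, and their element-wise product.
    x = np.concatenate([query, key, query * key])
    h = hidden_act(x @ w1 + b1)       # hidden layer with the chosen activation
    return float(h @ w2 + b2)         # linear output layer

# Usage: swapping the activation changes only the hidden layer, not the design.
rng = np.random.default_rng(0)
d, hid = 8, 16
q, k = rng.normal(size=d), rng.normal(size=d)
w1, b1 = rng.normal(size=(3 * d, hid)), np.zeros(hid)
w2, b2 = rng.normal(size=hid), 0.0
for act in (sigmoid, prelu, dice):
    print(act.__name__, attention_weight(q, k, w1, b1, w2, b2, hidden_act=act))
```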