Open sumitsinha opened 5 years ago
I am getting the same attention values for each time step. Is this supposed to happen?
I have noticed that behavior as well
Thanks for bringing this to my attention. I will look into this. If you have any suggestions why this is happening, please feel free to share your ideas with me.
So I looked into the ExternalAttentionRNNWrapper class, and I'd like to share some observations regarding the 'constant' attention over time steps:
1) It's not a code issue: attention is updated at every time step, as written in your library, but the updates shrink to ~0 very quickly, which makes it look constant in inference mode.
2) In the step function, "additive_atn = total_x_static_prod + hw": the first term is the 'image' after a linear transformation and is constant over time; the second term, 'hw', derived from the output at t-1, should be the one making additive_atn (and consequently 'attention') time-dependent, but in practice its influence is very small, leaving us with a seemingly constant attention.
3) I played with this term and modified it to "additive_atn = tf.tanh(total_x_static_prod + hw)", and it helped: attention is no longer constant over time, and for a given time step the components of attention are more uniform over space (see the sketch after this list). Yet after hours of training the 'constant' behavior started to resurface, and my training loss, while still lower with the added tanh, plateaued.
4) The tanh idea came from the Soft Attention paragraph of https://jhui.github.io/2017/03/15/Soft-and-hard-attention/.
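To make point 2 concrete, here is a minimal sketch of the additive scoring step being discussed. The names total_x_static_prod and hw follow the thread; the scoring vector U, the shapes, and the final softmax are my assumptions about how the wrapper computes the weights:

```python
# Minimal sketch of the additive attention step, not the library's exact code.
# total_x_static_prod and hw follow the thread; U and all shapes are assumptions.
import tensorflow as tf

def attention_step(total_x_static_prod, hw, U, use_tanh=False):
    # total_x_static_prod: (batch, locations, units), static image projection
    # hw:                  (batch, 1, units), projection of the t-1 output
    # U:                   (units, 1), scoring vector
    if use_tanh:
        # Squashing the sum makes the score a nonlinear function of hw,
        # so hw can change the *relative* scores across locations.
        additive_atn = tf.tanh(total_x_static_prod + hw)
    else:
        additive_atn = total_x_static_prod + hw  # behavior under discussion
    scores = tf.einsum('blu,uo->blo', additive_atn, U)  # (batch, locations, 1)
    return tf.nn.softmax(scores, axis=1)                # weights over locations
```

One hedged note on why the tanh matters here: if hw is broadcast identically across all locations and the scores are a linear function of additive_atn (as in the else branch above), then hw shifts every location's score by the same amount, and softmax is invariant to a uniform shift, so hw cannot change the attention at all. The tanh breaks that linearity.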
Please let me know what you think :)
Thanks!
Thanks for these observations! I will run some experiments on my own and come back to you afterward.
@bloupo Sorry for my late response, but I was busier than I hoped to be in the past two weeks. Something you could try out is to modify line 590 of the ExternalAttentionRNNWrapper class: change it from

```python
additive_atn = total_x_static_prod + hw
```

to

```python
additive_atn = K.tanh(total_x_static_prod) + K.tanh(hw)
```

or something similar. This might prevent the step update from depending mainly/only on the time-invariant input. It would be great if you could test this.
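For whoever runs these experiments, a quick sanity check is to measure how much the attention map actually changes between consecutive time steps. A minimal sketch, assuming you have already extracted the attention weights into a NumPy array of shape (time_steps, locations):

```python
# Hedged sketch of a diagnostic, not part of the library. `attention` is
# assumed to be a NumPy array of shape (time_steps, locations).
import numpy as np

def attention_drift(attention):
    # Mean absolute change between consecutive time steps;
    # values near 0 mean the attention is effectively constant over time.
    return np.abs(np.diff(attention, axis=0)).mean(axis=1)

# Example: a map repeated over 10 time steps drifts by exactly 0.
constant = np.tile(np.random.rand(1, 64), (10, 1))
print(attention_drift(constant))  # -> nine zeros
```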