zimmerrol / show-attend-and-tell-keras

Keras implementation of the "Show, Attend and Tell" paper
MIT License
26 stars 12 forks

Same attention Values for all timesteps #1

Open sumitsinha opened 5 years ago

sumitsinha commented 5 years ago

I am getting the same attention values for each time step, is this supposed to happen?

bloupo commented 5 years ago

I have noticed this behavior as well.

zimmerrol commented 5 years ago

Thanks for bringing this to my attention. I will look into this. If you have any suggestions why this is happening please feel free to share your ideas with me.

bloupo commented 5 years ago

> Thanks for bringing this to my attention. I will look into this. If you have any suggestions why this is happening please feel free to share your ideas with me.

So I looked into the `ExternalAttentionRNNWrapper` class and I'd like to share some observations regarding the 'constant' attention over time steps:

1) It's not a code issue: attention is updated at every time step, as written in your library, but the updates shrink to ~0 very quickly, which makes it look constant in inference mode.
2) In the step function, in `additive_atn = total_x_static_prod + hw`, the first term is the 'image' after a linear transformation and is constant over time. The second term, `hw`, computed from the output at t-1, should be the one making `additive_atn` (and consequently the attention) time dependent, but somehow its influence is very small, leaving us in practice with a seemingly constant attention.
3) I played with this term and modified it to `additive_atn = tf.tanh(total_x_static_prod + hw)`, and it helped: attention is no longer constant over time, and for a given time step the components of attention are more uniform over space. Yet it seems that after hours of training the 'constant' behavior starts to resurface, and my training loss, while still lower with the added tanh, plateaued.
4) The tanh idea came from https://jhui.github.io/2017/03/15/Soft-and-hard-attention/, in the Soft Attention paragraph.
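The effect described in points 1) and 2) can be sketched numerically. The following is a standalone NumPy illustration with made-up values, not the repository's actual code; `total_x_static_prod` and `hw` here are small stand-in vectors for the wrapper's tensors:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Projected image features: constant across all decoding time steps.
total_x_static_prod = np.array([5.0, -3.0, 2.0, 0.5])

def attention(hw):
    # Additive score as in the wrapper: static term plus hidden-state term.
    additive_atn = total_x_static_prod + hw
    return softmax(additive_atn)

# If the hidden-state projection collapses toward ~0 during training,
# the scores are dominated by the static term at every time step...
atn_t1 = attention(np.array([1e-3, -2e-3, 0.0, 1e-3]))
atn_t2 = attention(np.array([-1e-3, 1e-3, 2e-3, 0.0]))

# ...so the attention maps at different time steps are nearly identical.
print(np.max(np.abs(atn_t1 - atn_t2)))  # ~0: looks constant over time
```

With `hw` near zero, the softmax output barely changes between time steps, which matches the 'constant attention' seen in inference.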

Please let me know what you think :)

Thanks!

zimmerrol commented 5 years ago

Thanks for these observations! I will run some experiments on my own and come back to you afterward.

zimmerrol commented 5 years ago

@bloupo Sorry for my late response, but I was busier than I hoped to be in the past two weeks. Something you could try out is to modify line 590 of the ExternalAttentionRNNWrapper class: Change it from

additive_atn = total_x_static_prod + hw

to

additive_atn = K.tanh(total_x_static_prod) + K.tanh(hw)

or something similar. This might prevent the step update from depending mainly/only on the time-invariant input. It would be great if you could test this.
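As a quick numerical check of this suggestion, here is a standalone NumPy sketch with made-up values; `np.tanh` stands in for `K.tanh`, and the arrays are not the model's real tensors:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up values: a large static image projection and two small,
# time-varying hidden-state projections.
total_x_static_prod = np.array([5.0, -3.0, 2.0, 0.5])
hw_t1 = np.array([0.2, -0.1, 0.3, 0.0])
hw_t2 = np.array([-0.3, 0.4, -0.2, 0.1])

def atn_original(hw):
    # Original score: the large static term dominates the sum.
    return softmax(total_x_static_prod + hw)

def atn_suggested(hw):
    # Suggested change: squashing each term separately caps the static
    # term at [-1, 1], so hw keeps a comparable influence on the score.
    return softmax(np.tanh(total_x_static_prod) + np.tanh(hw))

diff_original = np.max(np.abs(atn_original(hw_t1) - atn_original(hw_t2)))
diff_suggested = np.max(np.abs(atn_suggested(hw_t1) - atn_suggested(hw_t2)))
print(diff_original, diff_suggested)
```

In this toy setting the suggested variant produces a larger change in attention between time steps, since the static term can no longer drown out `hw`; whether that carries over to the trained model is exactly what needs testing.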