[image captioning] model picture

Hi,

In your model picture, the output of the LSTM at the 1st timestep (when the input is the image feature vector) is "\<start>", which is then fed back into the LSTM at the 2nd timestep. However, I don't think you actually train the LSTM to output the "\<start>" token when the input is the image features, right?

So a more accurate picture would be something like this: image. This is also closer to the figure on page 4 of the Show & Tell paper by Vinyals et al. (link).
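
To make the difference concrete, here is a minimal sketch of the two input/target alignments I have in mind. It is plain Python with no model involved, and it does not claim to show what the training code in this repo actually does. The function names, the `"<image>"` placeholder for the feature vector, and the example caption are all made up for illustration.

```python
def align_as_drawn(caption_words):
    """Alignment implied by the current picture: the image features
    are expected to produce "<start>" at the 1st timestep."""
    tokens = ["<start>"] + caption_words + ["<end>"]
    inputs = ["<image>"] + tokens[:-1]   # image features, then shifted caption
    targets = tokens                     # "<start>" is the 1st target
    return list(zip(inputs, targets))


def align_show_and_tell(caption_words):
    """Alignment as I read the Show & Tell figure: the image step has
    no target (None here), and "<start>" predicts the first word."""
    tokens = ["<start>"] + caption_words + ["<end>"]
    inputs = ["<image>"] + tokens[:-1]
    targets = [None] + tokens[1:]        # no loss on the image step
    return list(zip(inputs, targets))


if __name__ == "__main__":
    caption = ["a", "dog"]  # hypothetical caption
    for name, fn in [("as drawn", align_as_drawn),
                     ("Show & Tell", align_show_and_tell)]:
        print(name)
        for step, (inp, tgt) in enumerate(fn(caption), start=1):
            print(f"  t={step}: input={inp!r} -> target={tgt!r}")
```

The two alignments only differ at the 1st timestep, which is exactly the part of the picture I am asking about.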

Unless I'm mistaken of course :). Cheers!

yunjey / pytorch-tutorial

[image captioning] model picture #177