Open litcoderr opened 4 years ago
The implementation of (4)-(6) is at model.py, lines 307-308. The last step is in layers.py. The code works by first collecting the hidden states for all time steps, then generating the images all at once.
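To illustrate that pattern, here is a minimal sketch (with hypothetical names and shapes, not the repo's actual modules) of running the recurrent cell over every time step first and only then decoding all hidden states into images in one batched call:

```python
import torch
import torch.nn as nn

hidden_dim, steps, batch = 64, 5, 2
gru = nn.GRUCell(hidden_dim, hidden_dim)
decoder = nn.Linear(hidden_dim, 3 * 8 * 8)  # stand-in for the image generator

inputs = torch.randn(steps, batch, hidden_dim)
h = torch.zeros(batch, hidden_dim)
hiddens = []
for t in range(steps):
    # accumulate hidden states for all time steps first
    h = gru(inputs[t], h)
    hiddens.append(h)

# stack (T, B, H) -> (T*B, H) and generate all images at once
all_h = torch.stack(hiddens).view(-1, hidden_dim)
images = decoder(all_h).view(steps, batch, 3, 8, 8)
```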
According to the paper, Equations (4)-(6) describe the 'Text2Gist' module, which takes h(t-1) and i(t) from the GRU as input. But in your code,
crnn_code = self.motion_content_rnn(motion_input, content_mean)
Thx
I have the same question; has anyone figured it out?
I am also confused about the RNN implementation part:
There are two GRU cells defined in the code, acting as the two layers of the proposed RNN model. One is a normal GRU (GRU-1) and the other (GRU-2) belongs to the Text2Gist cell.
As mentioned in the paper, GRU-1 takes the concatenated sentence and noise as input and outputs i_t: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L356
The GRU-2 code in Text2Gist (Equations 4-6) is pointed to in https://github.com/yitong91/StoryGAN/issues/15#issuecomment-586750565: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L307-L308
However, the GRU-2 code takes motion_input (the sentences) as input: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L355, which I think is inconsistent with Equations (3) and (4) in the paper, where the input to GRU-2 should be i_t (the output of GRU-1).
Also, for Equation (7), the input to the Filter should be i_t (the output of GRU-1), while in the code it is the output of GRU-2: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L366
So that is where my confusion comes from, though I might be wrong (please correct me if I am). I also wonder whether this code is intended for the toy data (i-CLEVR for StoryGAN) or is a different version of the "Text2Gist" part?
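For concreteness, here is a hedged sketch of the data flow as I read Equations (3)-(7) in the paper; all names and dimensions here are hypothetical, not taken from the repo. The point is that i_t from GRU-1 should feed both GRU-2 (Text2Gist) and the Filter:

```python
import torch
import torch.nn as nn

text_dim, noise_dim, hid = 32, 16, 32

gru1 = nn.GRUCell(text_dim + noise_dim, hid)  # GRU-1: sentence + noise -> i_t
gru2 = nn.GRUCell(hid, hid)                   # GRU-2 (Text2Gist, simplified to a plain GRU cell)
make_filter = nn.Linear(hid, hid)             # Eq. (7): Filter built from i_t

s_t = torch.randn(1, text_dim)    # sentence encoding
eps = torch.randn(1, noise_dim)   # Gaussian noise
g_prev = torch.zeros(1, hid)      # GRU-1 hidden state
h_prev = torch.zeros(1, hid)      # Text2Gist hidden state h(t-1)

i_t = gru1(torch.cat([s_t, eps], dim=1), g_prev)  # Eq. (3): i_t from GRU-1
h_t = gru2(i_t, h_prev)                           # Eqs. (4)-(6): GRU-2 consumes i_t, not the raw sentence
filt = make_filter(i_t)                           # Eq. (7): Filter takes i_t, not GRU-2's output
```

In the repo, by contrast, GRU-2 appears to consume motion_input directly and the Filter consumes GRU-2's output, which is the discrepancy described above.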
Any explanations would be appreciated :)
Hi. I was interested in multi-modal video generation tasks and came across your paper.
My issue is that I am having a hard time interpreting your code. For example, what are 'motion features' and 'content features'? They are not specified in the paper or in the code. If you have time, please add some documentation.
Where is this implementation? I might be misunderstanding; if someone understands the implementation, please tell me. Thx
Overall, a very interesting paper. Thanks!