Open litcoderr opened 4 years ago
The implementation of (4)-(6) is at model.py, lines 307-308. The last step is in layers.py. The code works by first collecting the hidden states for all time steps, then generating the images all at once.
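To illustrate that pattern, here is a minimal sketch (with hypothetical names and shapes, not the repo's actual modules) of running the recurrent cell over every time step first and only then decoding all hidden states into images in one batched call:

```python
import torch
import torch.nn as nn

hidden_dim, steps, batch = 64, 5, 2
gru = nn.GRUCell(hidden_dim, hidden_dim)
decoder = nn.Linear(hidden_dim, 3 * 8 * 8)  # stand-in for the image generator

inputs = torch.randn(steps, batch, hidden_dim)
h = torch.zeros(batch, hidden_dim)
hiddens = []
for t in range(steps):
    # accumulate hidden states for all time steps first
    h = gru(inputs[t], h)
    hiddens.append(h)

# stack (T, B, H) -> (T*B, H) and generate all images at once
all_h = torch.stack(hiddens).view(-1, hidden_dim)
images = decoder(all_h).view(steps, batch, 3, 8, 8)
```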
According to the paper, Equations (4)-(6) describe the 'Text2Gist' module, which takes h(t-1) and i(t) from the GRU as input. But in your code,
crnn_code = self.motion_content_rnn(motion_input, content_mean)
Thx
I have the same question; has anyone figured it out?
I am also confused about the RNN implementation part:
There are two GRU cells defined in the code, acting as the two layers of the proposed RNN model. One is a normal GRU (GRU-1) and the other (GRU-2) belongs to the Text2Gist cell.
As mentioned in the paper, GRU-1 takes the concatenated sentence and noise as input and outputs i_t: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L356
The GRU-2 code in Text2Gist (Equations 4-6) is pointed to in https://github.com/yitong91/StoryGAN/issues/15#issuecomment-586750565: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L307-L308
However, the GRU-2 code takes motion_input (the sentences) as input: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L355, which I think is inconsistent with Equations (3) and (4) in the paper, where the input to GRU-2 should be i_t (the output of GRU-1).
Also, for Equation (7), the input to the Filter should be i_t (the output of GRU-1), while in the code it is the output of GRU-2: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L366
So that is where my confusion comes from, though I might be wrong (please correct me if I am). I also wonder whether this code is intended for the toy data (i-CLEVR for StoryGAN) or is a different version of the "Text2Gist" part?
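For concreteness, here is a hedged sketch of the data flow as I read Equations (3)-(7) in the paper; all names and dimensions here are hypothetical, not taken from the repo. The point is that i_t from GRU-1 should feed both GRU-2 (Text2Gist) and the Filter:

```python
import torch
import torch.nn as nn

text_dim, noise_dim, hid = 32, 16, 32

gru1 = nn.GRUCell(text_dim + noise_dim, hid)  # GRU-1: sentence + noise -> i_t
gru2 = nn.GRUCell(hid, hid)                   # GRU-2 (Text2Gist, simplified to a plain GRU cell)
make_filter = nn.Linear(hid, hid)             # Eq. (7): Filter built from i_t

s_t = torch.randn(1, text_dim)    # sentence encoding
eps = torch.randn(1, noise_dim)   # Gaussian noise
g_prev = torch.zeros(1, hid)      # GRU-1 hidden state
h_prev = torch.zeros(1, hid)      # Text2Gist hidden state h(t-1)

i_t = gru1(torch.cat([s_t, eps], dim=1), g_prev)  # Eq. (3): i_t from GRU-1
h_t = gru2(i_t, h_prev)                           # Eqs. (4)-(6): GRU-2 consumes i_t, not the raw sentence
filt = make_filter(i_t)                           # Eq. (7): Filter takes i_t, not GRU-2's output
```

In the repo, by contrast, GRU-2 appears to consume motion_input directly and the Filter consumes GRU-2's output, which is the discrepancy described above.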
Any explanations would be appreciated :)
Hi. I was interested in multi-modal video generation tasks and came across your paper.
My issue is that I am having a hard time interpreting your code. For example, what are 'motion features' and 'content features'? They are not specified in the paper or in the code. If you have time, please add some documentation.
Where is this implementation? I might be misunderstanding; if someone understands the implementation, please tell me. Thx
Overall, a very interesting paper. Thanks!