KyonP closed this issue 1 year ago
@KyonP yeah, this line provides all images to the model during training to perform teacher forcing
Oh, I see. 😄
So, when generating the 3rd image of the story sequence, ARLDM is given both the 1st and 2nd raw (true) images? (And the 1st to 3rd raw images to generate the 4th image?)
Or, going by the "teacher forcing" you mentioned, are the 1st and 2nd raw images used to properly generate the 2nd image, which is then given to synthesize the 3rd image?
Is my understanding correct?
@KyonP Yeah, exactly!
The second point is a bit different, though: we use the 1st and 2nd raw images to properly generate the 3rd image.
You can refer to Section 3.2 and Figure 2(a) of our paper: https://arxiv.org/abs/2211.10950
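To make the conditioning concrete, here is a minimal sketch of which images condition which frame under teacher forcing. It is only an illustration of the idea described above, not code from the repo: `teacher_forcing_context` and the `frames` list are hypothetical names, and it assumes the first frame has no earlier images to condition on.

```python
def teacher_forcing_context(true_frames, target_index):
    """Return the raw (ground-truth) frames that condition the target frame.

    target_index is 0-based, so target_index=2 means the 3rd image.
    """
    return true_frames[:target_index]


# Hypothetical placeholders for the five raw images of one story.
frames = ["img1", "img2", "img3", "img4", "img5"]

for k in range(len(frames)):
    context = teacher_forcing_context(frames, k)
    print(f"generate {frames[k]} conditioned on raw images {context}")
```

With this scheme the 3rd image is conditioned on the 1st and 2nd raw images, never on previously generated ones, which is what distinguishes teacher forcing during training from feeding back the model's own outputs.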
Thanks. 😅
So, during the generation of the 3rd image (in Figure 2(a)), the 2nd raw image is given to the auto-regressive process (the long arrow on the right-hand side) as the teacher-forcing input?
Can you give me a link to where the teacher forcing occurs? Maybe within the Unet?
BTW, thank you for your speedy reply, I didn't expect it 😄
@KyonP Yes. The teacher forcing is implemented through an attention mask: https://github.com/xichenpan/ARLDM/blob/5b03fc4cf78d6509620506a6ca1bd799d6bd9ad4/main.py#L218-L221 which is passed into the Unet here: https://github.com/xichenpan/ARLDM/blob/5b03fc4cf78d6509620506a6ca1bd799d6bd9ad4/main.py#L231
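As a rough illustration of mask-based teacher forcing (a pure-Python sketch under stated assumptions, not the code at the links above): all raw frames are passed in at once, and a triangular mask decides which source frames each target frame may attend to. The names `causal_frame_mask` and `expand_to_tokens` are hypothetical, and the convention here is that `True` means "visible"; the actual code may use the opposite convention or a different shape. In PyTorch one would typically build such a mask with `torch.tril` or `torch.triu`.

```python
def causal_frame_mask(num_frames):
    """mask[i][j] is True when target frame i may attend to source frame j.

    Only strictly earlier frames (j < i) are visible, so a frame never
    sees itself or any later frame.
    """
    return [[j < i for j in range(num_frames)] for i in range(num_frames)]


def expand_to_tokens(frame_mask, tokens_per_frame):
    """Repeat each frame-level entry so all tokens of a source frame share it."""
    return [
        [visible for visible in row for _ in range(tokens_per_frame)]
        for row in frame_mask
    ]


mask = causal_frame_mask(5)
for row in mask:
    # One row per target frame, one column per source frame.
    print("".join("1" if visible else "0" for visible in row))

token_mask = expand_to_tokens(mask, tokens_per_frame=2)
```

Row 3 of the printed matrix reads `11000`: the 3rd image attends to the 1st and 2nd raw images only, even though all five were given to the model.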
thanks, I will look into it! 👍
I was trying to integrate my own model with your code, and I wondered whether the current dataset code for Pororo is working correctly.
I haven't looked into the code and paper thoroughly, especially the modified Unet, so I thought I'd rather ask you a question. 😅
As in this line, source_images gets all the images (the first to the fifth). After that, they are given to the model sequentially by masking in main.py. square_mask is a matrix filled with 1s in a triangle. I assume this eventually gives all the encoded source images and source captions to the Unet.
Am I correct?