xichenpan / ARLDM

Official Pytorch Implementation of Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models
https://arxiv.org/abs/2211.10950
MIT License

source images contain not only the first image? #14

Closed KyonP closed 1 year ago

KyonP commented 1 year ago

I was trying to integrate my own model with your code, and I wondered whether the current dataset code for Pororo is working correctly.

I haven't looked into the code and paper thoroughly, especially the modified Unet, so I thought I'd rather ask you directly. 😅

As in this line, source_images gets all the images (the first through the fifth). After that, they are fed in sequentially via the masking in main.py.

square_mask is a matrix filled with 1s in a triangular pattern. I assume this eventually gives all the encoded source images and source captions to the Unet.

Am I correct?
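For readers following along, here is a minimal sketch of what a triangular frame mask looks like. It is plain Python rather than the repository's PyTorch code. The function name `triangular_mask`, the convention that 1 means "visible", and the choice to expose only strictly earlier frames are all assumptions made for illustration. The real `square_mask` in main.py may be oriented or inverted differently, and captions are not modeled here.

```python
def triangular_mask(num_frames):
    """Return a num_frames x num_frames matrix of 0/1 ints.

    Row i is the frame being generated; column j is a source frame.
    Assumed convention (not taken from the repo): 1 = visible, and
    only strictly earlier frames (j < i) are visible.
    """
    return [[1 if j < i else 0 for j in range(num_frames)]
            for i in range(num_frames)]


# Five frames per story, as in the question above.
mask = triangular_mask(5)
for row in mask:
    print(row)
```

Each row lists which source frames the corresponding target frame may look at, so all five images can be loaded at once while every target still sees only part of the sequence.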

xichenpan commented 1 year ago

@KyonP yeah, this line provides all images to the model during training to perform teacher forcing

KyonP commented 1 year ago

Oh, I see. 😄

So, when generating the 3rd image of the story sequence, ARLDM is given both the 1st and 2nd raw (true) images? (And the 1st to 3rd raw images to generate the 4th image?)

Going by your mention of "teacher forcing," I assume the 1st and 2nd raw images are used to properly generate the 2nd image, which is then used to synthesize the 3rd image.

Is my understanding correct?

xichenpan commented 1 year ago

@KyonP Yeah, exactly!

xichenpan commented 1 year ago

The second point is a bit different, though. We use the 1st and 2nd raw images to properly generate the 3rd image.
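The distinction can be sketched in a few lines of plain Python. This is an illustration of teacher forcing in general, not code from the repository: `conditioning_frames` and the string placeholders are hypothetical, and the inference branch (conditioning on the model's own earlier outputs) is the usual auto-regressive behaviour rather than something stated in this thread.

```python
def conditioning_frames(target_index, ground_truth, generated, teacher_forcing):
    """Return the earlier frames used as context for frame `target_index` (0-based).

    With teacher forcing, the context is the ground-truth history.
    Without it, the context is whatever the model has generated so far.
    """
    history = ground_truth if teacher_forcing else generated
    return list(history[:target_index])


ground_truth = ["gt_1", "gt_2", "gt_3", "gt_4", "gt_5"]
generated = ["gen_1", "gen_2"]  # hypothetical frames produced so far at inference

# Context for the 3rd image (index 2) in each regime.
train_context = conditioning_frames(2, ground_truth, generated, teacher_forcing=True)
infer_context = conditioning_frames(2, ground_truth, generated, teacher_forcing=False)
print(train_context)
print(infer_context)
```

During training, the 3rd image is therefore conditioned on the raw 1st and 2nd images directly, not on a regenerated 2nd image.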

xichenpan commented 1 year ago

You can refer to Section 3.2 and Figure 2.a of our paper. https://arxiv.org/abs/2211.10950

KyonP commented 1 year ago

Thanks. 😅

So, during the generation of the 3rd image (in Figure 2.a), the 2nd raw image is given to the auto-regressive process (the long arrow on the right-hand side) for teacher forcing?

Can you give me a link to where the teacher forcing occurs? Maybe within the Unet?

BTW, thank you for your speedy reply, I didn't expect it 😄

xichenpan commented 1 year ago

@KyonP Yes. The teacher forcing is implemented through the attention mask: https://github.com/xichenpan/ARLDM/blob/5b03fc4cf78d6509620506a6ca1bd799d6bd9ad4/main.py#L218-L221 which is passed into the Unet here: https://github.com/xichenpan/ARLDM/blob/5b03fc4cf78d6509620506a6ca1bd799d6bd9ad4/main.py#L231
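As a rough illustration of how an attention mask can enforce this, the sketch below applies a visibility mask to one row of attention scores before the softmax. It is plain Python, not the linked main.py code: `masked_softmax`, the example scores, and the "True means visible" convention are assumptions, and the Unet's actual cross-attention may apply the mask differently.

```python
import math


def masked_softmax(scores, visible):
    """Softmax over `scores`, giving zero weight where `visible` is False."""
    if not any(visible):
        # Nothing to attend to: return all-zero weights instead of dividing by zero.
        return [0.0] * len(scores)
    masked = [s if v else float("-inf") for s, v in zip(scores, visible)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores of the 3rd frame over five source frames.
scores = [0.5, 1.0, 2.0, 0.1, 0.3]
# Only the 1st and 2nd frames are visible when generating the 3rd.
visible = [True, True, False, False, False]
weights = masked_softmax(scores, visible)
print(weights)
```

Masked positions receive exactly zero weight, so the ground-truth frames from later in the story cannot influence the frame being generated even though they are present in the batch.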

KyonP commented 1 year ago

thanks, I will look into it! 👍