Open AvatarVerse3D opened 1 year ago
I agree, this seems like a bug. @flamehaze1115 could you please take a look?
Actually, I think it has nothing to do with the camera embedding, since this is attention between the two domains, not cross-attention to the conditioning info. But yeah, it's still a bug when using CFG.
I noticed that in pipeline_mvdiffusion_image.py, when running inference with 'do_classifier_free_guidance', the camera_embedding and img_embedding are repeated along the batch dimension, which makes sense. But the joint attention then joins the first half and the second half of the batch, which is probably wrong: the chunked key_0 and key_1 that get concatenated into one tensor actually correspond to the same domain embedding (i.e. normal is joined with normal and rgb with rgb), whereas they should correspond to the normal and rgb embeddings respectively. I'm not sure whether I'm mistaken here; what do you think?
repeat embedding:
```python
camera_embedding = torch.cat([camera_embedding, camera_embedding], dim=0)
```
joint attention:
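The joint-attention snippet isn't quoted above, but here is a minimal standalone sketch of the concern, not the repository's code: `num_views` and the stand-in tensors are invented for illustration, and the `[normal views | rgb views]` batch layout is assumed from the description above.

```python
import torch

num_views = 6  # hypothetical number of views per domain

# Stand-ins for per-view features of the two domains.
normals = torch.zeros(num_views, 4)  # normal-domain rows are all 0
rgbs = torch.ones(num_views, 4)      # rgb-domain rows are all 1

# Assumed layout: [normal views | rgb views] along the batch dimension.
batch = torch.cat([normals, rgbs], dim=0)

# Without CFG: chunking the batch in two separates the domains as intended.
half_0, half_1 = torch.chunk(batch, 2, dim=0)
assert (half_0 == 0).all() and (half_1 == 1).all()  # normals vs. rgbs

# With CFG, the batch is duplicated the same way camera_embedding is above.
cfg_batch = torch.cat([batch, batch], dim=0)  # [N | R | N | R]

# Now the two halves are identical copies: each contains both domains, and
# position i of half_0 pairs with position i of half_1 from the *same*
# domain, i.e. normal is joined with normal and rgb with rgb.
half_0, half_1 = torch.chunk(cfg_batch, 2, dim=0)
print(torch.equal(half_0, half_1))  # True
```

If this layout matches the pipeline, the CFG duplication would need to be interleaved per domain (or the chunking adjusted) for the joint attention to keep pairing normal with rgb.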