Open AvatarVerse3D opened 1 year ago
I agree, this seems like a bug. @flamehaze1115 could you please take a look?
Actually, I think it has nothing to do with the camera embedding, since this is attention between the two domains, not cross-attention to the conditioning info. But yeah, it's still a bug when using CFG.
I noticed that in pipeline_mvdiffusion_image.py, when running inference with 'do_classifier_free_guidance', the camera_embedding and img_embedding are repeated along the batch dimension, which makes sense. But the joint attention then joins the first half and the second half of the batch, which is probably wrong: the chunked key_0 and key_1 that get concatenated into one tensor actually correspond to the same domain embedding (i.e. normal is joined with normal and rgb with rgb), whereas they should correspond to the normal and rgb embeddings respectively. I'm not sure whether I'm mistaken here; what do you think?
repeat embedding:
```python
camera_embedding = torch.cat([camera_embedding, camera_embedding], dim=0)
```
joint attention:
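The joint-attention snippet isn't quoted above, but here is a minimal standalone sketch of the concern, not the repository's code: `num_views` and the stand-in tensors are invented for illustration, and the `[normal views | rgb views]` batch layout is assumed from the description above.

```python
import torch

num_views = 6  # hypothetical number of views per domain

# Stand-ins for per-view features of the two domains.
normals = torch.zeros(num_views, 4)  # normal-domain rows are all 0
rgbs = torch.ones(num_views, 4)      # rgb-domain rows are all 1

# Assumed layout: [normal views | rgb views] along the batch dimension.
batch = torch.cat([normals, rgbs], dim=0)

# Without CFG: chunking the batch in two separates the domains as intended.
half_0, half_1 = torch.chunk(batch, 2, dim=0)
assert (half_0 == 0).all() and (half_1 == 1).all()  # normals vs. rgbs

# With CFG, the batch is duplicated the same way camera_embedding is above.
cfg_batch = torch.cat([batch, batch], dim=0)  # [N | R | N | R]

# Now the two halves are identical copies: each contains both domains, and
# position i of half_0 pairs with position i of half_1 from the *same*
# domain, i.e. normal is joined with normal and rgb with rgb.
half_0, half_1 = torch.chunk(cfg_batch, 2, dim=0)
print(torch.equal(half_0, half_1))  # True
```

If this layout matches the pipeline, the CFG duplication would need to be interleaved per domain (or the chunking adjusted) for the joint attention to keep pairing normal with rgb.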