yashkant / spad

Code for SPAD : Spatially Aware Multiview Diffusers, CVPR 2024
https://yashkant.github.io/spad/

add spad to hf hub #9

Open jadechoghari opened 1 month ago

jadechoghari commented 1 month ago

Hey Yash @yashkant, congrats on SPAD, it's a very cool model!! You should consider adding SPAD to the Hugging Face Hub 🤗. Doing so could increase the model's visibility and make it easier for everyone to use, and it could easily be made compatible with the Hugging Face Transformers library.

I can work on that!

Best, Jade

yashkant commented 1 month ago

hi jade, thanks for your suggestion.

i believe spad would be a more suitable candidate for the diffusers library, and could be integrated similarly to zero123 (https://github.com/huggingface/diffusers/issues/4096).

i am currently at meta, and do not have sufficient bandwidth to work on this. however, if you'd like to take it on, i'd be happy to discuss feasibility, review PRs, etc.

jadechoghari commented 1 month ago

Yes, sounds good. I'll see which library is best and will do the work (the HF team can help us)! If I encounter any issues or have questions during the implementation, I'll let you know :)

jadechoghari commented 1 month ago

Small update: working on it! 🤗 https://huggingface.co/jadechoghari/spad

Any idea which AutoencoderKL architecture you used for the VAE? I'm renaming keys and hitting some key errors, which I'm in the process of fixing: https://huggingface.co/jadechoghari/spad/tree/main/vae?show_file_info=vae%2Fdiffusion_pytorch_model.safetensors

Best.

yashkant commented 1 month ago

so, i used stable diffusion v1.5 for experiments (https://github.com/CompVis/stable-diffusion).

i think there should be a script somewhere that translates from compvis to diffusers.

can you try this one perhaps: https://github.com/huggingface/diffusers/issues/3264#issuecomment-1527335827
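
for reference, a rough sketch of calling that converter directly; kwarg names can shift a bit between diffusers versions, and the paths here are placeholders:

from diffusers.pipelines.stable_diffusion.convert_from_ckpt import (
    download_from_original_stable_diffusion_ckpt,
)

# hypothetical paths -- point these at the compvis checkpoint and its yaml
pipe = download_from_original_stable_diffusion_ckpt(
    checkpoint_path="spad.ckpt",
    original_config_file="spad_config.yaml",
)
pipe.save_pretrained("./spad-diffusers")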

thx!

jadechoghari commented 1 month ago

Yup, successfully converted. Thanks!

jadechoghari commented 1 month ago

Any thoughts on the UNet conversion script? Which UNet architecture did you use, specifically? I'm obviously still guessing it's the LDM one :)

Thanks.

jadechoghari commented 4 weeks ago

Though convert_original_stable_diffusion_to_diffusers.py is the fastest and most efficient conversion script 🤗👇, I am encountering the following error with both checkpoints:

Traceback (most recent call last):
  File "/content/convert.py", line 160, in <module>
    pipe = download_from_original_stable_diffusion_ckpt(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py", line 1482, in download_from_original_stable_diffusion_ckpt
    set_module_tensor_to_device(unet, param_name, "cpu", value=param)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 362, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([320, 326, 1, 1]) in "weight" (which has shape torch.Size([320, 320, 1, 1])), this looks incorrect.

It might be a configuration error (SPAD's yaml files, or some change in the architecture?). Any ideas? The mismatched input channel count goes from 320 to 326 (+6).

Thanks!

yashkant commented 4 weeks ago

so spad uses a custom unet model, which is one of the main contributions of our work: https://github.com/yashkant/spad/blob/main/spad/mv_unet.py

you can load and test this unet separately here: https://github.com/yashkant/spad/blob/main/spad/mv_unet.py#L166

i don't think it would work natively with the existing diffusers conversion scripts; it may just require adding a new file to get it working.

jadechoghari commented 4 weeks ago

Yes, exactly. We'll create a custom one!

jadechoghari commented 3 weeks ago

Small updates...

Testing is complete, but I'm getting random images. I suspect the issue is with how the UNet is used in the newly added diffusers SPAD pipeline; specifically, there might be problems with value initialization (e.g., shapes).

Here’s the code for reference: pipeline_spad.py

A quick check on how I’m initializing the values for latents, epi_constraint_masks, plucker_embeds, and context would be appreciated, especially as we’re approaching the end.
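
For concreteness, here is roughly how I'm wiring the latents and plucker embeds together; shapes are illustrative and the variable names are mine, not necessarily the repo's:

import torch

n, v = 2, 8                                    # batch size, number of views
latents = torch.randn(n, v, 4, 32, 32)         # per-view SD latents
plucker_embeds = torch.randn(n, v, 6, 32, 32)  # 6-channel ray embeddings
unet_in = torch.cat((latents, plucker_embeds), dim=2)  # -> (2, 8, 10, 32, 32)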

Small note: I’m using DDIMSampler instead of MultiviewSampler. Does this affect performance significantly? (will update it later)

Thanks !

yashkant commented 3 weeks ago

hi jade, great progress!

> A quick check on how I’m initializing the values for latents, epi_constraint_masks, plucker_embeds, and context would be appreciated, especially as we’re approaching the end.

sure, will take a closer look at the script over the weekend.

can you share what kind of images you're seeing generated currently?

> Small note: I’m using DDIMSampler instead of MultiviewSampler. Does this affect performance significantly? (will update it later)

yes, it does! the multiview sampler denoises many views in a single pass (with an additional views axis in the denoising tensor): here

w/o multiview sampler you may not be generating multiple views of one object, but rather just separate views from different objects.
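
a shape-level sketch of the difference, with toy stand-ins rather than the actual sampler code:

import torch

def unet(x, t):
    return torch.zeros_like(x)   # pretend noise prediction

def ddim_step(noise_pred, t, x):
    return x - noise_pred        # pretend update rule

n, v, c, h, w = 1, 8, 4, 32, 32
timesteps = range(999, -1, -50)

# per-view ddim: each view is denoised independently, no cross-view coupling
views = [torch.randn(n, c, h, w) for _ in range(v)]
for t in timesteps:
    views = [ddim_step(unet(x, t), t, x) for x in views]

# multiview sampling: one pass over a (n, v, c, h, w) tensor, so the unet's
# cross-view attention can keep all v views consistent with each other
latents = torch.randn(n, v, c, h, w)
for t in timesteps:
    latents = ddim_step(unet(latents, t), t, latents)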

jadechoghari commented 3 weeks ago

@yashkant sure! It looks random to me (screenshot attached): Screenshot 2024-08-24 at 7 52 19 PM

yashkant commented 2 weeks ago

yep, it seems like the denoising (ddim sampling) is broken.

i had a look at pipeline_spad.py; it seems like we need to ensure the samplers match.

to do this, can you run inference with both spad and the pipeline code, and check that the inputs to the model are identical at the two places below:

the context (text embedding, camera embedding and epipolar masks) should be identical.

another note: during denoising, the latents are initialized with random noise, which can differ between the two sides. you can lock the random noise by saving it to a numpy file and loading it on both sides (for matching exact outputs).
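
something like this, with illustrative shapes:

import numpy as np
import torch

# one side: draw the initial noise once and save it
noise = torch.randn(1, 8, 4, 32, 32)
np.save("init_noise.npy", noise.numpy())

# both sides: load the same noise instead of calling torch.randn, so the
# two pipelines start denoising from identical latents
noise = torch.from_numpy(np.load("init_noise.npy"))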

jadechoghari commented 2 weeks ago

Thank you! I'll have a look and let you know!

jadechoghari commented 2 weeks ago

After looking into the code more, it seems we might have an issue with the Plucker embedding. Do you update the Plucker embedding after each time step? How can we generate the Plucker embedding? How crucial are the pluckers? Thanks!

yashkant commented 2 weeks ago

> Do you update the Plucker embedding after each time step?

nope, it stays the same for each denoising step, and the same goes for the epipolar_mask. only the timestep should change.

you can confirm this bit in the sampler: here.

> How can we generate the Plucker embedding?

it is computed at this point in the inference pipeline: here. if you match the inputs to that function correctly, the generated epipolar mask + plucker should be identical on both sides.
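
for intuition, the generic construction looks roughly like this; a sketch of standard plucker ray coordinates, not the exact function from the repo:

import torch
import torch.nn.functional as F

def plucker_embedding(origins, dirs):
    # origins, dirs: (..., 3) per-pixel camera ray origins and directions
    d = F.normalize(dirs, dim=-1)          # unit ray directions
    m = torch.cross(origins, d, dim=-1)    # moment vectors o x d
    # 6 channels per ray -- matches the +6 input channels seen earlier
    return torch.cat((m, d), dim=-1)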

> How crucial are the pluckers?

quite crucial! since we want the model to be spatially aware, we inject 3d information into it. this is done using two things: epipolar attention masking + plucker embeddings.
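
if it helps, a generic sketch of what attention masking does (illustrative only, not the spad implementation):

import torch

q = torch.randn(2, 16, 64)               # (batch, tokens, dim)
k = torch.randn(2, 16, 64)
v = torch.randn(2, 16, 64)

# stand-in epipolar mask; True = this pair of tokens may attend
mask = torch.rand(2, 16, 16) > 0.5
mask |= torch.eye(16, dtype=torch.bool)  # keep self-attention valid

scores = (q @ k.transpose(-2, -1)) / 64 ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))  # block off-epipolar pairs
out = torch.softmax(scores, dim=-1) @ v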

jadechoghari commented 2 weeks ago

Thanks for your help. I checked everything and saved the inputs to .npy files as you specified. However, I am still getting different results. I suspect the issue might be with the UNet. What do you think?

The VAE is working nicely, and I have tested it. Here is what I have done for the latent input:

  1. Initialized x either randomly or using the following code:

import torch
from einops import rearrange

# fill every view with the same gaussian blob
# (get_gaussian_image is the helper used in the original snippet)
x = torch.zeros(2, 8, 256, 256, 3)              # (n, v, h, w, c)
blob = get_gaussian_image(sigma=0.5)
x[:, :] = blob

# channels-first, with the views axis folded into the batch for the VAE
x = rearrange(x, "n v h w c -> (n v) c h w").to(torch.float32)

# encode into latents, then unfold the views axis again
latent_dist = vae.encode(x).latent_dist
z = latent_dist.sample()                        # (n*v, 4, 32, 32)
z = rearrange(z, "(n v) c h w -> n v c h w", v=8)

I also tried initializing the latents randomly.

  2. Concatenated z and the plucker embeds along the channel axis and passed latents of shape (2, 8, 10, 32, 32) to the UNet:

latents = torch.cat((z, plucker), dim=2)  # (2, 8, 4 + 6, 32, 32)

I am still getting one of the following two results (see the attached screenshots):

Since the inputs are the same, the UNet might not be denoising correctly. If you have time to take a look at the current mv_unet.py on Hugging Face (https://huggingface.co/jadechoghari/spad/blob/main/unet/mv_unet.py), that would be great! If you don't find any issues in the code, I can reach out to others at Hugging Face for help. It's almost done! :) 🚀

yashkant commented 2 weeks ago

ack! i will take a look, and get back on/before monday.

jadechoghari commented 2 weeks ago

awesome!