yangdongchao / Text-to-sound-Synthesis

The source code of our paper "Diffsound: discrete diffusion model for text-to-sound generation"
http://dongchaoyang.top/text-to-sound-synthesis-demo/

Embedding shape issue #14


YoonjinXD commented 1 year ago

Hi, I'm trying to use the pretrained weights of the codebook trained on AudioSet with a size of 512, but I'm confused about which dimension parameters need to be changed. What should I change in 'Diffsound/evaluation/caps_text.yaml'?

Thank you.

YoonjinXD commented 1 year ago

I applied the two checkpoints you uploaded, '2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt' and 'diffsound_audiocaps.pth', and changed n_embed from 256 to 512. When I then ran generate_samples_batch.py, I got this error:

RuntimeError: Error(s) in loading state_dict for DALLE: size mismatch for content_codec.quantize.embedding.weight: copying a param with shape torch.Size([256, 256]) from checkpoint, the shape in current model is torch.Size([512, 256]).

If I instead leave n_embed at 256, I get this error:

RuntimeError: Error(s) in loading state_dict for VQModel: size mismatch for quantize.embedding.weight: copying a param with shape torch.Size([512, 256]) from checkpoint, the shape in current model is torch.Size([256, 256]).

How should I handle the parameters to use the pretrained weights with size 512?
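
For reference, here is a minimal sketch for checking which codebook size each checkpoint actually contains; it is not part of the repo, and the wrapper keys ('state_dict', 'model') as well as the exact checkpoint layout are only guesses based on this thread, so adjust as needed. The shapes it prints are exactly what the two errors above disagree about:

```python
import torch

def codebook_shape(path, suffix='quantize.embedding.weight'):
    """Print the shape of any codebook embedding tensor found in `path`.

    Hypothetical helper: the top-level wrapping of the checkpoint
    ('state_dict', 'model', or a flat dict) is guessed, so we unwrap the
    common cases and then just scan the keys.
    """
    ckpt = torch.load(path, map_location='cpu')
    state = ckpt
    if isinstance(ckpt, dict):
        state = ckpt.get('state_dict', ckpt.get('model', ckpt))
    for name, tensor in state.items():
        if name.endswith(suffix) and hasattr(tensor, 'shape'):
            print(f'{path}: {name} -> {tuple(tensor.shape)}')

# Expected from the errors above: (512, 256) for the codebook checkpoint
# and (256, 256) for the Diffsound checkpoint.
codebook_shape('2022-04-22T19-35-05_audioset_codebook512/checkpoints/last.ckpt')
codebook_shape('diffsound_audiocaps.pth')
```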

yangdongchao commented 1 year ago

> How should I handle the parameters to use the pretrained weights with size 512?

Hi, please use audioset_codebook256 for inference, because we only release the Diffsound model trained with the 256 codebook.
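
In other words, everything that sets the codebook size has to agree with the 256-entry audioset_codebook256 checkpoint. As a quick sanity check, the following sketch (not from the repo; it only assumes PyYAML and a nested dict/list config) walks caps_text.yaml and reports every n_embed it finds, all of which should read 256 for the released Diffsound model:

```python
import yaml

def find_key(node, key, path=''):
    """Recursively yield (dotted_path, value) for every occurrence of `key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            p = f'{path}.{k}' if path else str(k)
            if k == key:
                yield p, v
            yield from find_key(v, key, p)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from find_key(v, key, f'{path}[{i}]')

with open('Diffsound/evaluation/caps_text.yaml') as f:
    cfg = yaml.safe_load(f)

# For the released Diffsound model every reported n_embed should be 256.
for p, v in find_key(cfg, 'n_embed'):
    print(p, '=', v)
```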

jnwnlee commented 1 year ago

Do you mean that the model mentioned in readme.md,

> 2022/08/09 We upload the trained Diffsound model on the AudioCaps dataset, the baseline AR model, and the codebook trained on AudioSet with a size of 512. You can refer to https://pan.baidu.com/s/1R9YYxECqa6Fj1t4qbdVvPQ . The password is lsyr.
> If you cannot open the Baidu disk, please try the PKU disk: [https://disk.pku.edu.cn:443/link/DA2EAC5BBBF43C9CAB37E0872E50A0E4](https://disk.pku.edu.cn/link/DA2EAC5BBBF43C9CAB37E0872E50A0E4)
> More details will be updated as soon as possible.

which is trained on the AudioCaps data, also uses a codebook size of 256? (Or is the code simply not updated yet for the model above?)

yangdongchao commented 1 year ago

Yes, it also uses the 256 codebook. We provide the 512 codebook model, but we do not release a Diffsound model trained with the 512 codebook.