yukara-ikemiya / friendly-stable-audio-tools

Refactored / updated version of `stable-audio-tools`, an open-source codebase for audio/music generative models originally by Stability AI.

Training diffusion model resulting noise #4

Open kenkalang opened 4 weeks ago

kenkalang commented 4 weeks ago

I've been trying to train a diffusion model with the Stable Audio 1.0 config. I first trained the autoencoder with the Stable Audio 1.0 VAE config for 50k steps (autoencoder result), and then used that checkpoint as the pre-transform ckpt. However, after training the diffusion model for 100k steps, it still produces only noise (diffusion result). The dataset contains 2,500 hours of songs, I used the same CLAP model as you recommended, and I used DeepSpeed as my training strategy. Is there anything I might have missed in my training?

yukara-ikemiya commented 3 weeks ago

Hi, thank you for using my repository. I haven't tried Stable Audio 1.0 training, but I can help you find the cause. The training job should be saving logs to WandB. Could you share some logs from your training, such as reconstruction samples, the loss curve, and data_std?

Btw, some possible issues would be:

  1. The VAE checkpoint is not loaded correctly (a rough check for this is sketched below).
  2. Text prompts are not properly fed into the model during training.
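
For (1), here is a minimal sketch of a key-level check. The paths are placeholders, and the `state_dict` nesting and the way you get a handle on the pretransform module are assumptions on my side, not this repo's exact API:

```python
import torch
from torch import nn


def check_pretransform_load(pretransform: nn.Module, ckpt_path: str) -> None:
    """Report how many checkpoint keys actually match the pretransform module."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Lightning checkpoints usually nest the weights under "state_dict"
    state = ckpt.get("state_dict", ckpt)
    missing, unexpected = pretransform.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    # If nearly all keys end up missing/unexpected, the VAE weights never reach
    # the diffusion model and it effectively trains against random latents.
```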
kenkalang commented 3 weeks ago

Thanks for your response,

[image: training logs]

https://github.com/user-attachments/assets/38b2a1d9-ab18-43cf-9938-bc79e4fd8494

I did notice some irregularities in the training logs, but I was still optimistic about getting good output by 100k steps. I followed the steps outlined in the repository, and everything worked fine with the text prompts, which I confirmed by logging them. I do suspect that the VAE might not be loading correctly. How can I verify whether the VAE is being loaded properly?

yukara-ikemiya commented 3 weeks ago

The VAEs of Stable Audio 1.0 and 2.0 are slightly different, but when I tested training of the SA 2.0 VAE, the data_std tended to be around 0.9. So the data_std value logged in your run might be worth a closer look.

In my training code, reconstruction samples should be saved to WandB under the name Reconstruction (Pretransform), along with the logs above. Can you find them in your WandB logs?
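
If it's easier to check outside of WandB, here is a rough way to eyeball the latent scale directly. It assumes the trained autoencoder exposes an `encode()` method returning a latent tensor; the exact module API is an assumption, so adapt it to however the model is built:

```python
import torch


@torch.no_grad()
def latent_std(vae, audio_batch: torch.Tensor) -> float:
    """Encode a batch of audio and return the standard deviation of the latents."""
    vae.eval()
    latents = vae.encode(audio_batch)
    return latents.float().std().item()
```

A value far from the ~0.9 I saw for the SA 2.0 VAE (e.g. orders of magnitude smaller or larger) would mean the diffusion model is being trained on badly scaled latents.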

kenkalang commented 3 weeks ago

I deleted all my previous files since they weren't working anymore, so I retrained the VAE 2.0 following your README instructions. Here are the recon (pretransform) samples:

[original] https://github.com/user-attachments/assets/b5264eee-8d18-4d41-b87d-7f0c7195b51b

[recon]

https://github.com/user-attachments/assets/7a5eac79-f8cd-4ced-bafc-4912a3a9d219

My question is: could using the DeepSpeed strategy affect the results? Someone mentioned that when they used DeepSpeed, the inference results were noisy even though the demo sounded fine, but when they switched to DDP it worked perfectly. Lastly, may I know the configuration (batch size, etc.) and the device that you used?
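
For context, by switching strategies I just mean the Lightning-level change, something like the following (illustrative only; not this repo's actual launch code, and the device count is a placeholder):

```python
import pytorch_lightning as pl

# Illustration only: the Lightning-level difference between the two strategies.
trainer_ddp = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp")
trainer_deepspeed = pl.Trainer(accelerator="gpu", devices=8, strategy="deepspeed_stage_2")
```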

yukara-ikemiya commented 2 weeks ago

Sorry for the late reply. I believe DeepSpeed should not affect training results drastically. If DeepSpeed did change the results a lot, there might be a bug in checkpoint saving/loading?

Unfortunately, since I have never tried the DeepSpeed options in Lightning, I don't know whether the DeepSpeed implementation in stable-audio-tools has a problem.
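
One DeepSpeed-specific pitfall that might be worth ruling out (a guess on my side, not something I've confirmed in this repo): with ZeRO sharding, Lightning writes the checkpoint as a directory of shards, and it usually has to be consolidated into a single state dict before being reused elsewhere, e.g. as a pre-transform ckpt:

```python
# Consolidate a sharded DeepSpeed/ZeRO checkpoint into a single fp32 state dict,
# so that loading it later does not silently pick up partial or empty weights.
from pytorch_lightning.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

convert_zero_checkpoint_to_fp32_state_dict(
    "path/to/deepspeed_ckpt_dir",  # hypothetical sharded checkpoint directory
    "consolidated_fp32.ckpt",      # single-file output usable as a normal checkpoint
)
```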

Regarding the reconstruction sample you pasted: from my experience with VAE training, the quality should improve if the model is trained for a longer time.