Models working, but audio quality is not good.

zhvng / open-musiclm

Implementation of MusicLM, a text to music model published by Google Research, with a few modifications.

https://arxiv.org/abs/2301.11325

MIT License

522 stars, 59 forks

Models working, but audio quality is not good. #3

Closed Saltb0xApps closed 1 year ago

Saltb0xApps commented 1 year ago

Hi Zhvng, I've been able to train all the models, generate the checkpoints and run infer.py. The library is able to output generated audio files now, and it has some "hints" of music, but it sounds like one of the models is broken.

Do you have any ideas or guidance on what I can do to improve the generated audio quality?

Info about the models I've trained for testing -

clap_rvq - 950 steps
Hubert Kmeans - Completed training. Stopped automatically due to a lack of convergence at 60.
Semantic - 6000 steps
Fine - 4000 steps
Coarse - 4000 steps

These models were trained on the fma_large dataset.

To get better audio quality, should I just continue training the models with more training steps, or are there unsolved technical challenges that need to be resolved before I can start retraining? I've linked the current state of the audio output below.

https://cassetteai.com/generations/gen_0.wav (chirping of birds and the distant echos of bells)

https://cassetteai.com/generations/gen_1.wav (cat meowing)

zhvng commented 1 year ago

Hey Akhil, that's great that you were able to get outputs! I'm actually getting similar results with my attempts, where there are "hints" of music but mostly noise. I think there might be unsolved technical challenges, so I am currently stepping through the code and trying to identify issues. Right now I'm guessing that we're not properly learning the clap RVQ, so I'm trying some methods to increase codebook usage & lower the loss (bc90bb7). I also think there could be something wrong with the semantic tokens so I'm training a kmeans checkpoint with more data to see if that will remedy it.
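One rough way to check whether the RVQ is using its codebooks is to count how many distinct codes get selected at each quantizer level. The following is only a minimal sketch: it assumes the quantizer indices have already been collected as one tuple per frame, and the function name and data layout are hypothetical, not part of open-musiclm.

```python
import math
from collections import Counter


def codebook_usage(indices, codebook_size):
    """Per-quantizer usage statistics for RVQ code indices.

    indices: list of frames, each a sequence of ints (one index per quantizer).
             This layout is an assumption made for illustration.
    Returns one (fraction_of_codes_used, perplexity) pair per quantizer level.
    """
    if not indices:
        return []
    num_quantizers = len(indices[0])
    stats = []
    for q in range(num_quantizers):
        counts = Counter(frame[q] for frame in indices)
        total = sum(counts.values())
        # Fraction of the codebook that was selected at least once.
        used = len(counts) / codebook_size
        # Perplexity = exp(entropy); it equals codebook_size when usage is uniform.
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        stats.append((used, math.exp(entropy)))
    return stats
```

A low used fraction or a perplexity far below the codebook size at some level would suggest that level has collapsed onto a few codes.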

Or maybe we just need to train for longer, who knows? 😅

zhvng commented 1 year ago

I'm also planning on adding some sort of sampling during training so we can hear the model outputs at different training steps, which could help identify which parts are broken
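Sampling during training could look roughly like the loop below. This is a sketch under stated assumptions: `train_step`, `generate_sample` and `save_sample` are hypothetical callables supplied by the caller, not functions from this repo, and the default interval of 1000 steps is arbitrary.

```python
def train_with_sampling(train_step, generate_sample, save_sample,
                        num_steps, sample_every=1000):
    """Run num_steps training steps and save a generated sample periodically.

    train_step(step):         performs one optimization step (hypothetical).
    generate_sample():        returns generated audio from the current weights (hypothetical).
    save_sample(audio, step): writes the audio somewhere, tagged by step (hypothetical).
    Returns the list of steps at which a sample was saved.
    """
    saved_at = []
    for step in range(1, num_steps + 1):
        train_step(step)
        if step % sample_every == 0:
            audio = generate_sample()
            save_sample(audio, step)
            saved_at.append(step)
    return saved_at
```

Listening to the saved samples in step order, one stage at a time, would make it easier to tell which stage stops improving.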

Saltb0xApps commented 1 year ago

@zhvng Sounds great! I've begun training all the models again with a much higher step count to see if that improves things. Please let us know when you think you're getting closer to reducing the noise!

Really appreciate your work on this and happy to support you in any way possible!

zhvng commented 1 year ago

@Saltb0xApps I left my previous models training overnight (without the changes in my last comment) and got the following results:

https://drive.google.com/drive/u/0/folders/1eMdW3CikhfkzSS0c7dQ1w381r6wWFsA2

(semantic stage 11k steps, coarse stage 9k steps, reconstruct coarse wave)

Although there is still a lot of noise, it seems that some association between text and audio is happening. One thing I noticed is that the samples tend to start out interesting and then degrade into noise after a few seconds. I am trying to figure out why this is the case.

I stopped training for now, but am curious what it would sound like after even more steps. Looking forward to hearing your results!

smcio commented 1 year ago

Thanks for your work here @zhvng 🙂

Looking forward to hearing the results after further training as well @Saltb0xApps! 🙂 I strongly suspect that you'll need many more steps than 4000 btw - more like hundreds of thousands. I'm considering starting up my own training effort on this as well.

zhvng commented 1 year ago

@smcio I agree, that is definitely the case! Any training efforts would be appreciated. I also suspect that we need more data: the paper mentions training the semantic and acoustic stages on 5 million audio clips, while FMA only has ~100k. I'm hoping we can get some preliminary results with a smaller model, however.

zhvng commented 1 year ago

Also @Saltb0xApps, I've fixed a few bugs since you started training, including a very important one: the audio is now cropped at the same location when generating the tokens. It might be worth restarting with the latest code!
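To illustrate why the crop location matters: if each tokenizer sees a different random window of the same clip, the token sequences no longer describe the same audio. A minimal sketch of aligned cropping follows, assuming the clip has already been resampled once per tokenizer; the function name and signature are hypothetical and this is not the repo's actual fix.

```python
import random


def aligned_crops(waves, sample_rates, crop_seconds, rng=random):
    """Crop the same time window from several resampled copies of one clip.

    waves:        list of sample sequences, the same audio at different rates.
    sample_rates: matching sample rates in Hz.
    crop_seconds: window length in seconds.
    Returns (start_seconds, list_of_crops).
    """
    duration = min(len(w) / sr for w, sr in zip(waves, sample_rates))
    if crop_seconds > duration:
        raise ValueError("clip is shorter than the requested crop")
    # Draw ONE start time, then convert it to a sample offset per sample rate.
    start_seconds = rng.uniform(0.0, duration - crop_seconds)
    crops = []
    for w, sr in zip(waves, sample_rates):
        start = int(start_seconds * sr)
        length = int(crop_seconds * sr)
        crops.append(w[start:start + length])
    return start_seconds, crops
```

The key design choice is drawing the offset in seconds rather than in samples, so the same window is selected regardless of each tokenizer's sample rate.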

Saltb0xApps commented 1 year ago

@zhvng @smcio Thanks for the update, I'll restart training with the updated code. Do you think we should create a shared Google Drive folder, a Discord server, or a similar platform to track progress as we aim to train multiple stages with 100k+ steps each?

Saltb0xApps commented 1 year ago

I don't think dataset size would be a major issue, since even a 100k sample size should be good enough to get some decent results. The main difference would be that Google's MusicLM can generate outputs for more diverse prompts, whereas a model trained on FMA would only work for a narrower set of prompts?

zhvng commented 1 year ago

@Saltb0xApps discord works! here's an invite link to the newly created open musiclm server: https://discord.gg/jN8jADShX5

We can also talk and share results in the audio-generation channel in the LAION discord, but I think that having our own server would make it easier to communicate & track training progress

zhvng commented 1 year ago

finally closing because results seem to be decent now!

lvZic commented 1 year ago

finally closing because results seem to be decent now!

@zhvng Hi, I'd like to know what was key to getting the final result: more training epochs, or fixing some code bugs? By the way, I can't join the musiclm server: https://discord.gg/jN8jADShX5. Could you share the invite again? Thanks!