Hey Akhil, that's great that you were able to get outputs! I'm actually getting similar results with my attempts, where there are "hints" of music but mostly noise. I think there might be unsolved technical challenges, so I'm currently stepping through the code and trying to identify issues. Right now my guess is that we're not properly learning the CLAP RVQ, so I'm trying some methods to increase codebook usage and lower the loss (bc90bb7). I also think there could be something wrong with the semantic tokens, so I'm training a kmeans checkpoint on more data to see if that remedies it.
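For anyone who wants to check this on their end, here's a minimal sketch of how codebook utilization can be measured. It assumes a vector-quantize-pytorch style ResidualVQ; the dims, depth, and codebook size are illustrative, not the repo's actual config:

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# toy RVQ standing in for the CLAP RVQ (illustrative hyperparameters)
rvq = ResidualVQ(dim=512, num_quantizers=12, codebook_size=1024)

# stand-in batch of CLAP clip embeddings: (batch, seq_len, dim)
embeddings = torch.randn(1024, 1, 512)
quantized, indices, commit_loss = rvq(embeddings)

# indices has shape (batch, seq_len, num_quantizers); count distinct codes
# per quantizer. Utilization far below 100% suggests codebook collapse.
for q in range(indices.shape[-1]):
    used = indices[..., q].unique().numel()
    print(f"quantizer {q}: {used}/1024 codes used ({used / 1024:.1%})")
```

If only a handful of codes are active per quantizer, the RVQ is effectively ignoring most of its capacity, which would explain noisy conditioning.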
Or maybe we just need to train for longer, who knows? 😅
I'm also planning on adding some sort of sampling during training so we can hear the model's outputs at different training steps, which could help identify which parts are broken.
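The sampling hook would look roughly like this; `generate_audio` is a hypothetical stand-in for whatever generation entry point each stage ends up exposing, not an actual function in this repo:

```python
import os
import torch
import torchaudio

SAMPLE_EVERY = 1000  # steps between audio dumps; arbitrary choice

def maybe_sample(step, model, prompt, sample_rate=24000):
    """Save a generated clip every SAMPLE_EVERY steps so we can listen in."""
    if step % SAMPLE_EVERY != 0:
        return
    os.makedirs("samples", exist_ok=True)
    model.eval()
    with torch.no_grad():
        # hypothetical generation helper; swap in the real entry point
        wave = generate_audio(model, prompt)  # (channels, num_samples)
    torchaudio.save(f"samples/step_{step}.wav", wave.cpu(), sample_rate)
    model.train()
```

Listening to checkpoints side by side should make it obvious whether a given stage is improving or has stalled.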
@zhvng Sounds great! I've begun training all the models again with a much higher step count to see if that improves things. Please let us know when you feel you're getting closer to eliminating the noise!
Really appreciate your work on this and happy to support you in any way possible!
@Saltb0xApps I left my previous models training overnight (without the changes in my last comment) and got the following results:
https://drive.google.com/drive/u/0/folders/1eMdW3CikhfkzSS0c7dQ1w381r6wWFsA2
(semantic stage 11k steps, coarse stage 9k steps, reconstruct coarse wave)
Although there is still a lot of noise, it seems that some association between text and audio is happening. One thing I noticed is that the samples tend to start out interesting and then degrade into noise after a few seconds. I am trying to figure out why this is the case.
I stopped training for now, but am curious what it would sound like after even more steps. Looking forward to hearing your results!
Thanks for your work here @zhvng 🙂
Looking forward to hearing the results after further training as well @Saltb0xApps! 🙂 I strongly suspect that you'll need many more steps than 4000 btw - more like hundreds of thousands. I'm considering starting up my own training effort on this as well.
@smcio I agree! That is definitely the case. Any training efforts would be appreciated! I also suspect that we need more data, as the paper mentions training the semantic and acoustic stages on 5 million audio clips, while FMA only has ~100k. I'm hoping we can get some preliminary results with a smaller model, however.
also @Saltb0xApps I've fixed a few bugs since you started training, including a very important fix that crops the audio to the same location when generating each type of token. It might be worth restarting with the latest code! A rough sketch of the idea is below.
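The gist of the fix, as a sketch (the tokenizer helpers here are hypothetical stand-ins for the semantic and acoustic tokenizers, not the repo's actual functions):

```python
import torch

def tokenize_pair(audio, crop_len):
    """Crop once, tokenize twice, so both token types see the same segment.

    Before the fix, each tokenizer could crop a different random window of
    the clip, so the semantic and acoustic tokens described different audio.
    """
    max_start = audio.shape[-1] - crop_len
    offset = torch.randint(0, max_start + 1, (1,)).item()  # one shared offset
    segment = audio[..., offset:offset + crop_len]
    # hypothetical helpers standing in for the real tokenizers
    return get_semantic_tokens(segment), get_acoustic_tokens(segment)
```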
@zhvng @smcio Thanks for the update, I'll restart training with the updated code. Do you think we should create a shared Google Drive, Discord server, or similar platform to track progress as we aim to train multiple stages with 100k+ steps each?
I don't think a larger dataset is a major issue; even a ~100k sample size should be enough to get decent results. The main difference is that Google's MusicLM can generate outputs for more diverse prompts, whereas a model trained on FMA would only work well for the kinds of prompts its data covers.
@Saltb0xApps Discord works! Here's an invite link to the newly created open musiclm server: https://discord.gg/jN8jADShX5
We can also talk and share results in the audio-generation channel of the LAION Discord, but I think having our own server will make it easier to communicate and track training progress.
finally closing because results seem to be decent now!
@zhvng Hi, I want to know what was key to getting the final result: more training steps, or some code bug fixes? By the way, I can't join the musiclm server: https://discord.gg/jN8jADShX5. Can you share it again? Thanks!
Hi @zhvng, I've been able to train all the models, generate the checkpoints, and run infer.py. The library outputs generated audio files now, and they have some "hints" of music, but it sounds like one of the models is broken.
Do you have any ideas or guidance on what I can do to improve the generated audio quality?
Info about the models I've trained for testing:
To get better audio quality, should I just continue training the models with more training steps, or are there unsolved technical challenges that need to be resolved before I can start retraining? I've linked the current state of the audio output below.
https://cassetteai.com/generations/gen_0.wav (chirping of birds and the distant echoes of bells)
https://cassetteai.com/generations/gen_1.wav (cat meowing)