zhvng / open-musiclm

Implementation of MusicLM, a text to music model published by Google Research, with a few modifications.
https://arxiv.org/abs/2301.11325
MIT License
511 stars, 59 forks

Hubert args normalization #7

Closed: LWprogramming closed 1 year ago

LWprogramming commented 1 year ago

I looked at the MERT example and noticed that it actually preprocesses the input. The code looks really convoluted, but for batch size 1 the net effect is that it normalizes the input to zero mean and unit variance instead of passing it in directly.

Note that you can use the processor provided in the example if you want. I decided not to in my case because my wav_input is already on CUDA, and the transformers processor converts everything to numpy and therefore moves it to CPU 😞, resulting in expensive copies as data moves back and forth. I wasn't sure which approach you've got here :))
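
For reference, here is a minimal sketch of doing that normalization directly on the tensor so nothing leaves the GPU. The function name `zero_mean_unit_var` and the `eps` value are assumptions for illustration, not taken from the MERT example or this repo, and it only covers unpadded inputs of equal length.

```python
import torch


def zero_mean_unit_var(wav: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Normalize each waveform in a (batch, samples) tensor on its own device."""
    mean = wav.mean(dim=-1, keepdim=True)
    # Population variance (unbiased=False) matches numpy's default var().
    var = wav.var(dim=-1, unbiased=False, keepdim=True)
    # eps is an assumed small constant that guards against division by zero.
    return (wav - mean) / torch.sqrt(var + eps)


# Falls back to CPU when CUDA is unavailable; the math is identical.
device = "cuda" if torch.cuda.is_available() else "cpu"
wav_input = torch.randn(2, 16000, device=device) * 0.3 + 0.1
normalized = zero_mean_unit_var(wav_input)
```

The output stays on the same device as `wav_input`, so there is no round trip through numpy.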

zhvng commented 1 year ago

Hey @LWprogramming, thanks for taking a look at the code! The input is normalized to zero mean and unit variance when loading the data here. The normalize argument is set when initializing the datasets in trainer. This normalizes the data before it is cropped.

Alternatively, we could normalize the cropped input right before passing it into MERT. I'm not sure which would be better, but normalizing at the beginning made more sense to me.
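
To make the difference between the two orderings concrete, here is a rough sketch using plain Python lists as a stand-in for audio. The `normalize` helper and the synthetic clip are hypothetical and not code from this repo. When the clip's statistics vary over time, a crop taken from an already-normalized clip generally no longer has zero mean and unit variance, while a crop normalized on its own does.

```python
import random
import statistics


def normalize(samples):
    """Scale a list of samples to zero mean and unit (population) variance."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return [(s - mean) / std for s in samples]


random.seed(0)
# Hypothetical clip: a quiet first half followed by a louder, offset second half.
quiet = [random.gauss(0.0, 0.1) for _ in range(1000)]
loud = [random.gauss(0.5, 1.0) for _ in range(1000)]
clip = quiet + loud

# Option 1: normalize the whole clip, then crop the quiet half.
crop_of_normalized = normalize(clip)[:1000]
# Option 2: crop first, then normalize only the crop.
normalized_crop = normalize(clip[:1000])

print(statistics.fmean(crop_of_normalized), statistics.pstdev(crop_of_normalized))
print(statistics.fmean(normalized_crop), statistics.pstdev(normalized_crop))
```

Which option is preferable likely depends on how the model was trained, so this only shows that the two are not equivalent.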

zhvng commented 1 year ago

Also, your comment made me realize there was an issue in the infer coarse script where I forgot to normalize the audio before passing it in. Fixed in c9a167e!