zhvng / open-musiclm

Implementation of MusicLM, a text to music model published by Google Research, with a few modifications.
https://arxiv.org/abs/2301.11325
MIT License

empty db, curious errors, empty output, long gen time, outputs noise #32

Open baardev opened 10 months ago

baardev commented 10 months ago

I am writing here because the Discord invite in the README.md is invalid.

I am not sure I am doing this "right". Using the dataset provided on Google Drive and the prompt "violins playing Tchaikovsky", it takes 10 minutes on an RTX 4070 Ti to generate tokens and create a 4-second clip of chaotic humming sounds. When I make a 30-second clip, token generation takes over an hour and the result is a 3 MB file that sounds like car horns under water :/

Is there a preferred prompt to use with the test data? What sounds were sampled to make the test data?

When I tried to sample my own sounds, the semantic encoding was less than 10% finished after 24 hours. Is it normal that it should take 10 days to sample a clip?

Also, using the Google Drive data and --model_config ./model/musiclm_large_small_context.json, I get these errors:

`Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.`

`You are using a model of type mert_model to instantiate a model of type hubert. This is not supported for all configurations of models and can yield errors.`

What are the correct settings for using the Google Drive data?

My current command is:

python scripts/infer_top_match.py \
    "violins playing Tchaikovsky" \
    --num_samples 4 \
    --num_top_matches 1 \
    --semantic_path   ./model/semantic.transformer.14000.pt \
    --coarse_path     ./model/coarse.transformer.18000.pt \
    --fine_path       ./model/fine.transformer.24000.pt \
    --rvq_path        ./model/clap.rvq.950_no_fusion.pt \
    --kmeans_path     ./model/kmeans_10s_no_fusion.joblib \
    --model_config    ./model/musiclm_large_small_context.json \
    --duration 4
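
Before a long generation run, it can help to rule out a missing or truncated download. The following is only a sketch: the paths are copied from the command above, while the `find_problems` helper and the `CHECKPOINTS` mapping are hypothetical names introduced here and are not part of open-musiclm.

```python
from pathlib import Path

# Paths copied from the command above; adjust them if your layout differs.
CHECKPOINTS = {
    "--semantic_path": "./model/semantic.transformer.14000.pt",
    "--coarse_path": "./model/coarse.transformer.18000.pt",
    "--fine_path": "./model/fine.transformer.24000.pt",
    "--rvq_path": "./model/clap.rvq.950_no_fusion.pt",
    "--kmeans_path": "./model/kmeans_10s_no_fusion.joblib",
    "--model_config": "./model/musiclm_large_small_context.json",
}


def find_problems(files):
    """Return {flag: reason} for every path that is missing or zero bytes."""
    problems = {}
    for flag, path in files.items():
        p = Path(path)
        if not p.is_file():
            problems[flag] = f"{path} does not exist"
        elif p.stat().st_size == 0:
            problems[flag] = f"{path} is empty"
    return problems


# Hypothetical helper, not part of the repository: report anything suspicious.
for flag, reason in find_problems(CHECKPOINTS).items():
    print(f"{flag}: {reason}")
```

If nothing is printed, every file exists and is non-empty. That does not prove the checkpoints match the model config, only that the downloads are present.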

I had to use the Google Drive data because the code, while not raising any errors, generated a 0-byte preprocessed.db file in the semantic section, which caused errors in the generation section.
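
The empty file can be confirmed with a few lines of standard-library Python. This is a sketch under one loud assumption: that preprocessed.db is an SQLite database, which is not confirmed here. If it is some other format, the size check still works but the table listing will raise `sqlite3.DatabaseError`. The `describe_db` name is hypothetical.

```python
import os
import sqlite3


def describe_db(path):
    """Return (size_in_bytes, {table_name: row_count}) for a database file.

    ASSUMPTION: the file is an SQLite database. A zero-byte file is reported
    as size 0 with no tables, without being opened at all.
    """
    size = os.path.getsize(path)
    tables = {}
    if size > 0:
        con = sqlite3.connect(path)
        try:
            names = [
                row[0]
                for row in con.execute(
                    "SELECT name FROM sqlite_master WHERE type='table'"
                )
            ]
            for name in names:
                count = con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                tables[name] = count
        finally:
            con.close()
    return size, tables
```

Calling `describe_db("preprocessed.db")` on the file described above should return a size of 0 and an empty table mapping, which means the preprocessing step wrote nothing rather than writing rows that later failed to load.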

Is there a working example of this code somewhere with proper checkpoints?

Thanks