xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Error when trying MusicGen example - 127037464 #688

Closed flatsiedatsie closed 4 months ago

flatsiedatsie commented 5 months ago

System Info

The error being caught is a number, which stays the same on each run: 127037464.

To remove as many variables as possible, I then tried a simpler version of the example. Unfortunately I saw the same error, just with a different number: Uncaught 168274888.

Description

The MusicGen example generates an error instead of an audio array.

Reproduction

Steps taken to test:

git clone -b v3 https://github.com/xenova/transformers.js.git
cd transformers.js/
npm i
npm run build

Then I used the contents of dist as the js folder in this minimal example:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Musicgen</title>
  </head>
  <body>
    <script type="module">
        import { AutoTokenizer, MusicgenForConditionalGeneration } from './js/transformers.js';

        // Load tokenizer and model
        const tokenizer = await AutoTokenizer.from_pretrained('Xenova/musicgen-small');
        const model = await MusicgenForConditionalGeneration.from_pretrained(
          'Xenova/musicgen-small', { dtype: 'fp32' }
        );

        // Prepare text input
        const prompt = '80s pop track with bassy drums and synth';
        const inputs = tokenizer(prompt);

        // Generate audio
        const audio_values = await model.generate({
          ...inputs,
          max_new_tokens: 512,
          do_sample: true,
          guidance_scale: 3,
        });

        console.log("audio_values: ", audio_values);
        /*
        // (Optional) Write the output to a WAV file
        import { wavefile } from './js/wavefile.js';

        const wav = new wavefile.WaveFile();
        wav.fromScratch(1, model.config.audio_encoder.sampling_rate, '32f', audio_values.data);
        */
    </script>

  </body>
</html>
flatsiedatsie commented 5 months ago

Same error in Firefox. 168230664

xenova commented 5 months ago

Hi there! This is an out-of-memory error, primarily caused by loading the model in full precision (fp32). The code in the v3 thread was only tested with Node.js, as shown by the use of fs. Fortunately, we're almost done with the WebGPU implementation, which will work in the browser with fp16 quantization (possibly even lower).

I will update you when it does work!

One thing you could try is to set guidance_scale to null, as specifying a value > 1 doubles the batch size (and memory usage).
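
For example, this is just the generate call from your snippet above, with only guidance_scale changed:

const audio_values = await model.generate({
  ...inputs,
  max_new_tokens: 512,
  do_sample: true,
  guidance_scale: null, // disables classifier-free guidance, halving the effective batch size (and memory)
});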

flatsiedatsie commented 5 months ago

Thanks for the tip. I tried it, but with guidance_scale set to null I unfortunately still get the error: 168274888. I should have gone for the 32GB MacBook...

> The code in the v3 thread was only tested with Node.js, as shown by the use of fs

I started to suspect as much.

> I will update you when it does work!

Rock'n. My code is now ready :-)

xenova commented 5 months ago

I've uploaded q8 (uint8/int8) weights for the model. Can you try it out? Setting {dtype: 'q8'} will use those weights.
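
In other words, the only change needed to the snippet above should be the dtype option:

const model = await MusicgenForConditionalGeneration.from_pretrained(
  'Xenova/musicgen-small', { dtype: 'q8' }
);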

flatsiedatsie commented 5 months ago

Happy to!

It results in the following error:

Uncaught Error: failed to call OrtRun(). ERROR_CODE: 2, ERROR_MESSAGE: Got invalid dimensions for input: past_key_values.0.encoder.key for the following indices
 index: 1 Got: 16 Expected: 12
 Please fix either the inputs/outputs or the model.

It refers to the following (minified) code:

}
      , ke = e=>{
        let t = Ve()
          , r = t.stackSave();
        try {
            let o = t.stackAlloc(8);
            t._OrtGetLastError(o, o + 4);
            let n = t.HEAP32[o / 4]
              , s = t.HEAPU32[o / 4 + 1]
              , u = s ? t.UTF8ToString(s) : "";
            throw new Error(`${e} ERROR_CODE: ${n}, ERROR_MESSAGE: ${u}`)
        } finally {
            t.stackRestore(r)
        }
    }

and

K !== 0 && ke("failed to call OrtRun().");
            let Q = [];
            for (let Z = 0; Z < w; Z++) {
                let Ee = u.HEAPU32[R / 4 + Z];

(switching guidance_scale from null back to 3 has no effect)

xenova commented 5 months ago

Oh thanks, I know what the issue is and I'll fix it tomorrow!

flatsiedatsie commented 5 months ago

I can send you an email with a sneak-preview of my project if you'd like.

xenova commented 5 months ago

Okay the corrected quantized weights are now up! Please let me know if it works now. 😇

(Just remember to clear the cache)

flatsiedatsie commented 5 months ago

I think it works!

[Screenshot 2024-04-09 at 17:06:13]

I'll try to extract that float32 array and turn it into a wav file.

flatsiedatsie commented 5 months ago

I've created a GitHub Pages site for my test.

https://flatsiedatsie.github.io/transformers_js_musicgen/

I'm not able to get proper audio yet, it becomes glitchy after a second, but I'm sure that's an error on my end. I'll keep fiddling.

flatsiedatsie commented 5 months ago

Here's an example. glitchy.wav.zip

flatsiedatsie commented 5 months ago

Greatly increasing the guidance scale seems to be effective. It's still glitchy at 10, but less so.

xenova commented 5 months ago

Thanks for all your testing! I don't think this is an issue on your side; I think it's a problem with the quantization settings. Will do some investigating tomorrow.

In the meantime, you can try this version (with reduce_range=True and per_channel=True):


const model = await MusicgenForConditionalGeneration.from_pretrained(
  'Xenova/musicgen-small', { dtype: 'q8', revision: 'refs/pr/9' }
);
flatsiedatsie commented 5 months ago

That did it!

IT WORKS!

flatsiedatsie commented 5 months ago

I've updated the online example on GitHub.

I have a few remaining questions to optimize the implementation:

- I assume the 'refs/pr/9' version will replace the broken 8-bit model? So in future the 'refs/pr/9' addition is not needed?
- Currently the audio array is found at audio_values.ort_tensor.cpuData. Is this path something you intend to abstract away, so it's more in line with other pipelines that output audio?
- Is there a way to get progress updates during the generation?
- Could you explain a bit what the parameters like 'do_sample' do? Then I can add that info to the demo.

Once that is clear, and if you're OK with it, I'd like to share an update on Reddit LocalLlama so people can try it.

If you'd like (read: if it would save you time) I could rework the online example on GitHub to become a demo for Transformers.js.

xenova commented 5 months ago

Wow that is amazing! 🔥 Great stuff! 🚀 The 8-bit quantized version still seems to have some issues (audio is not perfect), but I'll play around with a few more things to try to get it working better!

To answer your questions:

> I assume the 'refs/pr/9' version will replace the broken 8-bit model? So in future the 'refs/pr/9' addition is not needed?

That's right :) I'll do some more exploration of the effect different quantization settings have on the output, as well as trying out different settings for each sub-model (text-encoder, musicgen-decoder, encodec-decoder).

> Currently the audio array is found at audio_values.ort_tensor.cpuData. Is this path something you intend to abstract away, so it's more in line with other pipelines that output audio?

You should be able to do audio_values.data to get it. When this works with the text-to-audio pipeline, the API will be much easier to interact with, including being able to save and play the audio with .save() and .play(), thanks to https://github.com/xenova/transformers.js/pull/682.
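
In the meantime, here's a rough sketch of writing that Float32Array out as a WAV in the browser, based on the commented-out wavefile snippet in your original post (the Blob/download part at the end is plain browser API, not something transformers.js provides):

import { WaveFile } from 'wavefile'; // or from a local bundle, e.g. './js/wavefile.js'

// Mono audio at the model's sampling rate, 32-bit float samples
const wav = new WaveFile();
wav.fromScratch(1, model.config.audio_encoder.sampling_rate, '32f', audio_values.data);

// Offer the result as a download
const blob = new Blob([wav.toBuffer()], { type: 'audio/wav' });
const a = document.createElement('a');
a.href = URL.createObjectURL(blob);
a.download = 'musicgen.wav';
a.click();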

> Is there a way to get progress updates during the generation?

We're planning on updating the API to include support for a Streamer (docs in transformers), which will run a function whenever a new token is generated. Stay tuned :)

> Could you explain a bit what the parameters like 'do_sample' do? Then I can add that info to the demo.

This enables sampling from the predicted probability distribution to produce the next token. If set to false, the model will generate "greedily" (choosing the most probable token at each step). do_sample=true means the model can generate a different song each generation. For musicgen, it's highly encouraged to keep this set to true, otherwise the model can get "stuck" and produce noise. See here for more information.
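
Illustratively, with the same inputs as before:

// Sampling: a different song each run (recommended for musicgen)
const varied = await model.generate({ ...inputs, max_new_tokens: 512, do_sample: true });

// Greedy: deterministic, always picks the most probable token; can get "stuck" and produce noise
const greedy = await model.generate({ ...inputs, max_new_tokens: 512, do_sample: false });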

> Once that is clear, and if you're OK with it, I'd like to share an update on Reddit LocalLlama so people can try it.

Absolutely! Go for it :)

> If you'd like (read: if it would save you time) I could rework the online example on GitHub to become a demo for Transformers.js.

I think a demo will be good once we've got WebGPU support working (which would make everything significantly faster), so stay tuned for that!

xenova commented 5 months ago

Hi again! I've done some additional testing and added per-model dtypes and devices, so you can do the following:

const model = await MusicgenForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        text_encoder: 'q8', // or 'fp32'. Both seem to work well, but q8 provides 4x memory reduction.
        decoder_model_merged: 'q8', // IMPORTANT: otherwise, you'll get out-of-memory issues
        encodec_decode: 'fp32', // IMPORTANT: If not full-precision, quality won't be very good.
    },
    device: {
        text_encoder: 'webgpu', // much faster :)
        decoder_model_merged: 'wasm', // webgpu is slower at the moment due to inefficient buffer reuse. Will fix.
        encodec_decode: 'wasm', // webgpu is currently broken (known upstream bug in onnxruntime-web). Will be fixed soon.
    },
});

Also, I've merged the PR which improved quantization settings (so you don't need to specify revision), so just remember to clear your cache in case it's still using the old files.
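
If you want to clear just the model files programmatically, one way is the following (assuming transformers.js is still using its default Cache Storage bucket named 'transformers-cache'; you can confirm the name under DevTools → Application → Cache Storage):

// Deletes the cached model files so the new weights are re-downloaded on next load
await caches.delete('transformers-cache');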

The output is pretty good now!

https://github.com/xenova/transformers.js/assets/26504141/138eeb3e-adf9-4410-87e1-7ace0d618d2b

Next step will be adding token streaming so you can get the progress in a non-hacky way. :)

flatsiedatsie commented 5 months ago

Rock'n!

> webgpu is slower at the moment

Has the world gone mad?! Dogs and cats living together! Mass hysteria!

(I'll update the demo)

xenova commented 5 months ago

Hey again 👋 I put out a (very simple) demo myself: https://huggingface.co/spaces/Xenova/musicgen-web

https://github.com/xenova/transformers.js/assets/26504141/f20b683d-2fd5-4a66-81e1-775c859a0c51

The progress tracking is now possible thanks to the Streamer API, which I added today. I've added the source code to the examples folder, which will hopefully help you out too! https://github.com/xenova/transformers.js/tree/v3/examples/musicgen-web.
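
For anyone skimming this thread, the rough shape of the approach in that example is sketched below (not verbatim from the example; it assumes a BaseStreamer class is exported from the build, and CallbackStreamer is just an illustrative name):

import { BaseStreamer } from './js/transformers.js';

// Minimal streamer that forwards each newly generated token to a callback
class CallbackStreamer extends BaseStreamer {
  constructor(callback_fn) {
    super();
    this.callback_fn = callback_fn;
  }
  put(value) {
    // Called once per newly generated token
    return this.callback_fn(value);
  }
  end() {
    // Called when generation finishes
    return this.callback_fn();
  }
}

const max_new_tokens = 512;
let num_tokens = 0;
const streamer = new CallbackStreamer((value) => {
  if (value === undefined) return; // end of generation
  console.log(`progress: ${++num_tokens} / ${max_new_tokens} tokens`);
});

const audio_values = await model.generate({ ...inputs, max_new_tokens, do_sample: true, streamer });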

flatsiedatsie commented 5 months ago

Man, fantastic. Gonna be a fun day today :-)

flatsiedatsie commented 5 months ago

I updated my example on GitHub. The Streamer API is great, thank you.

Would it be an idea to also point people to my example in your MusicGen example's README on GitHub? Since it's a vanilla JS implementation that can be copied and run immediately, it might be useful as a starting point for less advanced developers (a group in which I include myself). It can be tricky to learn how the code works, as 'view source' on a Vercel app gives absolutely no insight.

See for example: https://www.reddit.com/r/LocalLLaMA/comments/1c2d5ff/comment/kzagfqq/

If you're open to multiple variations of examples, I could also do a PR to add it to the Transformers.js v3 repo.

Or not, it's all good.