xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Progress callback for Moondream? #781

Closed flatsiedatsie closed 3 weeks ago

flatsiedatsie commented 4 weeks ago

Question

While implementing Moondream (based on the excellent example), I ran into a few questions.

```js
self.model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 500,

    progress_callback: (progress_data) => {
        console.log("progress_data: ", progress_data);
        if (progress_data.status !== 'progress') return;
        self.postMessage(progress_data);
    },
})
```

I’ve also tried the new CallbackStreamer option, but that had no effect either.

From the demo I know it should be possible. But I couldn't find the source code for it (yet). And trying to learn anything from the demo as-is was, well, difficult with all that minifying and framework stuff.

xenova commented 3 weeks ago

> How can I implement a callback while Moondream is generating tokens? A normal progressCallback didn’t work?

Hi there 👋 The streamer API should work for this - check out here for example usage.
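A minimal sketch of that approach, assuming the same `model`, `tokenizer`, and worker context as the snippet above (note that `progress_callback` is a `from_pretrained` option for reporting model *loading* progress, which is why passing it to `generate` has no effect):

```javascript
import { TextStreamer } from '@xenova/transformers';

// A streamer receives tokens while generation is running.
const streamer = new TextStreamer(tokenizer, {
    skip_prompt: true,              // don't re-emit the prompt text
    callback_function: (text) => {
        // Called with each newly decoded chunk of generated text.
        self.postMessage({ status: 'token', output: text });
    },
});

const output = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 500,
    streamer,                       // invoked during generation
});
```

The message names (`status: 'token'`, `output`) are arbitrary here; shape them however your main thread expects.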

> But I couldn't find the source code for it (yet)

I've uploaded the source code for the VLM demo here.

> Is this warning in the browser console anything to worry about?

No need to worry :)

> What would be the effect of changing these values? E.g. what would be the expected outcome of changing decoder_model_merged from q4 to q8?

It's the quantization level, so q4 means 4-bit weights, q8 means 8-bit weights, etc. The matmul ops are pretty efficient for the q4 model in WebGPU, so I would recommend keeping the decoder set to q4. q8 should get slightly better accuracy though.
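As a sketch of where those values live when loading the model (the component names here, `embed_tokens`, `vision_encoder`, and `decoder_model_merged`, follow the Moondream example; other models may expose different components):

```javascript
import { Moondream1ForConditionalGeneration } from '@xenova/transformers';

// Per-component quantization: the large decoder stays at q4 for fast
// WebGPU matmuls, while the smaller components run at fp16 for accuracy.
const model = await Moondream1ForConditionalGeneration.from_pretrained(
    'Xenova/moondream2',
    {
        dtype: {
            embed_tokens: 'fp16',        // text embedding layer
            vision_encoder: 'fp16',      // image encoder
            decoder_model_merged: 'q4',  // swap to 'q8' for 8-bit weights
        },
        device: 'webgpu',
    },
);
```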

> What's the difference between Moondream and NanoLlava? When should I use one over the other?

Architecturally very similar, just with different components: