xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Progress callback for Moondream? #781

Closed flatsiedatsie closed 3 weeks ago

flatsiedatsie commented 4 weeks ago

Question

While implementing Moondream (based on the excellent example), I ran into a few questions.

```js
self.model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 500,

    progress_callback: (progress_data) => {
        console.log("progress_data: ", progress_data);
        if (progress_data.status !== 'progress') return;
        self.postMessage(progress_data);
    },
})
```

I’ve also tried the new CallbackStreamer option, but that had no effect either.

From the demo I know it should be possible. But I couldn't find the source code for it (yet). And trying to learn anything from the demo as-is was, well, difficult with all that minifying and framework stuff.

xenova commented 3 weeks ago

> How can I implement a callback while Moondream is generating tokens? A normal progressCallback didn’t work?

Hi there 👋 The streamer API should work for this - check out here for example usage.
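A minimal sketch of that approach, assuming the same `model`, `tokenizer`, and worker context as the snippet above (note that `progress_callback` is a `from_pretrained` option for reporting model *loading* progress, which is why passing it to `generate` has no effect):

```javascript
import { TextStreamer } from '@xenova/transformers';

// A streamer receives tokens while generation is running.
const streamer = new TextStreamer(tokenizer, {
    skip_prompt: true,              // don't re-emit the prompt text
    callback_function: (text) => {
        // Called with each newly decoded chunk of generated text.
        self.postMessage({ status: 'token', output: text });
    },
});

const output = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 500,
    streamer,                       // invoked during generation
});
```

The message names (`status: 'token'`, `output`) are arbitrary here; shape them however your main thread expects.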

> But I couldn't find the source code for it (yet)

I've uploaded the source code for the VLM demo here.

> Is this warning in the browser console anything to worry about?

No need to worry :)

> What would be the effect of changing these values? E.g. what would be the expected outcome of changing decoder_model_merged from q4 to q8?

It's the quantization level, so q4 means 4-bit weights, q8 means 8-bit weights, etc. The matmul ops are pretty efficient for the q4 model in WebGPU, so I would recommend keeping the decoder set to q4. q8 should get slightly better accuracy though.
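As a sketch of where those values live when loading the model (the component names here, `embed_tokens`, `vision_encoder`, and `decoder_model_merged`, follow the Moondream example; other models may expose different components):

```javascript
import { Moondream1ForConditionalGeneration } from '@xenova/transformers';

// Per-component quantization: the large decoder stays at q4 for fast
// WebGPU matmuls, while the smaller components run at fp16 for accuracy.
const model = await Moondream1ForConditionalGeneration.from_pretrained(
    'Xenova/moondream2',
    {
        dtype: {
            embed_tokens: 'fp16',        // text embedding layer
            vision_encoder: 'fp16',      // image encoder
            decoder_model_merged: 'q4',  // swap to 'q8' for 8-bit weights
        },
        device: 'webgpu',
    },
);
```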

> What's the difference between Moondream and NanoLlava? When should I use one over the other?

Architecturally very similar, just with different components: