xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Error using Xenova/nanoLLaVA in pipeline #758

Closed · kendelljoseph closed 1 month ago

kendelljoseph commented 1 month ago

System Info

Using @xenova/transformers v2.17.1 (see package.json below).

Environment/Platform

Description

https://huggingface.co/Xenova/nanoLLaVA

The newly added nanoLLaVA model threw this error:

Unknown model class "llava", attempting to construct from base class.
Model type for 'llava' not found, assuming encoder-only architecture. 
 Error: Could not locate file: "https://huggingface.co/Xenova/nanoLLaVA/resolve/main/onnx/model_quantized.onnx".

Reproduction

Use Xenova/nanoLLaVA like this:

const featureExtractor = await transformers.pipeline('image-feature-extraction', 'Xenova/nanoLLaVA')

package.json

"@xenova/transformers": "^2.17.1",
xenova commented 1 month ago

I appreciate your enthusiasm in testing the model out, since I only added it a few hours ago... but I'm still adding support for it to the library! I will let you know when it is supported.

kendelljoseph commented 1 month ago

Brilliant, thank you very much!

I'm closely watching this feature, and if you link a PR for it, I can learn from the work and help maintain the code!

xenova commented 1 month ago

You can follow along in the v3 branch: https://github.com/xenova/transformers.js/pull/545

Here's some example code which should work:


import { AutoTokenizer, AutoProcessor, RawImage, LlavaForConditionalGeneration } from '@xenova/transformers';

// Load tokenizer, processor and model
const model_id = 'Xenova/nanoLLaVA';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16',
        vision_encoder: 'q8', // or 'fp16'
        decoder_model_merged: 'q4', // or 'q8'
    },
});

// Prepare text inputs
const prompt = 'Describe this image in detail';
const messages = [
    { role: 'user', content: `<image>\n${prompt}` },
];
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text, { padding: true });

// Prepare vision inputs
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const inputs = { ...text_inputs, ...vision_inputs };
const output = await model.generate({
    ...inputs,
    do_sample: false,
    max_new_tokens: 64,
});

// Decode output
const decoded = tokenizer.batch_decode(output, { skip_special_tokens: false });
console.log('decoded', decoded);

Note that this may change in future, and I'll update the model card when I've done some more testing.
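If you only want the model's answer without the echoed prompt, one option is to slice the output past the input length before decoding. A sketch, assuming the v3 branch's Tensor.slice accepts [start, end] range arguments as in other v3 examples:

// Keep only the newly generated tokens (drop the prompt tokens).
// Assumes `output` is a [batch, sequence] Tensor.
const num_input_tokens = text_inputs.input_ids.dims.at(-1);
const new_tokens = output.slice(null, [num_input_tokens, null]);
const answer = tokenizer.batch_decode(new_tokens, { skip_special_tokens: true });
console.log('answer', answer);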

xenova commented 1 month ago

The model card has been updated with example code 👍 https://huggingface.co/Xenova/nanoLLaVA

We also put an online demo out for you to try: https://huggingface.co/spaces/Xenova/experimental-nanollava-webgpu
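The demo runs on WebGPU; a sketch of opting into the same backend when loading the model yourself, assuming the v3 branch's experimental `device` option:

import { LlavaForConditionalGeneration } from '@xenova/transformers';

// Same loading call as the example above, but requesting the WebGPU
// backend (v3 branch only; check navigator.gpu for availability first).
const model = await LlavaForConditionalGeneration.from_pretrained('Xenova/nanoLLaVA', {
    device: 'webgpu', // assumption: v3's experimental WebGPU backend
    dtype: {
        embed_tokens: 'fp16',
        vision_encoder: 'q8',
        decoder_model_merged: 'q4',
    },
});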

Example videos:

https://github.com/xenova/transformers.js/assets/26504141/3f70437a-8943-44e4-87f0-795df90327f2

https://github.com/xenova/transformers.js/assets/26504141/10c0b4c1-2738-4dbc-ad2f-115f7248dd84