xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

jinaai/jina-clip-v1: support for model names with prefixes #793

Open do-me opened 3 weeks ago

do-me commented 3 weeks ago

Model description

jinaai/jina-clip-v1

Prerequisites

Additional information

You just added the onnx files to their HF repo, that's great! 🥳

Now that model files are getting more complex and have prefixes like text_ or vision_ (or even audio_ in the future), transformers.js needs an update, as it doesn't support loading files other than model.onnx or model_quantized.onnx, if I see it correctly. At the moment you'll get this kind of error with 17.2, as it cannot locate the files with the above prefixes:

Uncaught (in promise) Error: Could not locate file: "https://huggingface.co/jinaai/jina-clip-v1/resolve/main/onnx/model_quantized.onnx".
    at handleError (webpack://semanticfinder/./node_modules/@xenova/transformers/src/utils/hub.js?:248:11)
    at getModelFile (webpack://semanticfinder/./node_modules/@xenova/transformers/src/utils/hub.js?:481:24)
    at async constructSession (webpack://semanticfinder/./node_modules/@xenova/transformers/src/models.js?:451:18)
    at async Promise.all (index 1)
    at async PreTrainedModel.from_pretrained (webpack://semanticfinder/./node_modules/@xenova/transformers/src/models.js?:1121:20)
    at async AutoModel.from_pretrained (webpack://semanticfinder/./node_modules/@xenova/transformers/src/models.js?:5852:20)
    at async Promise.all (index 1)
    at async loadItems (webpack://semanticfinder/./node_modules/@xenova/transformers/src/pipelines.js?:3269:5)
    at async pipeline (webpack://semanticfinder/./node_modules/@xenova/transformers/src/pipelines.js?:3209:21)
    at async self.onmessage (webpack://semanticfinder/./src/js/worker.js?:420:24)
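
For reference, the kind of call that triggers the error above is roughly the following (a minimal sketch; my actual worker passes more options, and the task name here is just an illustration):

import { pipeline } from '@xenova/transformers';

// transformers.js looks for onnx/model_quantized.onnx in the repo,
// which doesn't exist because the files carry the text_/vision_ prefixes
const extractor = await pipeline('feature-extraction', 'jinaai/jina-clip-v1', {
    quantized: true,
});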

You're probably already working on this, but I still thought it might be useful to have it documented here for anyone else looking for support.

Or is there already another way to specify the name?

Your contribution

I can gladly test!

xenova commented 3 weeks ago

You can specify model_file_name as one of the options in .from_pretrained(model_id, { model_file_name: 'model' }) :) Although, do note that the weights I uploaded only work for Transformers.js v3 (unless you manually override the onnxruntime-web/node version to >= 1.16.0).
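
For example, to point one of the model classes at a prefixed weight file, something along these lines should work (a minimal sketch; the file name 'text_model' is an assumption, so use whatever prefix the files in the repo's onnx/ folder actually have, without the .onnx extension):

import { CLIPTextModelWithProjection } from '@xenova/transformers';

// Load the text tower from onnx/text_model(_quantized).onnx instead of the
// default model(_quantized).onnx
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1', {
    model_file_name: 'text_model',
});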

See the README for example Transformers.js code:

import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v1');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(images);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)); // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)); // text-image cross-modal similarity