xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

jinaai/jina-clip-v1: support for model names with prefixes #793

Open do-me opened 3 weeks ago

do-me commented 3 weeks ago

Model description

jinaai/jina-clip-v1

Prerequisites

Additional information

You just added the onnx files to their HF repo, that's great! 🥳

Now that model files are getting more complex and have prefixes like text_ or vision_ (or even audio_ in the future), transformers.js needs an update, as it doesn't support loading files other than model.onnx or model_quantized.onnx, if I see it correctly. At the moment you'll get this kind of error with 17.2, as it cannot locate the files with the above prefixes:

Uncaught (in promise) Error: Could not locate file: "https://huggingface.co/jinaai/jina-clip-v1/resolve/main/onnx/model_quantized.onnx".
    at handleError (webpack://semanticfinder/./node_modules/@xenova/transformers/src/utils/hub.js?:248:11)
    at getModelFile (webpack://semanticfinder/./node_modules/@xenova/transformers/src/utils/hub.js?:481:24)
    at async constructSession (webpack://semanticfinder/./node_modules/@xenova/transformers/src/models.js?:451:18)
    at async Promise.all (index 1)
    at async PreTrainedModel.from_pretrained (webpack://semanticfinder/./node_modules/@xenova/transformers/src/models.js?:1121:20)
    at async AutoModel.from_pretrained (webpack://semanticfinder/./node_modules/@xenova/transformers/src/models.js?:5852:20)
    at async Promise.all (index 1)
    at async loadItems (webpack://semanticfinder/./node_modules/@xenova/transformers/src/pipelines.js?:3269:5)
    at async pipeline (webpack://semanticfinder/./node_modules/@xenova/transformers/src/pipelines.js?:3209:21)
    at async self.onmessage (webpack://semanticfinder/./src/js/worker.js?:420:24)
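
For reference, the kind of call that triggers the error above is roughly the following (a minimal sketch; my actual worker passes more options, and the task name here is just an illustration):

import { pipeline } from '@xenova/transformers';

// transformers.js looks for onnx/model_quantized.onnx in the repo,
// which doesn't exist because the files carry the text_/vision_ prefixes
const extractor = await pipeline('feature-extraction', 'jinaai/jina-clip-v1', {
    quantized: true,
});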

You're probably already working on this, but I still thought it might be useful to have it documented here for anyone else looking for support.

Or is there already another way to specify the name?

Your contribution

I can gladly test!

xenova commented 3 weeks ago

You can specify model_file_name as one of the options in .from_pretrained(model_id, { model_file_name: 'model' }) :) Although, do note that the weights I uploaded only work for Transformers.js v3 (unless you manually override the onnxruntime-web/node version to >= 1.16.0).
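
For example, to point one of the model classes at a prefixed weight file, something along these lines should work (a minimal sketch; the file name 'text_model' is an assumption, so use whatever prefix the files in the repo's onnx/ folder actually have, without the .onnx extension):

import { CLIPTextModelWithProjection } from '@xenova/transformers';

// Load the text tower from onnx/text_model(_quantized).onnx instead of the
// default model(_quantized).onnx
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1', {
    model_file_name: 'text_model',
});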

See the README for example Transformers.js code:

import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v1');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v1');

// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);

// Read images and run processor
const urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(images);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);

// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)); // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)); // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)); // text-image cross-modal similarity