Closed mr-sarthakgupta closed 2 weeks ago
Can you include the code you are running? You may need to update the dtype
to one which produces better results.
I tried the following ways:
```js
import { AutoProcessor, CLIPVisionModelWithProjection, RawImage } from '@xenova/transformers';

const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-large-patch14');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-large-patch14', {
  device: 'webgpu',
  dtype: 'fp16',
});

const image = await RawImage.read(url);

const time_start = performance.now();
const image_inputs = await processor(image);
const { image_embeds } = await vision_model(image_inputs);
console.log(`Forward pass took ${performance.now() - time_start} ms`);
```
This took 17371 ms with fp16 and 17353 ms with fp32.
Second method I tried was:
```js
import { pipeline } from '@xenova/transformers';

const vision_model = await pipeline('image-feature-extraction', 'Xenova/clip-vit-large-patch14', {
  device: 'webgpu',
  dtype: 'fp16',
});
const image_embeds = await vision_model(url);
```
At fp16 this took 17549 ms and at fp32 it took 16576 ms.
Meanwhile, without WebGPU, using:

```js
import { pipeline } from '@xenova/transformers';

const vision_model = await pipeline('image-feature-extraction', 'Xenova/clip-vit-large-patch14', {
  dtype: 'fp16',
});
const image_embeds = await vision_model(url);
```
the forward pass took 16753 ms.
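One caveat with measurements like these: on WebGPU, the first inference typically includes shader compilation, so timing a single forward pass can hide the speedup. A minimal, library-agnostic sketch that discards a warm-up run and reports the median (the helper name `timeMedian` is my own, not part of Transformers.js):

```javascript
// Time an async function, excluding the first (warm-up) call,
// and report the median of the remaining runs in milliseconds.
async function timeMedian(fn, runs = 5) {
  await fn(); // warm-up: shader compilation / caching happens here
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}

// Usage with the processor/model from the snippets above:
// const ms = await timeMedian(async () => {
//   const image_inputs = await processor(image);
//   await vision_model(image_inputs);
// });
```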
Even for batch size = 1, the benchmark showed a huge speed improvement on my device.
Hmm, strange. Your code looks right. 🤔
Could you try this demo: https://huggingface.co/spaces/Xenova/webgpu-clip? It should be real-time CLIP with WebGPU. You can also try a smaller CLIP model like https://huggingface.co/Xenova/clip-vit-base-patch32, maybe the large one has some issues with the ONNX export.
https://github.com/xenova/transformers.js/assets/26504141/75a4ab6f-41f2-4a00-9967-3cd7dcaa801e
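It may also be worth confirming that your page actually has WebGPU available before comparing backends; otherwise both runs may be executing on the same fallback backend. A small sketch (the `typeof` guard keeps it safe outside a browser; `navigator.gpu.requestAdapter()` is the standard WebGPU entry point):

```javascript
// Check whether WebGPU is actually usable in this environment.
// In a supporting browser, navigator.gpu is defined and
// requestAdapter() resolves to a non-null adapter.
async function webgpuAvailable() {
  if (typeof navigator === 'undefined' || !('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}

// webgpuAvailable().then(ok => console.log('WebGPU available:', ok));
```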
Unfortunately, I need the projection dimension to be 768, which is only true for the large model. It's really strange indeed; the demo works perfectly fine on my device too, running at ~5 FPS.
Would it be possible to try the large model in the demo to see if it's an issue specific to the large model? Also, which model does this demo use?
Edit: I just tried https://huggingface.co/Xenova/clip-vit-base-patch32 and found the same trend: no significant change in inference speed with or without WebGPU.
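For context on the 768-dimensional projection mentioned above: CLIP embeddings are typically compared via cosine similarity. A minimal, self-contained sketch (plain arrays; in practice you would pass something like `image_embeds.data` from the snippets earlier, which is a typed array):

```javascript
// Cosine similarity between two embedding vectors of equal length,
// e.g. the 768-d image/text projections from clip-vit-large-patch14.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // → 1
```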
Hi @xenova, could it have something to do with the fact that I'm using the model in a browser extension? Also, would it be possible for you to provide the code for the webgpu-clip demo?
> could it have something to do with the fact that I'm using the model in a browser extension?
Hmm, good question. Just to confirm, are you sure you've installed Transformers.js v3 from the dev branch with:

```sh
npm install xenova/transformers.js#v3
```

? You might still be using v2.
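One quick way to check which version actually got installed is to read the package manifest under `node_modules` (the path below assumes the package is installed as `@xenova/transformers`; adjust if your lockfile says otherwise):

```javascript
// Extract the version field from a package.json manifest string.
function versionFromManifest(manifestText) {
  return JSON.parse(manifestText).version;
}

// In a real project:
// const text = require('fs').readFileSync(
//   'node_modules/@xenova/transformers/package.json', 'utf8');
// console.log(versionFromManifest(text)); // v3 builds report a 3.x version

// Self-contained check on a sample manifest:
console.log(versionFromManifest('{"name":"@xenova/transformers","version":"3.0.0"}')); // → 3.0.0
```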
> Also, would it be possible for you to provide the code for the webgpu-clip demo?
Sure - here's the source code: https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-clip
> Just to confirm, are you sure you've installed Transformers.js v3 from the dev branch with `npm install xenova/transformers.js#v3`?
That was it! The models are running at the same speed as the demos now. Thanks for the help!
Question
I want to use the model https://huggingface.co/Xenova/clip-vit-large-patch14 with WebGPU for fast inference in the browser. I ran the WebGPU benchmark to check the performance gain, and it indeed showed a ~7x speed improvement on my device.
But when I run the CLIP model linked above, there's barely any difference in performance with and without WebGPU.