
YOLOS model extremely slow #533

tarekziade commented 8 months ago

System Info

latest wasm version

Environment/Platform

Description

I am trying to run https://huggingface.co/hustvl/yolos-tiny using a quantized version (similar to Xenova/yolos-tiny). It works with the object-detection pipeline, but it is extremely slow.

Running inference on an image with transformers.js takes around 15 seconds on my M1; the same model in Python transformers takes about 190 ms.

I profiled it with the browser dev tools, and the culprit is in the ONNX runtime at wasm-function[10863] @ ort-wasm-simd.wasm:0x801bfa, but I don't have the debug symbols, so that's not very useful.

Is there a way to force transformers.js to run with a debug build of the ONNX runtime?

Reproduction

Run the object-detection demo at https://xenova.github.io/transformers.js/ and swap the detr-resnet model for yolos-tiny.

xenova commented 8 months ago

Can you try using the unquantized version? You can do this by specifying:

const pipe = await pipeline('task', 'model', { quantized: false });
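
Applied to this case, a minimal end-to-end sketch (the image URL is a placeholder; Xenova/yolos-tiny is the checkpoint discussed above):

    import { pipeline } from "@xenova/transformers";

    // Object-detection pipeline with the unquantized weights.
    const detector = await pipeline("object-detection", "Xenova/yolos-tiny", { quantized: false });

    // Run detection on an image URL (placeholder) and log the boxes/labels/scores.
    const output = await detector("https://example.com/image.jpg", { threshold: 0.9 });
    console.log(output);
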
tarekziade commented 8 months ago

It's slightly faster.

tarekziade commented 8 months ago

I tried this:

    // Assumed imports for this snippet, using the @xenova/transformers package:
    import { AutoModelForObjectDetection, AutoProcessor, RawImage } from "@xenova/transformers";

    const model_name = "Xenova/yolos-tiny";
    const model = await AutoModelForObjectDetection.from_pretrained(model_name);

    // Time the pre-processing step (imageElement is an existing <img> on the page).
    let start = Date.now();
    const processor = await AutoProcessor.from_pretrained(model_name);
    const image = await RawImage.read(imageElement.src);
    const image_inputs = await processor(image);
    let end = Date.now();

    console.log(`Image processing Execution time: ${end - start} ms.`);

    // Time the model forward pass.
    start = Date.now();
    const outputs = await model(image_inputs);
    end = Date.now();
    console.log(`Inference Execution time: ${end - start} ms.`);

and that gives:

Image processing Execution time: 161 ms.
Inference Execution time: 14652 ms.

Here's the full session recorded with the Firefox profiler:

https://share.firefox.dev/497JdCi

The function that is slow is _OrtRun in the ONNX runtime. I don't think I can get more info unless I run it with debug symbols. Can the runtime be specified somewhere in the config? I could point it at a build with the debug symbols.
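
For what it's worth, transformers.js exposes onnxruntime-web's WASM settings through its env object, so a custom runtime build could presumably be pointed at like this (a sketch; the local path to a debug build is an assumption):

    import { env, pipeline } from "@xenova/transformers";

    // Serve your own build of the onnxruntime-web WASM binaries (e.g. compiled with
    // debug symbols) and point transformers.js at that directory (path is an assumption).
    env.backends.onnx.wasm.wasmPaths = "/ort-debug/";
    // A single thread tends to make profiles easier to read.
    env.backends.onnx.wasm.numThreads = 1;

    const detector = await pipeline("object-detection", "Xenova/yolos-tiny");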

xenova commented 8 months ago

This might just be a limitation of onnxruntime-web's WASM execution provider, and can be fixed with the new WebGPU execution provider (coming soon).

@fs-eire @guschmue might be able to do more in-depth profiling.
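
For reference, later transformers.js releases expose the WebGPU execution provider through a device option; a rough sketch, assuming the newer @huggingface/transformers package (not the version discussed in this thread):

    import { pipeline } from "@huggingface/transformers";

    // Request the WebGPU execution provider instead of the default WASM backend.
    const detector = await pipeline("object-detection", "Xenova/yolos-tiny", { device: "webgpu" });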