xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

[Feature request] WebGPU support #73

Open loretoparisi opened 1 year ago

loretoparisi commented 1 year ago

WebGPU

Chrome shipped WebGPU today in Chrome 113 Beta.

Reason for request

WebGPU is currently a work in progress in Firefox and Safari, in addition to the Chrome beta. TensorFlow.js already supports WebGPU for several operators.

Additional context

It's worth noting that Google's Project Dawn, a native C++ WebGPU implementation, will soon support Node.js. The Node bindings are a work in progress here.

xenova commented 1 year ago

Thanks for the resources :) For the most part, we are waiting for onnxruntime-web to add webgpu as a supported backend.

Here is the associated PR to track its progress:

However, we do plan to support other model formats/backends (in a similar way to how the Python library supports PyTorch, TensorFlow, and ONNX). I don't want to spoil anything... but things are in the works 😉
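For illustration, here is a minimal sketch of what selecting the WebGPU execution provider directly in onnxruntime-web could look like once that backend lands (the provider name, the bundle, and the model/input names below are assumptions, not a finalized API):

```js
// Sketch only: assumes a future onnxruntime-web build (or WebGPU-enabled
// bundle) that exposes a 'webgpu' execution provider, plus a hypothetical
// model.onnx with an int64 `input_ids` input.
import * as ort from 'onnxruntime-web';

async function main() {
  // Prefer WebGPU, fall back to WASM if it is not available.
  const session = await ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgpu', 'wasm'],
  });

  // Dummy token ids just to exercise the session.
  const inputIds = new ort.Tensor(
    'int64',
    BigInt64Array.from([101n, 2023n, 102n]),
    [1, 3],
  );

  const outputs = await session.run({ input_ids: inputIds });
  console.log(outputs);
}

main();
```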

gabrielgrant commented 10 months ago

AFAIU ORT has merged WebGPU support: https://github.com/microsoft/onnxruntime/issues/11695

What's needed to take advantage of this on the transformers.js side?

sroussey commented 9 months ago

For reference, the webgpu operators implemented:

https://github.com/microsoft/onnxruntime/blob/main/js/web/docs/webgpu-operators.md

gabrielgrant commented 9 months ago

Unfortunately the WebGPU implementation is currently slower than the WASM version, though: https://github.com/microsoft/onnxruntime/issues/18754#issuecomment-1859722309

Would be great to know what's needed to support WebGPU in transformers.js assuming that perf issue gets resolved at some point, but not super urgent/important at the moment

DavidGOrtega commented 8 months ago

Unfortunately the WebGPU implementation is currently slower than the WASM version,

I have some models running on the JSEP WebGPU backend and they are 10 times faster than WASM, e.g. CLIP.

To me, the main problem is the current backend design: it's global (as far as I know). We should be able to set the preferred backend per model.
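To make the point concrete, here is a rough sketch (only the global `env` settings exist today as far as I know; the per-model `device` option is hypothetical):

```js
// Today the ONNX backend in transformers.js is configured globally via `env`.
import { env, pipeline } from '@xenova/transformers';

env.backends.onnx.wasm.numThreads = 1; // global setting, applies to every model

// What I am asking for: a per-model backend choice, e.g. a (hypothetical)
// `device` option on the pipeline itself instead of a single global backend.
const clip = await pipeline('zero-shot-image-classification', 'Xenova/clip-vit-base-patch32', {
  // device: 'webgpu', // hypothetical per-model option, for illustration only
});
```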

gabrielgrant commented 8 months ago

@DavidGOrtega that's great news! To be clear, are you running your models directly on ORT, or using JSEP through transformers.js somehow? I would love to hear more details about exactly what your setup looks like, and which other models you've seen this perf improvement on!

DavidGOrtega commented 8 months ago

I'm running them with vanilla ONNX Runtime.

I can do a PR to support WebGPU here (I did the Node one); it's trivial. However, I think we should rethink the backend a bit to make it more flexible, so we can choose the backend and its options per model. Also, the ONNX fallback is not perfect: I have models where the session loads but inference does not work, and that is a step after the ONNX fallback...
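As a concrete example of what I mean about the fallback, a sketch (assuming direct onnxruntime-web usage; `warmupFeeds` is whatever dummy input the model expects): only treat a provider as working after a test inference succeeds, not just after the session loads.

```js
import * as ort from 'onnxruntime-web';

// Try providers in order; a provider only "wins" if a warm-up inference
// succeeds, because some models load fine on WebGPU but fail at run time.
async function createSessionWithFallback(modelUrl, warmupFeeds) {
  for (const provider of ['webgpu', 'wasm']) {
    try {
      const session = await ort.InferenceSession.create(modelUrl, {
        executionProviders: [provider],
      });
      await session.run(warmupFeeds); // verify inference, not just session creation
      return session;
    } catch (err) {
      console.warn(`Execution provider '${provider}' failed, trying the next one`, err);
    }
  }
  throw new Error('No working execution provider found');
}
```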

@xenova can also do WebGPU and is testing it among other backends like Candle. Probably not done yet just because not all models support WebGPU?

luweigen commented 7 months ago

Unfortunately the WebGPU implementation is currently slower than the WASM version,

I have some models running on the JSEP WebGPU backend and they are 10 times faster than WASM, e.g. CLIP.

To me, the main problem is the current backend design: it's global (as far as I know). We should be able to set the preferred backend per model.

@DavidGOrtega What model can you run?

I tried some BERT models and got "cannot resolve operator 'Erf' with opsets: ai.onnx v11" with a direct call to https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/esm/ort.min.js and the model weights cached by transformers.js 2.14.2.

luweigen commented 7 months ago

I also tried the v3 branch of transformers.js and got a syntax error. It seems that commit 66da130 was overwritten by 8c465a9. A simple fix, as follows, leads to other errors. It seems there is still a long way to go?

28f666d
xenova commented 7 months ago

@luweigen the v3 branch is still a work in progress, and will be marked as non-draft when it is ready for testing 👍

beaufortfrancois commented 7 months ago

@xenova Could you share with us what is currently blocking transformers.js from taking advantage of WebGPU? I think we're all pretty excited to try it and compare performance (CPU vs GPU). Thank you! ❤️

luweigen commented 7 months ago

@xenova Could you share with us what is currently blocking transformers.js from taking advantage of WebGPU? I think we're all pretty excited to try it and compare performance (CPU vs GPU). Thank you! ❤️

I wrote a blog post remixing transformers.js and the onnxruntime WebGPU backend (https://medium.com/@GenerationAI/transformers-js-onnx-runtime-webgpu-46c3e58d547c) and a short comparison of CPU vs GPU (https://medium.com/@GenerationAI/performance-of-onnxruntime-webgpu-44a25d9897a9). Some functions are adapted from transformers.js to make it work, as mentioned in the code comments.
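In short, the remix looks roughly like this (a sketch only; the model URL, input names and output name are assumptions that depend on the ONNX export, the exact adaptation is in the post):

```js
// transformers.js for tokenization, onnxruntime-web (WebGPU) for inference.
import { AutoTokenizer } from '@xenova/transformers';
import * as ort from 'onnxruntime-web';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/all-MiniLM-L6-v2');
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
});

const encoded = await tokenizer('WebGPU in the browser');
const feeds = {
  input_ids: new ort.Tensor('int64', encoded.input_ids.data, encoded.input_ids.dims),
  attention_mask: new ort.Tensor('int64', encoded.attention_mask.data, encoded.attention_mask.dims),
  // Some BERT exports also require `token_type_ids`.
};

const outputs = await session.run(feeds);
console.log(outputs.last_hidden_state); // output name depends on the export
```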

loretoparisi commented 7 months ago

@luweigen thanks for this post. The CPU to WebGPU comparison is fair, but not all of the results are obvious. In your tests you report:

Execution time: 6169.1 ms
Batch Execution time: 23191.9 ms

WebGPU Execution time: 20445.1 ms
WebGPU Batch Execution time: 2231 ms

Hence, for processing a batch of size ~100 you get a CPU/WebGPU ratio of ~10x, i.e. a clear WebGPU speedup.

But when the inference is just one sequence, the CPU/WebGPU ratio is ~0.3, i.e. the CPU is ~3.3x faster than WebGPU, so it seems that offloading to the GPU is not that efficient with a batch size of 1. So, according to your tests with MiniLM, when does WebGPU become useful, in other words, at which batch size does the CPU/WebGPU ratio exceed 1?
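For reference, the ratios above come straight from the reported timings:

```js
// Speed-up ratios computed from the timings reported above.
const cpuSingle = 6169.1,  cpuBatch = 23191.9;  // ms
const gpuSingle = 20445.1, gpuBatch = 2231;     // ms

console.log((cpuBatch / gpuBatch).toFixed(1));   // ~10.4 -> WebGPU wins on the batch
console.log((cpuSingle / gpuSingle).toFixed(2)); // ~0.30 -> CPU is ~3.3x faster at batch size 1
```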

luweigen commented 7 months ago

@luweigen thanks for this post. The CPU to WebGPU comparison is fair, but not all of the results are obvious. In your tests you report:

Execution time: 6169.1 ms
Batch Execution time: 23191.9 ms

WebGPU Execution time: 20445.1 ms
WebGPU Batch Execution time: 2231 ms

Hence, for processing a batch of size ~100 you get a CPU/WebGPU ratio of ~10x, i.e. a clear WebGPU speedup.

But when the inference is just one sequence, the CPU/WebGPU ratio is ~0.3, i.e. the CPU is ~3.3x faster than WebGPU, so it seems that offloading to the GPU is not that efficient with a batch size of 1. So, according to your tests with MiniLM, when does WebGPU become useful, in other words, at which batch size does the CPU/WebGPU ratio exceed 1?

all-MiniLM-L6-v2 is very small; the CPU can handle it well enough when the batch size is also small. I guess that with a larger model we would see the advantage of the GPU at small batch sizes too. This was a very preliminary version of the code, so it hasn't been shared on GitHub yet, but it will be, with more test results on other models and hyperparameters. I/O binding to the GPU is not implemented yet, but I guess the overall improvement from it won't be very large.

josephrocca commented 4 months ago

But when the inference is just one sequence, the CPU/WebGPU ratio is ~0.3, i.e. the CPU is ~3.3x faster than WebGPU, so it seems that offloading to the GPU is not that efficient with a batch size of 1.

FWIW, even with batch size = 1, I get a 5x speedup for the WebGPU backend on bge-base-en-v1.5 according to Xenova's excellent webgpu-embedding-benchmark. Note that this model is 109M params, i.e. about 5x larger than all-MiniLM-L6-v2, but it can still embed a couple of passages per second on my Android phone even with the Wasm backend, and is "only" ~100 MB when 8-bit quantized (fine for my use case).

5x is certainly worth it for me! Really looking forward to the WebGPU backend stabilizing (and hoping the Chrome team gets Linux WebGPU sorted soon 🤞 - also, it looks like Safari isn't too far away from a decent/stable WebGPU release, surprisingly).
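For anyone wanting to try this once v3 lands, a minimal sketch (assuming the v3 branch exposes a `device` option roughly like the benchmark uses; option names may change before release):

```js
import { pipeline } from '@xenova/transformers';

// Assumed v3-style options: `device` for the backend, `quantized` for the
// ~100 MB 8-bit weights mentioned above.
const embed = await pipeline('feature-extraction', 'Xenova/bge-base-en-v1.5', {
  device: 'webgpu', // hypothetical here; fall back to 'wasm' if unavailable
  quantized: true,
});

const output = await embed('The quick brown fox', { pooling: 'mean', normalize: true });
console.log(output.dims); // e.g. [1, 768] for bge-base-en-v1.5
```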