xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js

update the onnxruntime-web version to support the multithreads #882

Open burke-up opened 1 month ago

burke-up commented 1 month ago

Feature request

// Due to a bug in onnxruntime-web, we must disable multithreading for now.
// See https://github.com/microsoft/onnxruntime/issues/14445 ("[Web] chrome V3 extension TypeError: URL.createObjectURL is not a function") for more information.
env.backends.onnx.wasm.numThreads = 1;

The issue https://github.com/microsoft/onnxruntime/issues/14445 has been closed. Is it possible to enable multi-threading for onnxruntime-web with WASM?

Motivation

Update the onnxruntime-web version to support multithreading.

Your contribution

no

kungfooman commented 1 month ago

PR for the issue you mentioned:

https://github.com/microsoft/onnxruntime/pull/20165#issue-2217881630

Looks like it's at: onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21

Even the V3 branch is still at 1.18.0, and it still contains references to 1.17.1:

env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.1/dist/';
env.backends.onnx.wasm.numThreads = 1;

Based on semver I think you should be able to just install the 1.19 package, but the PR sounded like a somewhat bigger refactor, so I wouldn't be surprised if we need more changes for this. Want to test this and report back?

fs-eire commented 1 month ago

The latest v3 branch is using "onnxruntime-web": "1.19.0-dev.20240804-ee2fe87e2d", and I believe WebAssembly multi-threading should already be supported in this version (unless you use it in a CSP-restricted environment, e.g. a ServiceWorker). Please give it a try and let me know if it's still not working.
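
As a minimal sketch (not part of the original comment), one way to gate the thread count on cross-origin isolation before loading a model, using the same alpha build that appears later in this thread:

// Illustrative sketch: enable multithreading only where SharedArrayBuffer is usable.
// The CDN URL and version are the ones used in the test script further down.
const { env } = await import('https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0-alpha.5/dist/transformers.min.js');
const canUseThreads = typeof SharedArrayBuffer !== 'undefined' && self.crossOriginIsolated;
env.backends.onnx.wasm.numThreads = canUseThreads ? navigator.hardwareConcurrency : 1;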

kungfooman commented 4 weeks ago

Yes, your PR was merged a day after I mentioned it: https://github.com/xenova/transformers.js/commit/437cb34e50a27e40237c3a8eee3527b2db459d58

And the wasmPaths references to 1.17.1 are still not updated, which may cause issues in the two examples that also set numThreads.

I wanted to try it, but the amount of magic spells you have to cast (aka Chrome “security” requirements) to make it work is hilarious.

I enabled this option:

image

And I don't get a SharedArrayBuffer anyway:

image

And ONNX only seems to check self.crossOriginIsolated, which is false here, so I didn't want to waste more time on random server settings.

Edit: I figured it out and it seems to work for me, the last missing step on my Linux system was to use this command line option: ./run.sh --enable-features=SharedArrayBuffer

run.sh is from https://github.com/scheib/chromium-latest-linux and makes sure the latest Chromium is used.

I tested via this simple script:

// window.self = {crossOriginIsolated: true};
const {env, pipeline}  = await import('https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0-alpha.5/dist/transformers.min.js');
// env.backends.onnx.wasm.numThreads = 1;
// E.g. 32 or whatever CPU you have.
env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency;
window.env = env;
const speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';
const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {
  // The quantized version outputs audio distortions, so we deactivate it:
  quantized: false,
  // https://github.com/xenova/whisper-web/tree/experimental-webgpu
  // https://github.com/xenova/whisper-web/commit/1c52c346e0e4a7c6b6ad39791f2df3d101eca3c3
  // device: 'webgpu',
});
function getText() {
  return 'Hello world, don\'t count the days, make the days count. Wubba lubba dub dub!';
}
function newButton(innerText, onClick) {
  const button = document.createElement('button');
  button.innerText = innerText;
  button.onclick = onClick;
  document.body.append(button);
}
async function benchmark() {
  const start = Date.now();
  const data = await synthesizer(getText(), {speaker_embeddings});
  const end = Date.now();
  const delta = end - start;
  console.log('Took', delta / 1000, 'seconds using', env.backends.onnx.wasm.numThreads, 'threads to generate data:', data);
}
newButton("benchmark()", () => {
  benchmark();
});
// This doesn't seem to affect anything afterwards:
newButton("num.threads = 1", () => {
  env.backends.onnx.wasm.numThreads = 1;
  benchmark();
});
newButton("num.threads = 2", () => {
  env.backends.onnx.wasm.numThreads = 2;
  benchmark();
});
newButton("num.threads = 3", () => {
  env.backends.onnx.wasm.numThreads = 3;
  benchmark();
});
newButton("num.threads = 4", () => {
  env.backends.onnx.wasm.numThreads = 4;
  benchmark();
});

Using only one core, the task takes 5-6 seconds. Using 32 cores the task takes 1-2 seconds. Kinda funny in itself, but I assume there is a lot of overhead somewhere or the task is just too small. Maybe other people can test and share their results.

flatsiedatsie commented 4 weeks ago

Cool, nice experiment. And it's exactly the model I was hoping to speed up, as the rest of my voice-that-with-an-AI pipeline, namely Whisper and LLMs, already enjoys some level of GPU speedup.

Would this work without casting magic spells on MacOS or Windows? Or is this still behind a flag on all browsers?

According to Can I Use, SharedArrayBuffer is widely implemented?

kungfooman commented 4 weeks ago

Would this work without casting magic spells on MacOS or Windows? Or is this still behind a flag on all browsers?

If you send these two headers (Cross-Origin-Opener-Policy and Cross-Origin-Embedder-Policy) via PHP or some other server setup to enable cross-origin isolation, it might work on all OSes (I just couldn't be bothered to spin up a special server for a quick test). But yeah, why is it even still "experimental" then? Only a test will tell; maybe you want to give it a try?
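
For reference, a minimal sketch of a server that sends the two cross-origin isolation headers (a plain Node.js example added here for illustration; it is not the setup used in this thread, and COEP can be either require-corp or credentialless):

// Minimal static server sketch that sets the two cross-origin isolation headers.
// Any server that sends COOP/COEP works the same way.
import http from 'node:http';
import { readFile } from 'node:fs/promises';

http.createServer(async (req, res) => {
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  // Serve a single test page; a real server would route and set Content-Type per file.
  res.setHeader('Content-Type', 'text/html');
  res.end(await readFile('./index.html'));
}).listen(8080, () => console.log('http://localhost:8080'));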

flatsiedatsie commented 4 weeks ago

Yes, I'm already on it. I'm sure I have cross-origin isolation through htaccess:

Header set Cross-Origin-Embedder-Policy "credentialless"
Header set Cross-Origin-Opener-Policy "same-origin"
Header set Cross-Origin-Resource-Policy "cross-origin"

AddType application/wasm wasm

It seems the only change I need to make to my code is to add this line?

env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency; // navigator works in both window and worker contexts

Which would mean that by default Transformers.js only uses one CPU thread for any WASM task? Is that really true? Is there a way to check how many threads Transformers.js is actually using?
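
As a rough sketch of how one might check (assumption added here: whatever is set on env before the first model call is the value onnxruntime-web uses, and leaving it unset lets the backend pick its own default):

// Log what is configured vs. what the hardware offers (env from the transformers.js import).
console.log('hardwareConcurrency:', navigator.hardwareConcurrency);
console.log('configured numThreads:', env.backends.onnx.wasm.numThreads);
console.log('crossOriginIsolated:', self.crossOriginIsolated); // threads need this to be true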

Fun fact: I believe you can create cross-origin isolation via a service worker, no server-side adjustments needed: https://github.com/gzuidhof/coi-serviceworker

flatsiedatsie commented 4 weeks ago

OK, I've run a test.

It seems that by default Transformers.js already uses multi-threading. What's more, it seems to have already picked the optimum number (which I often read is the number of logical cores divided by two). This bears out in the tests:

Text: I can turn any sentence or document that you provide into speech.

1 thread -> 9 seconds
2 threads -> 5.7 seconds
4 threads -> 3.5 seconds (automatic setting)
8 threads -> 4.5 seconds (my hardwareConcurrency)

Details: this is on a MacBook Pro M1 with 16 GB of RAM, using the Brave browser.
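
For illustration, the cores-divided-by-two heuristic mentioned above would look roughly like this (purely illustrative, added here; not taken from the onnxruntime-web source):

// Illustrative only: the "half the logical cores" heuristic discussed above.
// Not verified against what onnxruntime-web actually does internally.
const logicalCores = navigator.hardwareConcurrency || 1;
const heuristicThreads = Math.max(1, Math.ceil(logicalCores / 2)); // 4 on an 8-thread M1
console.log('heuristic thread count:', heuristicThreads);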

1 thread

Screenshot 2024-08-15 at 13 56 07

2 threads

Screenshot 2024-08-15 at 14 02 15

4 threads

Screenshot 2024-08-15 at 13 54 25

8 threads

Screenshot 2024-08-15 at 13 52 59

My settings:

static async getInstance(progress_callback = null) {
        if (this.tokenizer_instance === null) {
            this.tokenizer_instance = AutoTokenizer.from_pretrained(this.model_id, { progress_callback });
        }

        if (this.model_instance === null) {
            this.model_instance = SpeechT5ForTextToSpeech.from_pretrained(this.model_id, {
                //quantized: false,
                dtype: 'fp32',
                quantized:true,
                progress_callback,
                device:self.device,
            });
        }
        if (this.vocoder_instance === null) {
            this.vocoder_instance = SpeechT5HifiGan.from_pretrained(this.vocoder_id, {
                //quantized: false,
                dtype: 'fp32',
                quantized:true,
                progress_callback,
                device:self.device,
            });
        }

        const result = await Promise.all([
            this.tokenizer_instance,
            this.model_instance,
            this.vocoder_instance,
        ]);
        self.postMessage({
            status: 'ready',
        });
        return result;
    }

I'll check if quantization makes a difference. They all sounded fine to me already.

flatsiedatsie commented 4 weeks ago

Did some voice quality testing.

Canadian_voice_test.zip

I don't hear any difference really. In fact, the quantized one sounds a little better to me :-D