PR for the issue you mentioned:
https://github.com/microsoft/onnxruntime/pull/20165#issue-2217881630
Looks like it's at: onnxruntime-web@1.19.0-esmtest.20240513-a16cd2bd21
Even the V3 branch is still at 1.18.0, and it also still contains references to 1.17.1:
env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.17.1/dist/';
env.backends.onnx.wasm.numThreads = 1;
Based on semver I think you should be able to just install the 1.19
package, but the PR sounded like a little bigger refactor, so I wouldn't be surprised if we need more changes for this. You wanna test this and report?
The latest v3 branch is using "onnxruntime-web": "1.19.0-dev.20240804-ee2fe87e2d", and I believe WebAssembly multi-threading should already be supported in this version (unless you use it in a CSP-restricted environment, e.g. a ServiceWorker). Please have a try and let me know if it's still not working.
Yes, your PR was merged a day after I mentioned it: https://github.com/xenova/transformers.js/commit/437cb34e50a27e40237c3a8eee3527b2db459d58
And the wasmPaths referring to 1.17.1 are still not updated, which may cause issues with those two examples regarding numThreads.
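Presumably the fix is just to point wasmPaths at the matching version. A sketch (the URL pattern is copied from the 1.17.1 example above; whether the dev build publishes the same dist/ layout is an assumption on my part):
// Keep wasmPaths in sync with the installed onnxruntime-web version:
env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.19.0-dev.20240804-ee2fe87e2d/dist/';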
I wanted to try it, but the amount of magic spells you have to cast (aka Chrome “security” requirements) to make it work is hilarious.
I enabled this option: [screenshot]
And I don't get a SharedArrayBuffer anyway: [screenshot]
And ONNX only seems to test self.crossOriginIsolated, which is false, so I stopped before wasting more time on random server settings.
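For reference, both preconditions can be checked from the DevTools console; these are standard Web APIs, nothing ONNX-specific:
// Both need to be true for onnxruntime-web to go multi-threaded:
console.log('SharedArrayBuffer available:', typeof SharedArrayBuffer !== 'undefined');
console.log('crossOriginIsolated:', self.crossOriginIsolated);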
Edit: I figured it out and it seems to work for me. The last missing step on my Linux system was this command-line option:
./run.sh --enable-features=SharedArrayBuffer
run.sh is from https://github.com/scheib/chromium-latest-linux and ensures you run the latest Chromium.
I tested via this simple script:
// window.self = {crossOriginIsolated: true};
const {env, pipeline} = await import('https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0-alpha.5/dist/transformers.min.js');
// env.backends.onnx.wasm.numThreads = 1;
// E.g. 32 or whatever CPU you have.
env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency;
window.env = env;
const speaker_embeddings = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speaker_embeddings.bin';
const synthesizer = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {
  // The quantized version outputs audio distortions, so we deactivate it:
  quantized: false,
  // https://github.com/xenova/whisper-web/tree/experimental-webgpu
  // https://github.com/xenova/whisper-web/commit/1c52c346e0e4a7c6b6ad39791f2df3d101eca3c3
  // device: 'webgpu',
});
function getText() {
  return 'Hello world, don\'t count the days, make the days count. Wubba lubba dub dub!';
}
function newButton(innerText, onClick) {
  const button = document.createElement('button');
  button.innerText = innerText;
  button.onclick = onClick;
  document.body.append(button);
}
async function benchmark() {
  const start = Date.now();
  const data = await synthesizer(getText(), {speaker_embeddings});
  const end = Date.now();
  const delta = end - start;
  // delta is in milliseconds, so divide by 1000 for seconds:
  console.log('Took', delta / 1000, 'seconds using', env.backends.onnx.wasm.numThreads, 'threads to generate data:', data);
}
newButton("benchmark()", () => {
benchmark();
});
// This doesn't seem to affect anything afterwards:
newButton("num.threads = 1", () => {
env.backends.onnx.wasm.numThreads = 1;
benchmark();
});
newButton("num.threads = 2", () => {
env.backends.onnx.wasm.numThreads = 2;
benchmark();
});
newButton("num.threads = 3", () => {
env.backends.onnx.wasm.numThreads = 3;
benchmark();
});
newButton("num.threads = 4", () => {
env.backends.onnx.wasm.numThreads = 4;
benchmark();
});
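Since changing numThreads after the pipeline exists doesn't seem to do anything (see the comment above the buttons), a variant that rebuilds the pipeline per run might be needed to really compare thread counts. A sketch, assuming numThreads is only read when the WASM session is created:
// Hypothetical variant: rebuild the pipeline so the new numThreads value
// is picked up at session creation; reuses getText() and speaker_embeddings
// from the script above.
async function benchmarkWithThreads(n) {
  env.backends.onnx.wasm.numThreads = n;
  const synth = await pipeline('text-to-speech', 'Xenova/speecht5_tts', {quantized: false});
  const start = Date.now();
  const data = await synth(getText(), {speaker_embeddings});
  console.log(n, 'threads took', (Date.now() - start) / 1000, 'seconds:', data);
}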
Using only one core, the task takes 5-6 seconds. Using 32 cores the task takes 1-2 seconds. Kinda funny in itself, but I assume there is a lot of overhead somewhere or the task is just too small. Maybe other people can test and share their results.
Cool, nice experiment. And it's exactly the model I was hoping to speed up, as the rest of my voice-that-with-an-AI pipeline, namely Whisper and LLMs, enjoys some level of GPU speedup.
Would this work without casting magic spells on MacOS or Windows? Or is this still behind a flag on all browsers?
According to Can I Use, SharedArrayBuffer is widely implemented?
> Would this work without casting magic spells on MacOS or Windows? Or is this still behind a flag on all browsers?
If you send these two headers via PHP or some other server setup to enable cross-origin isolation, it might work on all OSes (I just couldn't be bothered to spawn a special server just for a quick test). But yeah, why is it even still "experimental" then? Only a test will tell; maybe you wanna do one?
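For reference, a minimal sketch of such a server setup in Node (header values as discussed in this thread; the inline test page is a stand-in for serving your actual files):
import http from 'node:http';

http.createServer((req, res) => {
  // The two headers that opt the page into cross-origin isolation:
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp'); // or 'credentialless'
  res.setHeader('Content-Type', 'text/html');
  res.end('<script type="module">console.log(self.crossOriginIsolated)</script>');
}).listen(8080);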
Yes, I'm already on it. I'm sure I have cross-origin isolation through .htaccess:
Header set Cross-Origin-Embedder-Policy "credentialless"
Header set Cross-Origin-Opener-Policy "same-origin"
Header set Cross-Origin-Resource-Policy "cross-origin"
AddType application/wasm wasm
It seems the only change I need to make to my code is to add this line?
env.backends.onnx.wasm.numThreads = navigator.hardwareConcurrency;
Which would mean that by default Transformers.js only uses one CPU thread for any WASM task? Is that really true? Is there a way to check how many threads Transformers.js is actually using?
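One way to sanity-check from the console, using the same env object as above (this only shows the configured value; I don't know of an API that reports the actual pool size back):
// Configured vs. available; crossOriginIsolated gates whether >1 can work at all:
console.log('crossOriginIsolated:', self.crossOriginIsolated);
console.log('configured numThreads:', env.backends.onnx.wasm.numThreads);
console.log('hardwareConcurrency:', navigator.hardwareConcurrency);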
Fun fact: I believe you can create cross-origin isolation via a service worker, no server-side adjustments needed: https://github.com/gzuidhof/coi-serviceworker
OK, I've run a test.
It seems that by default Transformers.js already uses multi-threading. What's more, it seems to already have picked the optimum number (which I often read is the number of threads divided by two). This bears out in the tests:
Text: I can turn any sentence or document that you provide into speech.
1 thread -> 9 seconds
2 threads -> 5.7 seconds
4 threads -> 3.5 seconds (automatic setting)
8 threads -> 4.5 seconds (my hardwareConcurrency)
Details: This is on a MacBook Pro M1 with 16 GB of RAM, using the Brave browser.
[screenshots: console output for the 1, 2, 4, and 8 thread runs]
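The "divided by two" heuristic from above, written out explicitly (assumption: halving hardwareConcurrency roughly approximates the physical core count on machines with SMT):
// Hypothetical explicit version of the automatic default observed above:
env.backends.onnx.wasm.numThreads = Math.max(1, Math.floor(navigator.hardwareConcurrency / 2));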
My settings:
static async getInstance(progress_callback = null) {
  // Guard the same field we assign below (the original checked
  // this.tokenizer_instance, which is never assigned):
  if (this.tokenizer === null) {
    this.tokenizer = AutoTokenizer.from_pretrained(this.model_id, { progress_callback });
  }
  if (this.model_instance === null) {
    this.model_instance = SpeechT5ForTextToSpeech.from_pretrained(this.model_id, {
      //quantized: false,
      dtype: 'fp32',
      quantized: true,
      progress_callback,
      device: self.device,
    });
  }
  if (this.vocoder_instance === null) {
    this.vocoder_instance = SpeechT5HifiGan.from_pretrained(this.vocoder_id, {
      //quantized: false,
      dtype: 'fp32',
      quantized: true,
      progress_callback,
      device: self.device,
    });
  }
  // The function is already async, so no Promise constructor is needed:
  const result = await Promise.all([
    this.tokenizer,
    this.model_instance,
    this.vocoder_instance,
  ]);
  self.postMessage({
    status: 'ready',
  });
  return result;
}
I'll check if quantization makes a difference. They all sounded fine to me already.
Did some voice quality testing.
I don't hear any difference really. In fact, the quantized one sounds a little better to me :-D
Feature request
// Due to a bug in onnxruntime-web, we must disable multithreading for now.
// See "[Web] chrome V3 extension TypeError: URL.createObjectURL is not a function" (microsoft/onnxruntime#14445) for more information.
env.backends.onnx.wasm.numThreads = 1;
The issue https://github.com/microsoft/onnxruntime/issues/14445 has since been closed. Is it possible to enable multi-threading for onnxruntime-web with WASM?
Motivation
Update the onnxruntime-web version to support multi-threading.
Your contribution
no