Open GianlucaIavicoli opened 2 weeks ago
Hi there 👋 Can you try using the full precision model with:
const transcriber = await pipeline(task, model, { quantized: false });
? Thanks!
I'll try right now, thanks for the fast response.
I tried the provided solution, but I encountered the following errors during execution:
std::bad_alloc Error:
std::bad_alloc ort-wasm-simd-threaded.wasm:0x818b78
OrtRun Error:
An error occurred during model execution: "Error: failed to call OrtRun(). error code = 6."
This type of error also occurs when using the Whisper medium or higher models, such as Whisper large. Would it be better to try using the experimental WebGPU branch to potentially resolve these errors?
Could you please provide further guidance on resolving these errors?
I have run several tests with WebGPU, but it always runs out of memory. My question: is it really impossible to use the small model on my RTX 3050, or am I doing something wrong and need to optimize somewhere? I checked directly with "nvidia-smi", and usage does go above the 4 GB of VRAM available.
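As a rough sanity check on the memory question above, the sketch below estimates only the weight footprint from the approximate published Whisper parameter counts. It ignores activations, the decoder cache, and runtime overhead, so real usage will be higher. The names `WHISPER_PARAMS`, `BYTES_PER_PARAM`, and `estimateWeightBytes` are illustrative and not part of Transformers.js.

```javascript
// Approximate published parameter counts for the Whisper checkpoints.
const WHISPER_PARAMS = {
  tiny: 39e6,
  base: 74e6,
  small: 244e6,
  medium: 769e6,
  large: 1550e6,
};

// Bytes needed to store one parameter at each precision.
const BYTES_PER_PARAM = { fp32: 4, fp16: 2, int8: 1 };

// Weight-only estimate: parameter count times bytes per parameter.
// This is a lower bound on real memory use, not a prediction of it.
function estimateWeightBytes(size, dtype) {
  if (!(size in WHISPER_PARAMS)) throw new Error(`Unknown model size: ${size}`);
  if (!(dtype in BYTES_PER_PARAM)) throw new Error(`Unknown dtype: ${dtype}`);
  return WHISPER_PARAMS[size] * BYTES_PER_PARAM[dtype];
}

// Convert bytes to GiB for easier comparison with a 4 GB budget.
function toGiB(bytes) {
  return bytes / 2 ** 30;
}

for (const size of Object.keys(WHISPER_PARAMS)) {
  const gib = toGiB(estimateWeightBytes(size, 'fp32'));
  console.log(`${size}: ~${gib.toFixed(2)} GiB of fp32 weights`);
}
```

By this estimate the small model's fp32 weights come to under 1 GiB, so exceeding 4 GB would have to come from something other than the weights alone, for example intermediate buffers or duplicated copies. This is an inference from the arithmetic, not a diagnosis.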
Is it possible to get word-level timestamps via WebGPU with an ONNX model? I haven't had any luck getting it to run. I would love to see an example if one exists.
@xenova I think we could look at WhisperX, which uses "wav2vec2-base-960h" to get more accurate timestamps.
System Info
Framework: Vue.js 3
Transformers.js version: 2.17.1
Environment/Platform
Description
I am using the Transformers.js library for automatic speech recognition (ASR) to transcribe audio from a video. While attempting to get word-level timestamps, I noticed that the timestamps for some words are inaccurately extended, particularly when there are pauses in the audio. Below is a detailed description of the problem and the related code.
Code Snippet
Problem
When using return_timestamps: 'word', the word timestamps are inaccurately extended in the presence of pauses. For example:
Here, the word "freedom" starts at around 4s and ends at 6s, which is incorrect because there's a pause in the video. Whisper makes the word duration longer than it is.
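One way to surface words like this is to scan the word-level chunks for implausibly long durations. The sketch below assumes the output has the shape `{ text, chunks: [{ text, timestamp: [start, end] }] }`. The helper name `flagLongWords`, the 1.5 s threshold, and the sample data are all hypothetical, with the sample modeled on the "freedom" case described above rather than taken from real output.

```javascript
// Flag word chunks whose duration exceeds a threshold (in seconds).
// Chunks with a missing start or end timestamp are skipped.
function flagLongWords(chunks, maxWordSeconds = 1.5) {
  return chunks
    .filter(({ timestamp: [start, end] }) =>
      start != null && end != null && end - start > maxWordSeconds)
    .map(({ text, timestamp: [start, end] }) => ({
      text: text.trim(),
      start,
      end,
      duration: end - start,
    }));
}

// Illustrative data only: "freedom" spans a pause from 4 s to 6 s.
const exampleChunks = [
  { text: ' for', timestamp: [3.5, 3.9] },
  { text: ' freedom', timestamp: [4.0, 6.0] },
  { text: ' and', timestamp: [6.0, 6.2] },
];

console.log(flagLongWords(exampleChunks));
```

This only detects suspicious words; it does not correct them. Fixing the end time would need extra information, such as silence detection on the audio.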
Conversely, using return_timestamps: true for sentence-level timestamps works correctly:
Reproduction