xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Inaccurate Word Timestamps in ASR Transcription #805

Open GianlucaIavicoli opened 2 weeks ago

GianlucaIavicoli commented 2 weeks ago

System Info

Framework: Vue.js 3
Transformers.js version: 2.17.1

Environment/Platform

Description

I am using the Transformers.js library for automatic speech recognition (ASR) to transcribe the audio from a video. While trying to get word-level timestamps, I noticed that the timestamps for some words are inaccurately extended, particularly when there are pauses in the audio. Below is a detailed description of the problem and the related code.

Code Snippet

import { pipeline } from '@xenova/transformers';

async function transcribe(audio) {
  const task = 'automatic-speech-recognition';
  const model = 'Xenova/whisper-small.en';

  // Load the ASR pipeline (the model is downloaded and cached on first use)
  const transcriber = await pipeline(task, model);

  // Split the audio into 30 s chunks with a 5 s stride between them
  const chunksLength = 30;
  const strideLength = 5;

  const output = await transcriber(audio, {
    return_timestamps: 'word',   // request word-level timestamps
    chunk_length_s: chunksLength,
    stride_length_s: strideLength,
    top_k: 0,
    do_sample: false,            // greedy decoding
  });
  return output;
}

Problem

When using return_timestamps: 'word', the word timestamps are inaccurately extended in the presence of pauses. For example:

[
  {
    "text": " Lois's",
    "timestamp": [3.36, 3.94]
  },
  {
    "text": " freedom.",
    "timestamp": [3.94, 6.02]
  },
  {
    "text": " You",
    "timestamp": [6.02, 7]
  }
]

Here, the word "freedom" is reported as starting at around 4 s and ending at 6 s, which is incorrect: there is a pause in the video after the word, and Whisper absorbs that silence into the word's duration, making it appear longer than it actually is.

Conversely, using return_timestamps: true for sentence-level timestamps works correctly:

[
  {
    "text": " I'll surrender. But only if you guarantee Lois's freedom.",
    "timestamp": [1.76, 4.32]
  },
  {
    "text": " You let them handcuff you?",
    "timestamp": [6.82, 8.32]
  }
]

Reproduction

  1. Extract the audio from the video -> https://storage.googleapis.com/test_saas/user-6640c09d732deba3ffb776dd/66634cf38a6b941fb67118ea/clips/test.mp4.
  2. Use the provided code to transcribe the audio, once with return_timestamps: 'word' and once with return_timestamps: true (see the sketch after this list).
  3. Observe the incorrect word-level timestamps in the output.
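
For completeness, a minimal sketch of steps 1–2 as they could run in the browser; the loadAudio helper (Web Audio API decoding to 16 kHz mono) and the side-by-side comparison are illustrative assumptions, not part of the original report:

import { pipeline } from '@xenova/transformers';

// Illustrative helper: decode an audio file into the 16 kHz mono Float32Array
// that the Whisper feature extractor expects.
async function loadAudio(url) {
  const audioCtx = new AudioContext({ sampleRate: 16000 });
  const buffer = await (await fetch(url)).arrayBuffer();
  const decoded = await audioCtx.decodeAudioData(buffer);
  return decoded.getChannelData(0); // first channel only
}

// Illustrative helper: run the same audio through both timestamp modes so the
// word-level and sentence-level outputs can be compared side by side.
async function compareTimestamps(url) {
  const audio = await loadAudio(url);
  const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/whisper-small.en');

  const wordLevel = await transcriber(audio, {
    return_timestamps: 'word',
    chunk_length_s: 30,
    stride_length_s: 5,
  });
  const sentenceLevel = await transcriber(audio, {
    return_timestamps: true,
    chunk_length_s: 30,
    stride_length_s: 5,
  });

  console.log(wordLevel.chunks, sentenceLevel.chunks);
}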
xenova commented 2 weeks ago

Hi there 👋 Can you try using the full precision model with:

const transcriber = await pipeline(task, model, { quantized: false });

? Thanks!

GianlucaIavicoli commented 2 weeks ago

I'll try right now, thanks for the fast response.

GianlucaIavicoli commented 2 weeks ago

I tried the provided solution, but I encountered the following errors during execution:

  1. std::bad_alloc Error:

    std::bad_alloc  ort-wasm-simd-threaded.wasm:0x818b78
  2. OrtRun Error:

    An error occurred during model execution: "Error: failed to call OrtRun(). error code = 6."

Question

This type of error also occurs with the Whisper medium or larger models, such as Whisper large. Would it be better to try the experimental WebGPU branch to potentially resolve these errors?

Could you please provide further guidance on resolving these errors?
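
For reference, the experimental v3/WebGPU branch exposes device and dtype options on the pipeline; a minimal sketch, assuming that branch is installed (the import path and option names come from the v3 API and may differ from the stable @xenova/transformers release):

// Hedged sketch for the experimental WebGPU (v3) branch.
import { pipeline } from '@xenova/transformers';

const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-small.en',
  {
    device: 'webgpu', // v3 option: run inference on the GPU via WebGPU
    dtype: 'fp16',    // v3 option: half-precision weights to reduce memory use
  },
);

const output = await transcriber(audio, { return_timestamps: 'word' });
console.log(output.chunks);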

GianlucaIavicoli commented 2 weeks ago

I have carried out several tests with WebGPU, but it always runs out of memory. My question: is it really possible that I can't use the small model on my RTX 3050, or am I doing something wrong and need to optimize somewhere? I checked directly with "nvidia-smi", and usage does go above the 4 GB of VRAM available.
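
One thing that might help fit within 4 GB is mixed-precision loading on the v3 branch, where dtype can reportedly be set per module; a hedged sketch (the module keys below are assumptions based on the names of the exported ONNX files):

// Hedged sketch: load the encoder and decoder at different precisions to cut VRAM use.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-small.en',
  {
    device: 'webgpu',
    dtype: {
      encoder_model: 'fp16',       // keep the encoder in half precision
      decoder_model_merged: 'q4',  // quantize the decoder more aggressively
    },
  },
);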

decoder-sh-david commented 2 weeks ago

Is it possible to get word-level timestamps via WebGPU on an ONNX model? I haven't had any luck getting it to run. Would love to see an example if one exists.

paul726 commented 6 days ago

@xenova I think we could take a cue from WhisperX, which uses "wav2vec2-base-960h" to get more accurate timestamps.