xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Whisper model word-level timestamps broken #551

Open BjoernRave opened 7 months ago

BjoernRave commented 7 months ago

System Info

"@xenova/transformers": "^2.14.0",

MacBook with M2 chip, macOS Sonoma

Node.js: 20.11.0

Environment/Platform

Description

I am running whisper like this:

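import { env, pipeline } from "@xenova/transformers"

// convertAudioToFloat32Array is my own helper (not shown here).
export const speechToText = async (audio: Buffer) => {
  const float32Array = await convertAudioToFloat32Array(audio)
  // Do not look for the model on the local filesystem.
  env.allowLocalModels = false

  const transcriber = await pipeline(
    "automatic-speech-recognition",
    "Xenova/whisper-large-v3",
  )
  // Request word-level timestamps.
  const output = await transcriber(float32Array, {
    return_timestamps: "word",
  })

  return output
}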

However, the returned word-level timestamps are all equal to the total duration of the audio file.
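The symptom above can be checked programmatically. The following is a minimal sketch, assuming the pipeline output exposes a `chunks` array whose entries carry a `[start, end]` timestamp pair; that shape and the helper name `timestampsLookDegenerate` are assumptions for illustration and are not confirmed in this thread. The sample values are made up.

```typescript
// Assumed shape of one word-level entry in the pipeline output (not confirmed here).
interface WordChunk {
  text: string;
  timestamp: [number, number | null];
}

// Hypothetical helper: returns true when every chunk carries the same
// [start, end] pair, which matches the symptom described in this issue.
function timestampsLookDegenerate(chunks: WordChunk[]): boolean {
  const first = chunks[0];
  if (first === undefined || chunks.length < 2) {
    return false; // Not enough words to tell.
  }
  const [start0, end0] = first.timestamp;
  return chunks.every(
    (c) => c.timestamp[0] === start0 && c.timestamp[1] === end0,
  );
}

// Made-up sample data: every word stamped with the same value.
const broken: WordChunk[] = [
  { text: " Hello", timestamp: [29.98, 29.98] },
  { text: " world", timestamp: [29.98, 29.98] },
];

// Made-up sample data: timestamps that advance word by word.
const healthy: WordChunk[] = [
  { text: " Hello", timestamp: [0.0, 0.42] },
  { text: " world", timestamp: [0.42, 0.9] },
];

console.log(timestampsLookDegenerate(broken));
console.log(timestampsLookDegenerate(healthy));
```

A check like this could run right after the `transcriber(...)` call to flag affected runs before the timestamps are used downstream.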

During the run, my console also gets flooded with logs like these:

2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230677 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.20/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230684 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.12/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230693 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.6/final_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230701 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.11/encoder_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230708 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.4/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230716 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.1/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230741 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.0/final_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230761 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.10/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230773 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.8/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230782 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.3/final_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.230 node[87226:3900219] 2024-01-30 15:14:37.230797 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.6/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230806 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.2/self_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230814 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.0/self_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230844 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.17/self_attn_layer_norm/Constant_1_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230855 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.19/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230865 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.0/encoder_attn_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.
2024-01-30 15:14:37.231 node[87226:3900219] 2024-01-30 15:14:37.230873 [W:onnxruntime:, graph.cc:3490 CleanUnusedInitializersAndNodeArgs] Removing initializer '/model/decoder/layers.5/final_layer_norm/Constant_output_0'. It is not used by any node and should be removed from the model.

There is a related PR in the Python project: https://github.com/huggingface/transformers/pull/25607

Reproduction

  1. Call whisper with return_timestamps: "word"
  2. Inspect output

xenova commented 7 months ago

Quoting the reproduction steps:

  1. Call whisper with return_timestamps: "word"
  2. Inspect output

Could you please provide a link to the audio file tested?

wobbble commented 5 months ago

Hey @xenova, thanks a lot for this awesome project. I am also seeing wrong timestamps. From my tests, changing the stride parameter seems to fix it, but it may be a deeper issue.

Whisper web with only return_timestamps: "word":

Screenshot 2024-04-03 at 10 26 14

Whisper web with word-level timestamps and stride_length_s fixed to 3 in worker.js (line 160), i.e. stride_length_s: 3 instead of stride_length_s: isDistilWhisper ? 3 : 5:

Screenshot 2024-04-03 at 10 19 20
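For anyone who wants to try the same workaround outside Whisper web, here is a minimal sketch of the options object with the stride pinned to 3 seconds. The helper name `buildWordTimestampOptions`, the default `chunk_length_s` of 30, and the sanity guard are assumptions for illustration; only `return_timestamps: "word"` and `stride_length_s: 3` come from this thread, and whether this fixes the root cause is untested.

```typescript
// Options discussed in this thread; chunk_length_s is an assumed value.
interface WordTimestampOptions {
  return_timestamps: "word";
  chunk_length_s: number;
  stride_length_s: number;
}

// Hypothetical helper: builds the options with a fixed stride (default 3 s).
function buildWordTimestampOptions(
  strideLengthS: number = 3,
  chunkLengthS: number = 30,
): WordTimestampOptions {
  // Sanity guard (my own assumption): the stride is applied on both sides of
  // a chunk, so the chunk must be longer than twice the stride to advance.
  if (strideLengthS < 0 || chunkLengthS <= 2 * strideLengthS) {
    throw new RangeError("chunk_length_s must be greater than 2 * stride_length_s");
  }
  return {
    return_timestamps: "word",
    chunk_length_s: chunkLengthS,
    stride_length_s: strideLengthS,
  };
}

const options = buildWordTimestampOptions();
console.log(options);

// Intended use with an existing pipeline instance (not executed here):
// const output = await transcriber(float32Array, options);
```

Keeping the stride in one helper makes it easy to compare runs with 3 and 5 seconds on the same audio file.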

CodeSandbox link with the changes that fix the timestamps

Attaching the audio file I tested with: output.wav.zip

Thanks and have a great day!