Hi @xenova,

I was attempting to run the experimental WebGPU version of Whisper with word-level timestamps and ran into a couple of issues in the current implementation. In the `_call_whisper` method of the `AutomaticSpeechRecognition` pipeline, `data.token_timestamps` is not generated yet, which leads to an exception.
Additionally, there is no `data.sequences` in the output of the `generate` function, which causes this error:

```
TypeError: Cannot read properties of undefined (reading '0')
    at Function._call_whisper (@xenova_transformers.js?v=b22fd6f2:26944:40)
    at async transcribe (processClip.js:57:18)
    at async self.onmessage (processClip.js:69:22)
```
I'm wondering what might be missing to use word-level timestamps with WebGPU, considering the default implementation with WASM works fine.
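For reference, here is a minimal sketch of how I'm triggering the issue. The model id and the `device` option are illustrative (the exact option name may differ in the experimental WebGPU branch); `audio` stands for a `Float32Array` of 16 kHz samples:

```javascript
import { pipeline } from '@xenova/transformers';

// Hypothetical repro sketch: model id and `device` option are assumptions,
// not necessarily the exact API of the experimental WebGPU build.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en',
  { device: 'webgpu' } // experimental WebGPU backend
);

// `audio`: Float32Array of mono 16 kHz PCM samples.
const output = await transcriber(audio, {
  return_timestamps: 'word', // word-level timestamps; works fine on the WASM backend
  chunk_length_s: 30,
});
```

With the WASM backend, the same call returns per-word `chunks` with start/end times; on WebGPU it throws as described above.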