WebGPU pipeline supporting word-level timestamps

Hi @xenova,

I tried running the experimental WebGPU version of Whisper with word-level timestamps and ran into a couple of issues in the current implementation. In the _call_whisper method of the AutomaticSpeechRecognition pipeline, data.token_timestamps isn't generated yet, which leads to an exception:

const data = await this.model.generate({
  inputs: chunk.input_features,
  ...kwargs
});
if (return_timestamps === "word") {
  chunk.tokens = data.sequences[0].tolist();
  chunk.token_timestamps = data.token_timestamps.tolist()[0].map(
    (x) => round(x, 2)
  );
} else {
  chunk.tokens = data[0].tolist();
}

Additionally, there is no data.sequences in the output of the generate function, which causes this error:

TypeError: Cannot read properties of undefined (reading '0')
    at Function._call_whisper (@xenova_transformers.js?v=b22fd6f2:26944:40)
    at async transcribe (processClip.js:57:18)
    at async self.onmessage (processClip.js:69:22)
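
For illustration, a guard like the one below would at least replace the TypeError with a clearer message. This is only a sketch under assumptions: extractWhisperOutput and fakeTensor are hypothetical names of my own, not part of transformers.js, and the only output shapes it relies on are the ones used in the snippet above (data.sequences, data.token_timestamps, and data[0], each exposing tolist()). It does not make the WebGPU build produce timestamps.

```javascript
// Hypothetical helper (NOT part of transformers.js): reads the generate() output
// using only the access patterns from the pipeline snippet above.
function extractWhisperOutput(data, returnTimestamps) {
  // Round to two decimals, standing in for round(x, 2) in the pipeline.
  const round2 = (x) => Math.round(x * 100) / 100;

  if (returnTimestamps !== "word") {
    // Plain output: the first entry holds the token ids.
    return { tokens: data[0].tolist(), token_timestamps: null };
  }

  // Word-level timestamps need both fields; fail with a descriptive error
  // instead of "Cannot read properties of undefined (reading '0')".
  if (!data || data.sequences === undefined || data.token_timestamps === undefined) {
    throw new Error(
      "generate() output has no sequences/token_timestamps; " +
        "word-level timestamps are not available from this model output."
    );
  }

  return {
    tokens: data.sequences[0].tolist(),
    token_timestamps: data.token_timestamps.tolist()[0].map(round2),
  };
}

// Usage with a stand-in object (assumption: real tensors expose tolist()).
const fakeTensor = (values) => ({ tolist: () => values });
const wordOutput = {
  sequences: [fakeTensor([1, 2, 3])],
  token_timestamps: fakeTensor([[0.123, 0.456, 0.5]]),
};
console.log(extractWhisperOutput(wordOutput, "word"));
```

With the output I currently get from the WebGPU build, the "word" branch of this helper would throw the descriptive error rather than the TypeError.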

I'm wondering what is missing for word-level timestamps to work with WebGPU, given that the default WASM implementation works fine.
