xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js

Long running transcription using webgpu-whisper #802

Open iamhitarth opened 2 weeks ago

iamhitarth commented 2 weeks ago

Question

Noob question: the webgpu-whisper demo does real-time transcription, but it doesn't build up a full transcript from the start, i.e. 2 minutes into the transcription, the first few transcribed lines disappear.

Transcript at time x 👇

Cool, let's test this out. We'll see how this works. So turns out that the transcription when I try to access it is actually just empty. And so the only thing that actually comes through is. So yeah, so the output that's getting cut is basically coming from the

Transcript at time x+1 👇

this out, we'll see how this works. So turns out that the transcription when I try to access it is actually just empty. And so the only thing that actually comes through is. So yeah, so the output that's getting cut is basically coming from the work

Note how "Cool, let's test" is missing from the start of the second transcript.

I'm wondering what it would take to keep building the transcript for a long-running meeting without losing any of the previously transcribed text.

I tried a naive appending approach, and that just results in a transcript full of repetition (sketched below).
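For reference, the naive approach was roughly along these lines (a simplified sketch; `onChunk` is a made-up callback name, not part of the demo):

```ts
// Running transcript, naively extended with every model output.
let transcript = "";

// Hypothetical callback (not part of the demo) invoked with the text of
// each new transcription window. Because the demo re-transcribes a
// sliding window of recent audio, consecutive outputs overlap heavily,
// so plain concatenation repeats most of the previous text.
function onChunk(text: string): void {
  transcript += (transcript ? " " : "") + text;
}
```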

So I'm very curious what it would take to build streaming transcription similar to what something like Deepgram offers. Would that require a change to the pipeline? Are there models that can take an appended transcript full of repetition and trim it down to a clean one?
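For example, I could imagine a string-level heuristic like the following (a hypothetical sketch, not something I've tested), though it seems brittle given that Whisper's wording and punctuation can change between windows:

```ts
// Merge a new window's text into the running transcript by trimming the
// overlap: find the longest prefix of the new text that is a suffix of
// what we already have, and append only the remainder.
function mergeOverlap(existing: string, incoming: string): string {
  const a = existing.trim();
  const b = incoming.trim();
  const max = Math.min(a.length, b.length);
  for (let len = max; len > 0; len--) {
    if (a.endsWith(b.slice(0, len))) {
      return a + b.slice(len);
    }
  }
  // No overlap found: fall back to plain concatenation.
  return a + (a && b ? " " : "") + b;
}
```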

Please let me know if my questions are unclear. I'm just looking for some direction so that I can potentially put up a PR for this (if needed).

xenova commented 1 week ago

Hi there 👋 Indeed, that demo only considers the latest 30 seconds of audio, and was meant more to showcase the model's ability to run in real time with WebGPU. The rest of the pipeline should be implemented by the user, since this is out of scope for the transformers.js library (at least for now). I suggest you take a look at this paper, which details a nice way of doing this.
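To give a rough idea of the kind of policy used in the streaming-ASR literature, here is a hedged sketch of an agreement-based commit strategy (often called "local agreement"); this is not code from the demo or the library, and the names are made up. A token is only finalized once two consecutive transcriptions of the same audio agree on it, so the flickering tail of the hypothesis is held back:

```ts
// Sketch of an agreement-based commit policy. Assumptions: `update` is
// called with the full token hypothesis for the current audio buffer
// each time the model runs, and committed audio is eventually trimmed
// from the buffer so the window can keep sliding.
let committed: string[] = []; // finalized, never-retracted tokens
let prevHyp: string[] = [];   // previous run's full hypothesis

function update(hypothesis: string[]): void {
  // Length of the longest common prefix of the two hypotheses.
  let agree = 0;
  while (
    agree < hypothesis.length &&
    agree < prevHyp.length &&
    hypothesis[agree] === prevHyp[agree]
  ) {
    agree++;
  }
  // Commit tokens that two consecutive hypotheses agree on.
  if (agree > committed.length) {
    committed = committed.concat(hypothesis.slice(committed.length, agree));
  }
  prevHyp = hypothesis;
}
```

Since `committed` never changes once written, a UI can render it as final text while the uncommitted tail keeps updating in place.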

Hope that helps!