iamhitarth opened this issue 2 weeks ago
Hi there! Indeed, that demo only considers the latest 30 seconds of audio, and was more to showcase the ability of the model to run in real-time with WebGPU. The rest of the pipeline should be implemented by the user, since this is out-of-scope for the transformers.js library (at least for now). I suggest you take a look at this paper, which details a nice way of doing this.
Hope that helps!
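To make the "implemented by the user" part concrete, here is a minimal sketch of one way to stitch consecutive window transcripts together: keep a running transcript and, for each new chunk, drop the longest prefix of the chunk that the running transcript already ends with. The function name and structure are illustrative, not part of transformers.js.

```javascript
// Hypothetical helper: merge a new window's transcript into the running
// transcript by finding the longest overlap between the end of `full`
// and the start of `chunk`, then appending only the non-overlapping tail.
function mergeTranscripts(full, chunk) {
  const maxOverlap = Math.min(full.length, chunk.length);
  // Try the longest possible overlap first, then shrink.
  for (let len = maxOverlap; len > 0; len--) {
    if (full.endsWith(chunk.slice(0, len))) {
      return full + chunk.slice(len);
    }
  }
  // No overlap found: append with a separating space.
  return full + (full && chunk ? " " : "") + chunk;
}

// Example: consecutive windows share overlapping words.
let transcript = "";
transcript = mergeTranscripts(transcript, "Cool, let's test this demo");
transcript = mergeTranscripts(transcript, "this demo and see how it works");
// transcript === "Cool, let's test this demo and see how it works"
```

Note that exact string matching is the optimistic case: the model will often transcribe the overlapping audio slightly differently in each window, so in practice you'd want a fuzzy or word-level alignment (which is what the linked paper addresses) rather than this exact-match version.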
Question
Noob question: the webgpu-whisper demo does real-time transcription, but it doesn't build out a full transcript from the start. That is, two minutes into transcription, the first few transcribed lines disappear.
[Screenshot: transcript at time x]
[Screenshot: transcript at time x+1]
Note how the "Cool, let's test" is missing from the start of the second transcript.
I'm wondering what it would take to keep building the transcript for a long-running meeting without losing any of the previously transcribed text.
I tried a naive appending approach, and that just results in a transcript full of repetition.
So I'm very curious what it would take to build streaming transcription similar to what something like Deepgram offers. Would that require a change to the pipeline? Are there models that can take an appended transcript with lots of repetition and trim it down to a clean one?
Please let me know if my questions are unclear. Just looking for some direction so that I can potentially put up a PR for this (if needed).