xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Speaker timestamping sometimes fails #880

Open flatsiedatsie opened 1 month ago

flatsiedatsie commented 1 month ago

System Info

MacBook Pro M1, Transformers.js v3

Environment/Platform

Description

I've been trying to transcribe the presidential debate between Trump and Biden. I'm using the following V3 WebGPU pipeline:
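Roughly like this (a minimal sketch; the exact model and options here are placeholders, not necessarily what I'm running):

```js
import { pipeline } from '@huggingface/transformers';

// Whisper with word-level timestamps, running on WebGPU.
// The model name is a placeholder for whichever timestamped checkpoint is used.
const transcriber = await pipeline(
    'automatic-speech-recognition',
    'onnx-community/whisper-base_timestamped',
    { device: 'webgpu' },
);
```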

followed by
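a transcription call along these lines (again a sketch; the options shown are typical, not necessarily the exact ones I use):

```js
// `audio` is a Float32Array of 16 kHz mono samples taken from the VAD buffer.
const output = await transcriber(audio, {
    return_timestamps: 'word',
    chunk_length_s: 30,
    language: 'english',
});
// output.chunks then looks like: [{ text: ' Thank', timestamp: [0.56, 0.78] }, ...]
```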

This generally works well if I pass in simple, spaced out commands.

But Trump and Biden talk fast and with little silence between their sentences, so the VAD is sometimes forced to cut the recording buffer into chunks and send them to Whisper early. I currently do that whenever the recording grows longer than 8 seconds.

The 8-second chunks are transcribed just fine. However, their timestamps are incorrect. They are all 18.02. If there are multiple segments (multiple speakers), then assigning text to a specific speaker becomes tricky.
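For context, the speaker assignment itself is just a matter of matching word timestamps against the diarization segments, roughly like this (a simplified sketch; the property names are illustrative):

```js
// Give each transcribed word the label of the diarization segment that
// contains its midpoint. Property names here are illustrative.
function assignSpeakers(words, segments) {
    return words.map((word) => {
        const [start, end] = word.timestamp;
        const mid = (start + end) / 2;
        const segment = segments.find((s) => mid >= s.start && mid < s.end);
        return { ...word, speaker: segment ? segment.label : 'unknown' };
    });
}
```

So when every word reports the same timestamp, they all collapse onto a single point and land in whichever segment happens to contain that value.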

[Screenshot: 2024-08-05 at 12:06:51]

What could be causing this?

I tried to see whether whisper-timestamped has some requirement that the audio array length be an exact multiple of something, but could not spot one.

Reproduction

I'm not entirely sure yet, as it happens intermittently.

I wonder if it happens when it's given an audio array that starts mid-sentence. I've also tried padding the ends of the audio array with zeros, but that resulted in errors.

I've now modified the VAD to hold off on cutting the audio until it detects a single audio frame of silence. This may have led to some improvement; I'm testing that now.
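Conceptually the change amounts to something like this (a sketch only; the frame handling and constants are made up for illustration):

```js
const SAMPLE_RATE = 16000;
const MAX_BUFFER_SECONDS = 8;

// Only flush the buffer to Whisper once it has grown past the limit AND the
// most recent VAD frame was classified as silence.
function maybeFlush(buffer, lastFrameWasSpeech, sendToWhisper) {
    const tooLong = buffer.length > MAX_BUFFER_SECONDS * SAMPLE_RATE;
    if (tooLong && !lastFrameWasSpeech) {
        sendToWhisper(buffer);
        return new Float32Array(0); // start a fresh buffer
    }
    return buffer;
}
```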

flatsiedatsie commented 1 month ago

After more testing I think it was related to feeding it audio that didn't start with silence. While it still fails sometimes, it does so way less frequently now.

flatsiedatsie commented 4 weeks ago

I've learnt a bit more about how the diarization pipeline and whisper-timestamped interact. This is because I've added the ability to transcribe audio and video files, which has greatly simplified testing and made it very repeatable.

With a 4-minute test cut of the Trump-Biden debate, this is what I'm seeing:

1. Even with one long, continuous audio array the timestamping issue still occurs. More importantly, I now understand what's going on. It happens when the model tries to place sentences at the edge of segments. The issue is basically that the segments and the sentence timestamps initially don't agree on where they end.

Squeezing

Say the model 'wants' a sentence to fit in segment 3, but the sentence would run over into the next segment. It then tries to 'squeeze' the sentence data to fit inside the remaining time available in the segment. While all of the sentence's timestamps are affected, this happens mostly towards the latter words of the sentence. In that case the start and end timestamps of the last words will almost certainly be the same, giving those words a duration of zero.

Elongation

The opposite can also happen. A sentence may technically start in segment 3, but some part of the model believes it should actually be in segment 4. It will then oddly stretch out the sentence's timestamps: suddenly the sentence "Thank you" is 12 seconds long, if the word timestamps are to be believed, and most of that sentence now overlaps with the next segment.

I've written a lot of code that tries to detect this and then massage things into place.
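To give an idea of the kind of checks involved (illustrative only, not my actual code, and the thresholds are arbitrary):

```js
// Squeezing: the last word's start and end timestamps collapse to (nearly) the
// same value, i.e. a duration of zero.
function looksSqueezed(words) {
    const last = words[words.length - 1];
    return last && (last.timestamp[1] - last.timestamp[0]) < 0.01;
}

// Elongation: the sentence spans far more time than its word count can
// plausibly account for (e.g. "Thank you" lasting 12 seconds).
function looksElongated(words, maxSecondsPerWord = 2) {
    const span = words[words.length - 1].timestamp[1] - words[0].timestamp[0];
    return span > words.length * maxSecondsPerWord;
}
```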

But it gets more interesting.

2. Sometimes the first word of a sentence will be >>. This is very useful, as it indicates that one of the models believes a new speaker has started speaking at that point. Perhaps it's supposed to be a "fast forward" indicator, to signify that time has been sped up; I'm not sure yet. But I'm very glad it's there, as it's a solid hint that the sentence should be moved to the next segment. Perhaps Whisper Timestamped is also capable of some form of basic segmentation?

I've only seen this >> token appear in longer transcriptions. Normally, when using voice chat, I pass shorter audio segments to the Whisper worker, and there it doesn't show up. I wish it always did; it's so useful, and easy to filter out.
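Filtering it out while keeping the hint can be as simple as something like this (a sketch; it assumes the marker arrives as its own chunk in the output):

```js
// Remember where the speaker-change hints were, then drop the '>>' markers
// from the transcript itself.
const speakerChangeIndices = [];
const cleanedChunks = output.chunks.filter((chunk, i) => {
    if (chunk.text.trim() === '>>') {
        speakerChangeIndices.push(i);
        return false;
    }
    return true;
});
```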

3. There seems to be a bug or small oversight where the segmentation model can only diarize two speakers. Technically it should be capable of separating out three speakers. In practice it labels the first speaker it discovers as 'speaker 2' and the second as 'speaker 3'; it skips speaker 1 for some reason. That's too bad, as it would be great if the model could handle three people.

4. Sometimes Whisper conks out and doesn't transcribe a sentence. For example, at some point the first long sentence that Biden speaks after the moderator gives him the floor is just... missing. The words aren't in the chunks output at all. Probably just a fluke, but I'll keep my eyes open for it.

paschaldev commented 4 weeks ago

I experienced this issue as well

flatsiedatsie commented 1 week ago

There seems to be a bug or small oversight where the segmentation model can only diarize 2 speakers. Technically it should be capable of separating out three speakers

I dove into the Transformers.js code today, and replicated a local version of post_process_speaker_diarization with lots of commenting so I could see what was going on.

For each frame, the model theoretically returns which of 7 possible outputs has the highest value:

0 (silence or non-speech sound, such as laughter)
1 (speaker 1 solo)
2 (speaker 2 solo)
3 (speaker 3 solo)
4 (multiple speakers: 1 + 2)
5 (multiple speakers: 1 + 3)
6 (multiple speakers: 2 + 3)

I don't think there is an issue in the code. For reasons beyond my comprehension the model never returns a high value for segment ID 1.
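Stripped of the details, the post-processing boils down to an argmax over those 7 scores per frame, followed by merging consecutive frames that picked the same class (a simplified sketch of the idea, not the library's exact code):

```js
// logits: flat array of per-frame scores, laid out as [frame][class].
// Returns merged runs of consecutive frames with the same winning class.
function postProcessDiarization(logits, numFrames, numClasses = 7) {
    const segments = [];
    let current = null;
    for (let f = 0; f < numFrames; ++f) {
        // Argmax over the 7 class scores for this frame.
        let best = 0;
        for (let c = 1; c < numClasses; ++c) {
            if (logits[f * numClasses + c] > logits[f * numClasses + best]) best = c;
        }
        if (current && current.id === best) {
            current.end = f + 1; // extend the running segment
        } else {
            if (current) segments.push(current);
            current = { id: best, start: f, end: f + 1 };
        }
    }
    if (current) segments.push(current);
    return segments; // frame indices; class 0 = non-speech, 1-3 solo, 4-6 overlap
}
```

With that in place, the symptom is simply that class 1 never wins the argmax.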

This code for the Python version seems to imply there may be a setting for detecting three speakers?

[Screenshot: 2024-09-04 at 23:13:25]
flatsiedatsie commented 6 days ago

I've attempted to create a hacky workaround: whenever a segment is longer than 5 seconds I run the segmentation model on just that segment, and so on recursively.
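Roughly like this (a sketch; `segmentAudio` stands in for the actual segmentation call, and times are in seconds):

```js
// Keep splitting: whenever a segment is longer than `maxLength` seconds, run
// the segmentation model on just that slice of audio and recurse on the result.
// `segmentAudio` is a hypothetical helper wrapping the segmentation pipeline.
const SAMPLE_RATE = 16000;

async function refineSegments(audio, segments, maxLength = 5) {
    const refined = [];
    for (const segment of segments) {
        if (segment.end - segment.start <= maxLength) {
            refined.push(segment);
            continue;
        }
        const slice = audio.subarray(
            Math.floor(segment.start * SAMPLE_RATE),
            Math.floor(segment.end * SAMPLE_RATE),
        );
        const subSegments = await segmentAudio(slice);
        // Shift sub-segment times back into the original audio's time frame.
        const shifted = subSegments.map((s) => ({
            ...s,
            start: s.start + segment.start,
            end: s.end + segment.start,
        }));
        refined.push(...(await refineSegments(audio, shifted, maxLength)));
    }
    return refined;
}
```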