xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Absolute speaker diarization? #873

Closed flatsiedatsie closed 2 months ago

flatsiedatsie commented 2 months ago

Question

I've just managed to integrate the new speaker diarization feature into my project. Very cool stuff. My goal is to let people record meetings, summarize them, and then also list per-speaker tasks. This seems to be a popular feature.

One thing I'm running into is that I don't feed Whisper a single long audio file. Instead I use VAD to feed it small chunks of live audio whenever someone speaks.

However, as far as I can tell the speaker diarization only works "relatively", detecting speakers within a single audio file.

Is there a way to let it detect and 'sort' the correct speaker over multiple audio files? Perhaps it could remember the 'audio fingerprints' of the speakers somehow?

[Screenshot: record_meeting]

flatsiedatsie commented 2 months ago

Going through the source code a bit more, I found existing support for speaker verification.

My plan is to use that verification mechanism to 'fingerprint' each VAD audio chunk, and then match those fingerprints across chunks so that speakers keep the same identity over the whole meeting.

Going through the code I also noticed that only up to three speakers can be separated with diarization. But with these short snippets of audio and the verification mechanism, that would no longer be a limitation. Win-win!
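Roughly, the idea is to compute a speaker embedding for each VAD chunk with the verification model and compare it against embeddings I've already stored. A minimal sketch of that matching step (the helper names and the threshold here are placeholders, not the final code):

// Sketch: match a chunk's speaker embedding against known fingerprints.
// `getEmbedding(chunkAudio)` is a hypothetical helper that runs the chunk
// through the verification model, and `cosineSimilarity(a, b)` is a
// standard cosine similarity between two Float32Arrays.
const fingerprints = []; // [{ id, embedding }]

async function identifySpeaker(chunkAudio) {
    const embedding = await getEmbedding(chunkAudio);
    let bestId = null;
    let bestScore = 0;
    for (const fp of fingerprints) {
        const score = cosineSimilarity(fp.embedding, embedding);
        if (score > bestScore) {
            bestScore = score;
            bestId = fp.id;
        }
    }
    if (bestId !== null && bestScore > 0.9) { // threshold would need tuning
        return bestId; // recognized an existing voice
    }
    const newId = fingerprints.length;
    fingerprints.push({ id: newId, embedding }); // remember the new voice
    return newId;
}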

flatsiedatsie commented 2 months ago

I've got it somewhat working!

I'm testing it now.

One odd thing that just happened is that all the chunks of a recording (me doing a chipmunk voice to pretend to be a second person) got the same timestamp (5.2). Screenshot:

[Screenshot 2024-07-31 at 18:10:51]
MatteoFasulo commented 2 months ago

This is very cool! It could also be useful for extracting clips from podcasts or YouTube videos with many speakers 👍🏼

flatsiedatsie commented 2 months ago

It turned out even cooler than that :-)

I ask a speaker with a new fingerprint to first say "I consent to recording my voice". Only once they've said that will their contribution show up. Otherwise it just says "Redacted - no consent".

I also made it so that you can say "My name is X", and from then on it will preface your contribution with your name instead of "Speaker0", etc.
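Under the hood, both features are just simple pattern matching on the transcribed text for a recognized fingerprint. A simplified sketch of that logic (my actual code is a bit more involved):

// Simplified sketch of the consent / naming checks per recognized speaker.
// `speaker` is assumed to be an object like { name, consented } that is
// stored alongside the voice fingerprint.
function updateSpeakerFromTranscript(speaker, text) {
    if (/i consent to recording my voice/i.test(text)) {
        speaker.consented = true;
    }
    const nameMatch = text.match(/my name is (\w+)/i);
    if (nameMatch) {
        speaker.name = nameMatch[1]; // shown instead of "Speaker0" etc.
    }
    return speaker.consented ? text : 'Redacted - no consent';
}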

[Screenshot 2024-08-02 at 18:25:36]
eschmidbauer commented 2 months ago

Could you share how you implemented VAD?

flatsiedatsie commented 2 months ago

@eschmidbauer Have a look here for an easy-to-use one: https://github.com/ricky0123/vad
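From memory, the basic usage of that library looks something like this (check its README for the exact API); each onSpeechEnd callback hands you a 16 kHz Float32Array chunk that you can feed to Whisper and the diarization models:

// Minimal sketch of @ricky0123/vad-web usage (see the project's README).
import { MicVAD } from '@ricky0123/vad-web';

const vad = await MicVAD.new({
    onSpeechEnd: (audio) => {
        // `audio` is a Float32Array of 16 kHz samples for one speech chunk.
        // `processChunk` is a hypothetical handler that passes it on to the
        // Whisper / diarization pipeline.
        processChunk(audio);
    },
});
vad.start();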

flatsiedatsie commented 2 months ago

For people finding this thread: you may also want to look at the recently added wespeaker-voxceleb-resnet34 model, which is also designed to create audio fingerprints for voices. It reportedly only supports English and Chinese. I haven't tried it yet, but I'm curious how it compares to wavlm-base-plus-sv, since wavlm-base-plus-sv isn't great for me. Then again, I might just be using it wrong (feeding it too much or too little data, etc.).

Here's my final pipeline:

// Imports from Transformers.js (the package name depends on the version:
// '@xenova/transformers' for v2, '@huggingface/transformers' for v3).
import { pipeline, AutoProcessor, AutoModel, AutoModelForAudioFrameClassification } from '@huggingface/transformers';

// PER_DEVICE_CONFIG and self.device are defined elsewhere in my worker.
class PipelineSingleton {
    static asr_model_id = 'onnx-community/whisper-base_timestamped';
    static instance = null;
    static asr_instance = null;

    static segmentation_model_id = 'onnx-community/pyannote-segmentation-3.0';
    static segmentation_instance = null;
    static segmentation_processor = null;

    static verification_model_id = 'Xenova/wavlm-base-plus-sv';
    static verification_instance = null;
    static verification_processor = null;

    static async getInstance(progress_callback = null, model_name = 'onnx-community/whisper-base_timestamped', preferences = {}) {
        console.log("Whisper_worker: Pipeline: getInstance: model_name, preferences: ", model_name, preferences);
        this.asr_model_id = model_name;

        // Merge the user preferences into the per-device configuration
        PER_DEVICE_CONFIG[self.device] = { ...PER_DEVICE_CONFIG[self.device], ...preferences };

        this.asr_instance ??= pipeline('automatic-speech-recognition', this.asr_model_id, {
            ...PER_DEVICE_CONFIG[self.device],
            progress_callback,
        });

        this.segmentation_processor ??= AutoProcessor.from_pretrained(this.segmentation_model_id, {
            ...preferences,
            progress_callback,
        });
        this.segmentation_instance ??= AutoModelForAudioFrameClassification.from_pretrained(this.segmentation_model_id, {
            // NOTE: WebGPU is not currently supported for this model
            // See https://github.com/microsoft/onnxruntime/issues/21386
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_processor ??= AutoProcessor.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        this.verification_instance ??= AutoModel.from_pretrained(this.verification_model_id, {
            device: 'wasm',
            dtype: 'fp32',
            ...preferences,
            progress_callback,
        });

        return Promise.all([this.asr_instance, this.segmentation_processor, this.segmentation_instance, this.verification_processor, this.verification_instance]);
    }
}
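For completeness, this is roughly how I then use those instances on a single audio chunk. It follows the example usage from the pyannote-segmentation-3.0 model card, so treat the exact method and field names as assumptions that may differ between Transformers.js versions:

// Rough sketch of running one VAD chunk (16 kHz Float32Array `audio`)
// through the three models from the singleton above.
const [transcriber, seg_processor, seg_model, ver_processor, ver_model] =
    await PipelineSingleton.getInstance();

// 1. Transcription with word-level timestamps
const transcript = await transcriber(audio, { return_timestamps: 'word' });

// 2. Speaker segmentation -> segments like { id, start, end, confidence }
const seg_inputs = await seg_processor(audio);
const { logits } = await seg_model(seg_inputs);
const segments = seg_processor.post_process_speaker_diarization(logits, audio.length)[0];

// 3. Voice fingerprint for the chunk (what I call `logit_embedding` below)
const ver_inputs = await ver_processor(audio);
const ver_output = await ver_model(ver_inputs);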

Be advised that you'll also need to write a lot of code to 'clean up' the output from the various models.

Here's some example code for how I 'cleaned up' the segments:

    let last_speaker_id = null;
    let joined_segment_end = null;
    let reached_zero = false; // becomes true once a segment starting at 0 has been seen

    for(let s = 0; s < segments.length; s++){
        segments[s]['original_id'] = segments[s].id;

        if(segments[s].id > 0 && segments[s].id < 4){
            last_speaker_id = segments[s].id;
        }

        // Sometimes there are weird, very short segments at the beginning, less than a tenth of a second long. Setting their start to 0 causes them to be pruned later
        if(s < 3 && segments[s].start < (s * 0.1)){
            segments[s].start = 0;
        }
        joined_segment_end = segments[s].end;
    }

    for(let s = segments.length - 1; s >= 0; --s){
        //console.log("segment: ", s);
        if(typeof segments[s] == 'undefined'){
            console.error("segment no longer existed at position: ", s);
            continue
        }

        if(typeof segments[s].id == 'number' && typeof segments[s].confidence == 'number' && typeof segments[s].start == 'number' && typeof segments[s].end == 'number'){

            // Only speaker IDs of 1, 2 or 3 refer to individual speakers. Zero means no speaker (silence), and 4 and above are combinations of speakers (speaking at the same time).
            // TODO: this just steamrolls over mixed speakers, assigning the ID of the speaker that ends up speaking on its own afterwards.
            if( (segments[s].id == 0 || segments[s].id >= 4) && last_speaker_id != null){

                //console.log("changing a segment's ID.  old -> new, and duration: ", segments[s].id,last_speaker_id,segments[s].end - segments[s].start);
                segments[s].id = last_speaker_id;
            }

            if(segments[s].id > 0 && segments[s].id < 4){
                //console.log("segment has good single speaker ID: ", segments[s].id);

                if(segments[s].id != last_speaker_id){
                    //console.log("switching to another speaker");
                    last_speaker_id = segments[s].id;
                    joined_segment_end = segments[s].end;
                }
                else{
                    //console.log("still the same speaker speaking");
                    if(joined_segment_end != null){
                        if(joined_segment_end != segments[s].end){
                            segments[s].end = joined_segment_end;

                            if(typeof segments[s + 1] != 'undefined' && segments[s+1].id == segments[s].id && reached_zero == false){
                                //console.log("removing older segment with the same ID as this one");
                                segments.splice(s + 1, 1);
                            }
                        }

                    }

                }

            }
            // TODO: Could distinguish between silence and mixed speakers here
            else{
                console.error("segment has bad speaker ID: ", segments[s]);
            }

            // Remove very short segments from the beginning
            if(segments[s].start == 0 && reached_zero == true){
                segments.splice(s, 1);
            }
            else if(segments[s].start == 0 && reached_zero == false){
                reached_zero = true;
            }

            //console.log("reached_zero: ", reached_zero);
            //console.log("joined_segment_end: ", joined_segment_end);

        }
        else{
            console.error("segment was missing basic attributes: ", segments[s]);
        }

    }

I keep a list of voice fingerprints tied to speaker IDs, and compare the embedding of every new chunk against it:

for(let f = 0; f < self.fingerprints.length; f++){

    if(fingerprints_to_skip.indexOf(f) != -1){
        // skip, already used
    }
    else{
        if(typeof self.fingerprints[f].embedding != 'undefined'){
            //console.log("verify segment: comparing ", f, self.fingerprints[f].embedding, logit_embedding);
            try{
                const similarity = cosinesim(self.fingerprints[f].embedding, logit_embedding);
                console.log("verify segment: SIMILARITY: ", f, similarity);
                fingerprint_matches.push(similarity);
                if(similarity > highest_match){
                    highest_match = similarity;
                    if(similarity > 0.94){
                        found_id = f;
                    }
                }
            }
            catch(err){
                console.error("verify segment: error doing similarity check: ", err);
            }
        }
    }

}
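When none of the stored fingerprints clears the threshold, I treat the chunk as a new voice and store its embedding. Roughly (and simplified; the extra fields are just how I track the consent and naming features mentioned earlier):

// Follow-up to the matching loop above, assuming found_id started out as null.
if (found_id === null) {
    found_id = self.fingerprints.length;
    self.fingerprints.push({
        id: found_id,               // becomes "Speaker0", "Speaker1", ... in the UI
        embedding: logit_embedding, // the chunk's voice fingerprint
        name: null,                 // filled in once someone says "My name is X"
        consented: false,           // flipped by the consent phrase
    });
}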

And here's the cosine similarity function I use to compare the voice fingerprints:

function cosinesim(A, B){
    let dotproduct = 0;
    let mA = 0;
    let mB = 0;
    for(let i = 0; i < A.length; i++){
        dotproduct += A[i] * B[i];
        mA += A[i] * A[i];
        mB += B[i] * B[i];
    }
    mA = Math.sqrt(mA);
    mB = Math.sqrt(mB);
    return dotproduct / (mA * mB);
}

Since it's somewhat working now, I'll close this issue.