wq2012 / SpectralCluster

Python re-implementation of the (constrained) spectral clustering algorithms used in Google's speaker diarization papers.
https://google.github.io/speaker-id/publications/LstmDiarization/
Apache License 2.0

Tracking speakers in time #48

Closed · niemiaszek closed this 1 year ago

niemiaszek commented 1 year ago

Thanks for your contribution and for the recent updates with MultiStageClusterer and the streaming_predict method!

I have a question about tracking speakers over time. In my case (online inference), I pay the most attention to the newest added embedding. However, I want the newest predictions to match the ones made in the past, i.e. each sample should keep the same speaker ID over time. I've noticed that with this clustering approach the IDs can easily change over time, and the enforce_ordered_labels method obviously won't help here.

Have you tried to introduce some method for tracking the speaker clusters? I think it could be solved as a maximum bipartite matching problem: represent each cluster by the set of sample indices assigned to it, and compute the similarity between a cluster from the previous step and a cluster from the current step as their Jaccard index.

For example, after 5 steps we would get a list like [0, 0, 0, 1, 1]. In the next iteration, we add the 6th embedding and get swapped IDs for the clusters, e.g. [1, 1, 1, 0, 0, 1]. This would be especially ruinous if we only kept the last label from every prediction.

With the proposed method, we could represent the reference clusters as {0: {0, 1, 2}, 1: {3, 4}} and the new clusters as {0: {3, 4}, 1: {0, 1, 2, 5}}. In this example enforcing ordered IDs would help, but in general it might not be that useful, especially when we make some adjustments to previous clusters (e.g. we would get [0, 0, 1, 2, 2] if sample number 3 was assigned to its own cluster after an update). A sketch of this idea is below.
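Here is a minimal sketch of the Jaccard-based matching I have in mind (all helper names are hypothetical, not part of this library):

```python
def labels_to_clusters(labels):
    """Map each cluster ID to the set of sample indices assigned to it."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, set()).add(index)
    return clusters

def jaccard(a, b):
    """Jaccard index of two sets of sample indices."""
    return len(a & b) / len(a | b)

old = labels_to_clusters([0, 0, 0, 1, 1])     # {0: {0, 1, 2}, 1: {3, 4}}
new = labels_to_clusters([1, 1, 1, 0, 0, 1])  # {0: {3, 4}, 1: {0, 1, 2, 5}}

# Pairwise similarities between old and new clusters; these are the edge
# weights for a bipartite matching that decides how to relabel `new`.
weights = {(p, q): jaccard(old[p], new[q]) for p in old for q in new}
print(weights)  # {(0, 0): 0.0, (0, 1): 0.75, (1, 0): 1.0, (1, 1): 0.0}
```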

wq2012 commented 1 year ago

Thanks for following our work, and yeah this is a great question!

Assume the prediction labels at time T are X[:T] and the prediction labels at time T+1 are Y[:T+1]. The ideal solution is to use the Hungarian algorithm to find the best matching between X[:T] and Y[:T], as sketched below.
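A sketch of that matching, assuming scipy is available (the helper name and the choice of co-occurrence counts as the score matrix are my assumptions, not library code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels(x, y):
    """Map each label in y to its best-matching label in x.

    x and y are equal-length label sequences (e.g. X[:T] and Y[:T]).
    """
    x, y = np.asarray(x), np.asarray(y)
    x_labels, y_labels = np.unique(x), np.unique(y)
    # Contingency matrix: counts[i, j] is how often y_labels[i]
    # co-occurs with x_labels[j] at the same time step.
    counts = np.array([[np.sum((y == yl) & (x == xl)) for xl in x_labels]
                       for yl in y_labels])
    # Hungarian algorithm, maximizing total overlap.
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return {int(y_labels[i]): int(x_labels[j]) for i, j in zip(rows, cols)}

print(match_labels([0, 0, 0, 1, 1], [1, 1, 1, 0, 0]))  # {0: 1, 1: 0}
```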

Alternatively, for simplicity and efficiency (and fewer runtime library dependencies), one can use a greedy algorithm to approximate the best matching.

E.g. in your case, Y[:T] = [1, 1, 1, 0, 0], so 1 is the dominating label in Y. Then we look at X[:T] = [0, 0, 0, 1, 1] and see that label 1 in Y is mostly projected to label 0 in X, so we map 1 to 0. Then we look at the second dominating label in Y[:T] and do the same, until we have all the mappings. Finally we apply the mapping to Y, and the mapped new Y' will be much more consistent with X. A sketch of this greedy matching follows.
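A minimal sketch of that greedy approximation in plain Python (the helper is hypothetical, not part of this library):

```python
from collections import Counter

def greedy_match(x, y):
    """Greedily relabel y to be consistent with x over the overlapping range."""
    mapping, used = {}, set()
    # Process y's labels from most to least frequent on the overlap.
    t = min(len(x), len(y))
    for y_label, _ in Counter(y[:t]).most_common():
        # Count which x labels co-occur with this y label
        # (zip stops at the shorter sequence, i.e. the overlap).
        co = Counter(xl for xl, yl in zip(x, y) if yl == y_label)
        # Map to the most frequent x label that is still unused.
        for x_label, _ in co.most_common():
            if x_label not in used:
                mapping[y_label] = x_label
                used.add(x_label)
                break
    # Unmatched labels (e.g. a brand-new speaker) keep their original value.
    return [mapping.get(label, label) for label in y]

print(greedy_match([0, 0, 0, 1, 1], [1, 1, 1, 0, 0, 1]))  # [0, 0, 0, 1, 1, 0]
```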

I believe we are using the greedy algorithm internally at Google, but I don't remember all the details offhand (e.g. whether to map from X to Y or from Y to X), since it was implemented by someone else.

I haven't implemented it in this Python library (which is more of a demonstration than a reproduction), but I may add it later.

P.S. MultiStageClusterer was proposed to make it possible to process hours-long audio efficiently on mobile phones. This library is in Python, inefficient, and impossible to port to mobile anyway...

niemiaszek commented 1 year ago

Thank you for the comprehensive answer! I will try both methods!

It would be nice to see speaker tracking as a feature in this library; after my initial experiments, it looks like a big part of getting online inference right. Otherwise, I found the library quite complete 👍

I understand that this is "just" a Python demo, but it's enough to check how such clustering can work together with the rest of a diarization pipeline.

wq2012 commented 1 year ago

Just added a deflicker option to the MultiStageClusterer class.