xinjli / allosaurus

Allosaurus is a pretrained universal phone recognizer for more than 2000 languages
GNU General Public License v3.0

Support Speaker Diarization #26

Closed by steveway 3 years ago

steveway commented 3 years ago

Hello, as you can see here, I've started integrating this project into Papagayo-NG: https://github.com/morevnaproject-org/papagayo-ng/issues/49 The first results from my tests are very promising. The new timestamp feature in particular helps a lot.

Is it possible to add speaker separation to this? Papagayo-NG itself allows several speakers for one audio file. If we could recognize which parts are spoken by which speaker, this would be a really nice solution for even more animators. I've looked into the topic, and it seems to be quite complex. If this could be integrated into Allosaurus, that would be awesome of course. If not, there are still ways to get this into Papagayo-NG: we could do a separate diarization pass over the audio. pyAudioAnalysis seems to do that already, but it would be a big dependency to add.
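A rough sketch of the separate-pass idea, not something Allosaurus or Papagayo-NG provides: once a diarization tool has produced speaker segments, each timestamped phone can be assigned to the segment that contains its midpoint. The function name `assign_speakers`, the `(start, duration, phone)` tuple layout, and the `(start, end, speaker)` segment layout are all assumptions made for illustration; the real outputs of either tool would need to be parsed into these shapes first.

```python
from bisect import bisect_right


def assign_speakers(phones, segments, default="unknown"):
    """Label each phone with the speaker whose segment contains its midpoint.

    phones:   list of (start_sec, duration_sec, phone) tuples (assumed layout)
    segments: list of (start_sec, end_sec, speaker) tuples, non-overlapping
              (assumed layout for the output of a separate diarization pass)
    """
    segments = sorted(segments)
    segment_starts = [seg[0] for seg in segments]

    labelled = []
    for start, duration, phone in phones:
        midpoint = start + duration / 2
        # Index of the last segment that starts at or before the midpoint.
        i = bisect_right(segment_starts, midpoint) - 1
        if i >= 0 and midpoint < segments[i][1]:
            speaker = segments[i][2]
        else:
            # The phone falls in a gap that the diarizer did not label.
            speaker = default
        labelled.append((speaker, start, duration, phone))
    return labelled


# Hypothetical data, only to show the call shape.
example_phones = [(0.21, 0.045, "h"), (1.50, 0.045, "a"), (3.00, 0.045, "o")]
example_segments = [(0.0, 1.0, "speaker_1"), (1.0, 2.0, "speaker_2")]
example_result = assign_speakers(example_phones, example_segments)
```

Using the midpoint rather than the start time avoids mislabelling a phone that begins just before a speaker change; phones in unlabelled gaps keep the `default` label so they can be reviewed by hand.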

xinjli commented 3 years ago

Hi, thanks for your suggestion!

Unfortunately, speaker diarization is a very different task from the current recognition task. We have no plans to add a diarization model.

However, there are a number of repos that handle the diarization task. You can have a look at them here: https://github.com/topics/speaker-diarization

I have personally used the following one before. It performs well but requires some additional effort to get working: https://github.com/google/uis-rnn

steveway commented 3 years ago

I see, that makes sense, thank you. I'll experiment with integrating other tools like that, then. I guess this issue can be closed.