wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0

Does speaker-diar have vad system dependency? #325

Closed NathanJHLee closed 1 month ago

NathanJHLee commented 1 month ago

Hi, wespeaker team. My name is Nathan. First of all, thank you for your great work. These days I am using wespeaker to implement my own diarization system.

But I ran into an error that I don't understand.

First, I trained the model 'ECAPA_TDNN-ASTP-emb512-fbank80-num_frms200-aug0.6-spTrue-saFalse-ArcMargin-SGD-epoch150' on another language. I checked that this model works properly.

Second, I moved to 'examples/voxconverse/v2' and followed stages 3 (VAD) through 6 (applying spectral clustering).

I would like to use my own VAD system, so I replaced the original 'data/test/system_sad' with my own VAD output, as shown below.

silero-vad-3.1)

    2spk-00000834-00004734 2spk 0.834 4.734
    2spk-00005088-00011358 2spk 5.088 11.358
    2spk-00012192-00017502 2spk 12.192 17.502
    2spk-00018240-00020382 2spk 18.240 20.382
    2spk-00020736-00022302 2spk 20.736 22.302
    2spk-00023040-00024798 2spk 23.040 24.798

mine)

    2spk-00000940-00004780 2spk 0.940 4.780
    2spk-00005080-00011480 2spk 5.080 11.480
    2spk-00012160-00017580 2spk 12.160 17.580
    2spk-00018220-00020480 2spk 18.220 20.480
    2spk-00020740-00022320 2spk 20.740 22.320
    2spk-00023040-00024940 2spk 23.040 24.940

When I use my own VAD timestamps, the system gives wrong results. The only thing I changed is 'data/test/system_sad'; I then ran stages 4-6:

    bash ./run_test.sh --stage 4 --stop_stage 6

But the system produces a quite different result.

test_system_sad_labels

case1) using silero

    2spk-00000834-00004734-00000000-00000150 1
    2spk-00000834-00004734-00000075-00000225 1
    2spk-00000834-00004734-00000150-00000300 1
    2spk-00000834-00004734-00000225-00000375 1
    2spk-00000834-00004734-00000300-00000390 1
    2spk-00005088-00011358-00000000-00000150 1
    2spk-00005088-00011358-00000075-00000225 1
    2spk-00005088-00011358-00000150-00000300 1
    2spk-00005088-00011358-00000225-00000375 1
    2spk-00005088-00011358-00000300-00000450 1
    2spk-00005088-00011358-00000375-00000525 1
    2spk-00005088-00011358-00000450-00000600 1
    2spk-00005088-00011358-00000525-00000627 1
    2spk-00012192-00017502-00000000-00000150 1
    2spk-00012192-00017502-00000075-00000225 1
    2spk-00012192-00017502-00000150-00000300 1
    2spk-00012192-00017502-00000225-00000375 1
    2spk-00012192-00017502-00000300-00000450 1
    2spk-00012192-00017502-00000375-00000525 1
    2spk-00012192-00017502-00000450-00000531 1
    2spk-00018240-00020382-00000000-00000150 0
    2spk-00018240-00020382-00000075-00000214 0
    2spk-00020736-00022302-00000000-00000150 0
    2spk-00020736-00022302-00000075-00000156 0
    2spk-00023040-00024798-00000000-00000150 0
    2spk-00023040-00024798-00000075-00000175 0

case2) using mine

    2spk-00000940-00004780-00000000-00000150 1
    2spk-00000940-00004780-00000075-00000225 1
    2spk-00000940-00004780-00000150-00000300 2
    2spk-00000940-00004780-00000225-00000375 2
    2spk-00000940-00004780-00000300-00000384 2
    2spk-00005080-00011480-00000000-00000150 1
    2spk-00005080-00011480-00000075-00000225 1
    2spk-00005080-00011480-00000150-00000300 1
    2spk-00005080-00011480-00000225-00000375 2
    2spk-00005080-00011480-00000300-00000450 1
    2spk-00005080-00011480-00000375-00000525 1
    2spk-00005080-00011480-00000450-00000600 1
    2spk-00005080-00011480-00000525-00000640 2
    2spk-00012160-00017580-00000000-00000150 1
    2spk-00012160-00017580-00000075-00000225 1
    2spk-00012160-00017580-00000150-00000300 2
    2spk-00012160-00017580-00000225-00000375 2
    2spk-00012160-00017580-00000300-00000450 1
    2spk-00012160-00017580-00000375-00000525 2
    2spk-00012160-00017580-00000450-00000542 2
    2spk-00018220-00020480-00000000-00000150 0
    2spk-00018220-00020480-00000075-00000225 0
    2spk-00018220-00020480-00000150-00000226 0
    2spk-00020740-00022320-00000000-00000150 0
    2spk-00020740-00022320-00000075-00000158 0
    2spk-00023040-00024940-00000000-00000150 0
    2spk-00023040-00024940-00000075-00000190 0
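The subsegment IDs in the two listings follow a regular pattern: each VAD segment appears to be cut into 150-frame windows with a 75-frame shift, with the last window clipped at the segment end. The sketch below reproduces that naming from a segment ID. It is inferred only from the listings in this post and is not taken from the wespeaker source; the function name `make_subsegments`, the 10 ms frame size, and the floor division used to get the frame count are all assumptions.

```python
def make_subsegments(seg_id, window=150, shift=75, frame_ms=10):
    """Hypothetical helper: list subsegment IDs for one VAD segment.

    seg_id looks like "2spk-00000834-00004734" (utterance, start ms, end ms).
    Assumes 10 ms frames and a floor-divided frame count, which matches
    the listings above but may differ from the toolkit's real logic.
    """
    _utt, start_ms, end_ms = seg_id.rsplit("-", 2)
    num_frames = (int(end_ms) - int(start_ms)) // frame_ms

    subsegs = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        subsegs.append(f"{seg_id}-{start:08d}-{end:08d}")
        if end == num_frames:
            break  # the last window reached the end of the segment
        start += shift
    return subsegs


if __name__ == "__main__":
    for name in make_subsegments("2spk-00018220-00020480"):
        print(name)
```

Under these assumptions, a slightly longer segment can add a subsegment (for example, the 2.26 s segment from my VAD yields three windows where the 2.142 s silero segment yields two), so the number of embeddings passed to clustering changes even when the VAD boundaries move only a little.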

The result with silero is correct, but the result with my VAD is wrong. I also implemented VAD, embedding extraction, and spectral clustering in a C++ environment, and it shows the same error. Does your system depend on a particular VAD? Please give me some advice.

Thank you.

JiJiJiang commented 1 month ago

Thank you for your question.

  1. Re-training your own ECAPA-TDNN model is fine, and it could perform better in the language of your data if you have enough in-domain training data.
  2. Using your own VAD system is also fine. (Your VAD works well, as the results show. A better VAD contributes to better diarization results.)

Maybe you can try:

  1. Print np.diff(eig_values[:max_num_spks + 1]) at this line: https://github.com/wenet-e2e/wespeaker/blob/788e3eb71292af87d4a1708e8812387a82415221/wespeaker/diar/spectral_clusterer.py#L61 The index of the maximum value decides the number of speakers. In your case, I guess the second and third values are very close: unfortunately, the 3rd value is slightly bigger with your own VAD, while the 2nd one is bigger with silero-VAD.
  2. Try to tune the n parameter at https://github.com/wenet-e2e/wespeaker/blob/788e3eb71292af87d4a1708e8812387a82415221/wespeaker/diar/spectral_clusterer.py#L42 or https://github.com/wenet-e2e/wespeaker/blob/788e3eb71292af87d4a1708e8812387a82415221/wespeaker/diar/spectral_clusterer.py#L44 Using a smaller n would probably give a smaller number of speakers.
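
The eigengap rule in point 1 can be tried on a toy affinity matrix. The following is a minimal sketch, assuming an unnormalized graph Laplacian and NumPy's `eigvalsh`; the helper names `estimate_num_speakers` and `two_block_affinity` are hypothetical and this is not the code at the linked lines. It only illustrates how the speaker count follows from the largest gap between the smallest eigenvalues.

```python
import numpy as np


def estimate_num_speakers(affinity, max_num_spks=4):
    """Hypothetical helper: pick the speaker count from the largest eigengap."""
    A = np.array(affinity, dtype=float)
    np.fill_diagonal(A, 0.0)              # ignore self-similarity
    lap = np.diag(A.sum(axis=1)) - A      # unnormalized Laplacian D - A
    eig_values = np.linalg.eigvalsh(lap)  # returned in ascending order
    gaps = np.diff(eig_values[:max_num_spks + 1])
    return int(np.argmax(gaps)) + 1, gaps


def two_block_affinity(size, cross):
    """Toy data: two groups with affinity 1 inside and `cross` between them."""
    inside = np.ones((size, size))
    between = np.full((size, size), cross)
    return np.block([[inside, between], [between, inside]])


if __name__ == "__main__":
    for cross in (0.0, 0.05, 0.5):
        num_spks, gaps = estimate_num_speakers(two_block_affinity(4, cross))
        print(cross, num_spks, np.round(gaps, 3))
```

In this toy setup, two cleanly separated groups give an estimate of 2, while a cross-affinity of 0.5 moves the largest gap to the first position and the estimate drops to 1. When two gaps are close in size, a small change in the input segments can flip which one is largest.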

NathanJHLee commented 1 month ago

Thank you for your response. I still don't understand why spectral clustering behaves differently even though the VAD results are almost the same. I followed your suggestion and tried 'n=n-1' and 'n=n-2'. Both of them work, but only for one test wav when I use my own VAD. I am wondering whether the parameter 'n' is determined by the number of subsegments. How can I control 'n'?

JiJiJiang commented 1 month ago

n = int((1.0 - p) * m), where m is the number of subsegments. If m is less than 1000, we use n = m - 10; otherwise n = m * (1 - p). p is usually tuned on a dev set. From our experience, p in [0.01, 0.05] works fine. Some papers have discussed the p value. link
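
The rule for n can be written out as a small function. This is a sketch based only on the formulas quoted in this thread (including the `max(m - 10, 2)` floor mentioned in the next comment); the names `prune_count` and `prune_rows` are hypothetical. It also assumes that n is the number of lowest-similarity entries zeroed in each row, which is consistent with the remark that a smaller n tends to give fewer speakers but is not confirmed here.

```python
import numpy as np


def prune_count(m, p=0.01):
    """Number of entries to prune per row for m subsegments (sketch)."""
    if m < 1000:
        return max(m - 10, 2)
    return int((1.0 - p) * m)


def prune_rows(similarity, p=0.01):
    """Assumed pruning step: zero the n lowest entries per row, then symmetrize."""
    S = np.array(similarity, dtype=float)
    m = S.shape[0]
    n = prune_count(m, p)
    for i in range(m):
        lowest = np.argsort(S[i])[:n]  # indices of the n smallest values
        S[i, lowest] = 0.0
    return 0.5 * (S + S.T)


if __name__ == "__main__":
    # 26 and 27 are the subsegment counts of the two listings in this thread.
    for m in (26, 27, 2000):
        print(m, prune_count(m))
```

For short recordings like the one discussed here, p has no effect under this rule: with fewer than 1000 subsegments, n depends only on m, so changing n means editing the `m - 10` offset directly.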

NathanJHLee commented 1 month ago

All of my test files have fewer than 1000 subsegments, so only 'n = max(m - 10, 2)' applies. But the main difference between the two VADs is the time resolution. I realized that my VAD implementation outputs timestamps in units of 10 ms, while yours outputs them in units of 1 ms. Could you please check whether your system works well with units of 10 ms?

JiJiJiang commented 1 month ago

The unit of the VAD output (1 ms or 10 ms) should not make such a big difference.

NathanJHLee commented 1 month ago

Yep, that's true. So I tried to get results using your pretrained model. After testing, I realized that the model I trained myself has a small problem, even though its EER is not bad. I will figure out what the matter is. Thank you so much for your help.