yuhogun0908 / MISOnet

Unofficial Multi-microphone complex spectral mapping for utterance-wise and continuous speech separation (MISO-BF-MISO)
MIT License

Evaluation #6

Closed by ermu-tech 2 years ago

ermu-tech commented 2 years ago

Hi! Have you tried evaluating the network with metrics such as SI-SIR, PESQ, eSTOI, or WER? I trained the model on my own dataset and evaluated it with SI-SNR, and the results varied widely between the 2-mic and 3-mic setups. During training, the loss is also very large, ranging from a few thousand to about sixty or seventy. Have you seen anything similar? BTW, would you like to share your email? Or you can contact me at 1243330273@qq.com. Looking forward to your reply! Thank you so much!
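For reference, the SI-SNR metric mentioned above can be sketched as follows. This is a minimal, dependency-free illustration of the usual definition (zero-mean both signals, project the estimate onto the reference, compare target energy to residual energy). The function name `si_snr` and the `eps` parameter are assumptions made for this example, not code from the MISOnet repository, which may compute the metric or the loss differently.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; not taken from MISOnet.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Example: a sine reference and a slightly perturbed estimate.
reference = [math.sin(0.1 * i) for i in range(200)]
estimate = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(reference)]
score = si_snr(estimate, reference)
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged. That makes the metric a poor indicator of absolute level mismatches, so a large training loss and a reasonable SI-SNR are not necessarily contradictory.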

yuhogun0908 commented 2 years ago

Hi! Sorry for the late reply. I did not evaluate SI-SIR, PESQ, or eSTOI. I have only used the 6-mic sms_wsj dataset to train the model, and I evaluated it with WER (Kaldi). The result is almost the same as in the paper. My email is yuhogun10@gmail.com. Feel free to contact me by email at any time!

Have a nice day~

ermu-tech commented 2 years ago

Thank you for your reply! Could you briefly explain how to evaluate WER with Kaldi? Or do you have a tutorial to share? I could not find a suitable one on the Internet. I have installed Kaldi before but have hardly ever used it, so I need your help!
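As background for the question above, the WER metric itself is the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The sketch below shows only that calculation. It is not the Kaldi pipeline discussed in this thread, and the function name `word_error_rate` is a hypothetical helper chosen for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Hypothetical helper for illustration; not part of Kaldi or MISOnet.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In a real evaluation the separated audio would first be decoded by an ASR system, and the score would be aggregated over the whole test set rather than per utterance, which this sketch does not cover.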

yuhogun0908 commented 2 years ago

Hi! I saw your email, but I have had a busy schedule, so I am only replying now. I will explain how to evaluate WER with Kaldi soon.

ermu-tech commented 2 years ago

Thanks! Looking forward to your reply!

rolandhartanto21 commented 1 year ago

Hi, I tried to evaluate the MISO1 model using the ASR system provided by SMS-WSJ. However, the WER differs significantly. The WER for MISO1 with 6 mics reported in the original paper is 13.92%, and the one I got is 20.22%. I found six channels in the output audio from tester.py because of the microphone shift. Which one did you use for the evaluation?

I also found that the best MISO1 model comes from around epoch 50 of 100. Did you see the same behavior?

I'm looking forward to your reply. Thank you for your time.

yuhogun0908 commented 1 year ago

Hi. Sorry for the late reply.

I used the first channel output for testing. Although the WER did not exactly match the paper, I remember getting a WER of about 16%. I also remember that the best WER came from around epoch 80.

rolandhartanto21 commented 1 year ago

Thank you for the confirmation. I have another question. In config/NN_BSS.yml at the SMS_WSJ section, there is "speech_source_scaled". Did you perform scaling to the speech source audio for training? If yes, may I know how you did the scaling? Thank you very much.