wenet-e2e / wespeaker

Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Apache License 2.0
594 stars 102 forks

Kaldi Fbanks #328

Spectra456 closed 1 week ago

Spectra456 commented 2 weeks ago

Hi, thanks for the repository! I have two questions:

  1. Is there a specific reason why Kaldi fbanks were used for training and inference? Did you try other implementations, such as torchaudio's own MelSpectrogram?
  2. I understand that kaldifeat is the fastest open-source implementation, but it still slows down the whole pipeline, even when it runs on the GPU, because of the memory transfers between the two models in Triton Server. I think the fastest approach would be to write an fbank implementation in pure PyTorch, convert it to ONNX, and combine the fbank ONNX model with the speaker verification ONNX model at the point where the latter is exported. That would avoid all the unnecessary memory movement in Triton Server.

JiJiJiang commented 2 weeks ago

  1. As you say, the kaldifeat C++ implementation is open source and easy to use for deployment, which makes it much more convenient than MelSpectrogram in torchaudio.
  2. Yes. Since torchaudio.compliance.kaldi.fbank cannot directly be made part of the model, I agree that writing an fbank implementation in pure PyTorch may be the best way to fix this problem. Then only one ONNX model would need to be exported. However, for historical reasons, we would need to run a lot of experiments. If you are interested in this, you are very welcome to contribute. But first of all, the performance should fully match that of torchaudio.compliance.kaldi.fbank.
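
For illustration only, the idea discussed in this thread could look roughly like the sketch below: a log-mel front end built from basic tensor operations (framing, a matrix-product DFT, triangular mel filters), wrapped together with the speaker model in a single module. The class names `SimpleFbank` and `FbankThenModel` and all default values are hypothetical and not part of wespeaker. The sketch assumes 16 kHz audio and no dither. It has not been checked against torchaudio.compliance.kaldi.fbank, and whether every operation exports cleanly to ONNX depends on the exporter and opset, which is not verified here.

```python
import math

import torch
import torch.nn as nn


class SimpleFbank(nn.Module):
    """Hypothetical Kaldi-style log-mel front end using only basic tensor ops."""

    def __init__(self, sample_rate=16000, frame_length=400, frame_shift=160,
                 n_fft=512, num_mel_bins=80, low_freq=20.0, preemphasis=0.97):
        super().__init__()
        self.frame_length = frame_length
        self.frame_shift = frame_shift
        self.preemphasis = preemphasis
        f64 = torch.float64

        # Povey window: a Hann window raised to the power 0.85.
        n = torch.arange(frame_length, dtype=f64)
        window = (0.5 - 0.5 * torch.cos(2 * math.pi * n / (frame_length - 1))) ** 0.85

        # Real DFT as two matrix products. Using only frame_length rows is
        # equivalent to zero-padding each frame up to n_fft samples.
        k = torch.arange(n_fft // 2 + 1, dtype=f64)
        angle = 2 * math.pi * n[:, None] * k[None, :] / n_fft

        # Triangular filters evenly spaced on the mel scale 1127 * ln(1 + f / 700).
        mels = 1127.0 * torch.log(1.0 + k * sample_rate / n_fft / 700.0)
        mel_low = 1127.0 * math.log(1.0 + low_freq / 700.0)
        mel_high = 1127.0 * math.log(1.0 + 0.5 * sample_rate / 700.0)
        delta = (mel_high - mel_low) / (num_mel_bins + 1)
        left = mel_low + delta * torch.arange(num_mel_bins, dtype=f64)[:, None]
        up = (mels[None, :] - left) / delta
        down = (left + 2 * delta - mels[None, :]) / delta
        filters = torch.clamp(torch.minimum(up, down), min=0.0)

        # Buffers follow the module across devices and are saved with it.
        self.register_buffer("window", window.float())
        self.register_buffer("dft_cos", torch.cos(angle).float())
        self.register_buffer("dft_sin", torch.sin(angle).float())
        self.register_buffer("filters", filters.float())

    def forward(self, wav):
        # wav: (batch, samples) -> frames: (batch, num_frames, frame_length)
        frames = wav.unfold(-1, self.frame_length, self.frame_shift)
        # Remove the DC offset of each frame.
        frames = frames - frames.mean(dim=-1, keepdim=True)
        # Pre-emphasis with the first sample of each frame replicated.
        prev = torch.cat([frames[..., :1], frames[..., :-1]], dim=-1)
        frames = (frames - self.preemphasis * prev) * self.window
        # Power spectrum, then mel filtering, then a floored logarithm.
        power = (frames @ self.dft_cos) ** 2 + (frames @ self.dft_sin) ** 2
        mel = power @ self.filters.T
        return torch.log(torch.clamp(mel, min=torch.finfo(torch.float32).eps))


class FbankThenModel(nn.Module):
    """Hypothetical wrapper so that one module maps raw audio to the output."""

    def __init__(self, frontend, model):
        super().__init__()
        self.frontend = frontend
        self.model = model

    def forward(self, wav):
        return self.model(self.frontend(wav))


if __name__ == "__main__":
    # nn.Linear is only a stand-in for a real speaker verification model.
    combined = FbankThenModel(SimpleFbank(), nn.Linear(80, 4))
    print(combined(torch.randn(2, 16000)).shape)
```

Before relying on anything like this, the front end's output would need to be compared frame by frame with torchaudio.compliance.kaldi.fbank on the same waveform and settings, since details such as windowing, dither, and input scaling can differ.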