yistLin / dvector

Speaker embedding (d-vector) trained with GE2E loss

Make preprocessing fully differentiable with torch API #4

Closed HudsonHuang closed 3 years ago

HudsonHuang commented 3 years ago

I appreciate your efforts, nice work. However, your audio_toolkit is implemented with librosa and numpy, which are not differentiable, and that might limit its applications. E.g., if I have a TTS model that generates Mel spectrograms, and your dvector is fully differentiable, we can use it like a discriminator to force the TTS model's output to match the expected speaker. From waveform to Mel spectrogram, you can make preprocessing fully differentiable with torchaudio, and it seems it can keep consistency with librosa.

yistLin commented 3 years ago

Hi, thanks for your suggestion. I'm actually considering ditching librosa for torchaudio especially after I chose to do silence trimming with sox instead of webrtcvad.

Since I'd like to keep the preprocessing modules as simple as possible (importing as few packages as possible), I probably need some time to study the usage of sox effects in the most recent version of torchaudio.

yistLin commented 3 years ago

I've developed a completely new preprocessing toolkit which uses torchaudio, can be compiled with TorchScript, and can be used anywhere without any dependencies.