yl4579 / StyleTTS2

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
MIT License
4.38k stars, 340 forks

weird pulse at the end of the model #216

Open matmult opened 3 months ago

matmult commented 3 months ago

As the title says, there are a lot of comments in the code that say "weird pulse at the end of the model". I would just like to know whether this has been fixed.

Ziyueyork commented 1 month ago

I use whisper-large-v3 instead of WavLM for adversarial training. The pulse disappeared after fine-tuning. Hope it helps.

mocialov commented 1 month ago

@Ziyueyork I tried train_finetune.py with slm: model: 'openai/whisper-large-v3', but I get this error: "Whisper expects the mel input features to be of length 3000, but found 20000. Make sure to pad the input mel features to 3000." Do you have a suggestion for how to get the input into the right shape? I use max_len: 100.

Ziyueyork commented 1 month ago

I guess you are passing the raw waveform to the Whisper model, the same way the original code passes it to WavLM. WavLM accepts a raw waveform, but Whisper needs a log-mel spectrogram as input. So you need to add feature-extraction code to the WavLMLoss class in losses.py. Here is some reference code for the feature extraction: https://github.com/huggingface/transformers/blob/v4.40.2/src/transformers/models/whisper/feature_extraction_whisper.py