winddori2002 / DEX-TTS

DEX-TTS: Diffusion-based EXpressive TTS with Style Modeling on Time Variability
MIT License

Possible inconsistency between the paper and the code? #7

Open yusuke-ai opened 1 day ago

yusuke-ai commented 1 day ago

Hi,

Thank you for the awesome work!

I'm reading your paper alongside the code, and I think there may be an inconsistency. The paper says:

"T-V encoder contains a few residual convolution blocks, but we employ Layer Normalization (LN) instead of IN to preserve temporal relationships in each instance"

but I can't find LN in the corresponding code: https://github.com/winddori2002/DEX-TTS/blob/main/DEX-TTS/model/ref_encoder.py#L131

Should I add layer normalization to the code, or is it OK to leave it without LN?

Thank you!

winddori2002 commented 12 hours ago

Hi, thanks for your interest.

For the T-V encoder, replacing BN with LN in the TVEncoderBlock worked better. You can check the TVEncoderBlock and BasicConv classes.
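
For readers following the thread, here is a minimal PyTorch sketch of what swapping BN for LN in a residual convolution block can look like. It is not the repository's code: `ChannelLayerNorm` and `ConvBlockLN` are hypothetical names, and the use of `nn.Conv1d` on a (batch, channels, time) tensor, the kernel size, and the choice to normalize over the channel axis at each time step are all assumptions made for illustration. The actual `TVEncoderBlock` and `BasicConv` classes may differ on every one of these points.

```python
import torch
import torch.nn as nn


class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel axis of a (batch, channels, time) tensor.

    Hypothetical helper: each time step is normalized independently, so
    statistics are never pooled across the time axis.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # nn.LayerNorm normalizes the last dimension, so move channels
        # to the end, normalize, and move them back.
        return self.norm(x.transpose(1, 2)).transpose(1, 2)


class ConvBlockLN(nn.Module):
    """Hypothetical residual 1-D conv block with LN in place of BN."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # "Same" padding (for odd kernel sizes) keeps the time length
        # unchanged, which the residual connection requires.
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.norm = ChannelLayerNorm(channels)  # nn.BatchNorm1d(channels) would be the BN variant
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.norm(self.conv(x)))


x = torch.randn(2, 8, 50)   # (batch, channels, time)
y = ConvBlockLN(8)(x)
print(y.shape)              # shape is preserved: (2, 8, 50)
```

In this sketch the normalization statistics are computed per time step rather than over the whole time axis (as IN would) or over the batch (as BN would), which is one plausible reading of "preserve temporal relationships in each instance" in the paper.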

yusuke-ai commented 10 hours ago

Thank you for the reply! OK. I will check.