winddori2002 / DEX-TTS

DEX-TTS: Diffusion-based EXpressive TTS with Style Modeling on Time Variability
MIT License

Possible inconsistency between the paper and the code? #7

Status: Closed (closed by yusuke-ai 3 weeks ago)

yusuke-ai commented 3 weeks ago

Hi,

Thank you for the awesome work!

I'm reading your paper and the code, and there may be an inconsistency between them. The paper says:

T-V encoder contains a few residual convolution blocks, but we employ Layer Normalization (LN) instead of IN to preserve temporal relationships in each instance

but the code linked below does not seem to contain any layer normalization: https://github.com/winddori2002/DEX-TTS/blob/main/DEX-TTS/model/ref_encoder.py#L131

Should I add layer normalization to the code, or is it OK to leave it without LN?

Thank you!

winddori2002 commented 3 weeks ago

Hi, thanks for your interest.

For the T-V encoder, replacing the BN in TVEncoderBlock with LN worked better. You can check the TVEncoderBlock and BasicConv classes.
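
For readers following along, here is a rough, dependency-free sketch of how the normalization axes differ between IN and LN on a [channels][time] feature map. It is not the repository's code: the helper names (`_normalize`, `instance_norm`, `layer_norm`), the toy input, and the choice to apply LN over the channel axis of each frame are all assumptions made for illustration, and learnable scale/shift parameters are left out.

```python
import math

def _normalize(values, eps=1e-5):
    """Shift and scale a list of floats to zero mean and roughly unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def instance_norm(x):
    """x is [channels][time]; normalize each channel over the time axis (IN-style)."""
    return [_normalize(channel) for channel in x]

def layer_norm(x):
    """x is [channels][time]; normalize each time frame over the channel axis.

    Assumption: LN is applied per frame across channels, which is one common
    way to use it on sequence features. The repository may do it differently.
    """
    frames = [_normalize(list(frame)) for frame in zip(*x)]  # [time][channels]
    return [list(channel) for channel in zip(*frames)]       # back to [channels][time]

# Hypothetical toy features: 3 channels, 4 time frames.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 0.0, 5.0, 5.0],
     [2.0, 1.0, 0.0, 3.0]]

x_in = instance_norm(x)  # statistics computed along time, one set per channel
x_ln = layer_norm(x)     # statistics computed along channels, one set per frame
```

With IN, every channel ends up with zero mean along time; with this LN variant, every frame ends up with zero mean across channels, so the statistics are never pooled over the time axis.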

yusuke-ai commented 3 weeks ago

Thank you for the reply! OK. I will check.