yl4579 / StyleTTS-ZS

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Training details #1

Open lumpidu opened 3 days ago

lumpidu commented 3 days ago

Hi, very interesting paper!

When you publish the training scripts, could you also add some intuition about the training procedure and your training metrics: number of GPUs, number of steps, memory requirements, etc.?

Thanks in advance!

yl4579 commented 3 days ago

Thanks for your interest in this work! I'm very busy right now writing another paper and preparing for job hunting and graduation, but all the information needed for training is included in the Model Training section of the paper. I did the training in Jupyter notebooks again, so the code is pretty messy, but I'll share it once it's cleaned up.

Cleaning the code may take some time, especially for the LibriLight dataset. The big model took a month to train on my lab's GPUs, although some experiments were conducted on H100s during my internship, which made training much faster. If anyone is willing to provide compute resources or help debug and clean the code for the large-scale models, feel free to email me at yl4579@columbia.edu.

yl4579 commented 2 days ago

I have gotten many emails in less than a day. Thank you very much! However, coordinating the work individually over email is difficult, so I have created a Discord server for that purpose. Please join if you are willing to help :)