yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

What is the Overall Compute Footprint of the Pipeline? #49

Closed · snakers4 closed this issue 2 years ago

snakers4 commented 2 years ago

Hi @yl4579!

Many thanks for your repo and the paper.

In the paper I have found the following statements:

We train our model for 150 epochs, with a batch size of 10 two-second long audio segments.

Using Parallel WaveGAN vocoder, our model can convert an audio clip hundreds of times faster than real time on Tesla P100, which makes it suitable for real-time voice conversion applications.

I understand that GAN training is tricky and that mileage may vary depending on the machine used, but in our practice we mostly measure the required compute in GPU-days on some popular GPUs, namely 1080 Ti / 3090 / V100 / A100, etc.

This is of course very crude and assumes that the code is "perfect", that the GPU is not "waiting", that there are no other delays (i.e. validation, start-up time, etc.), that CPU resources are adequate, and that there are no IO bottlenecks.
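
To be concrete, by GPU-days I just mean wall-clock training time multiplied by the number of GPUs used. A minimal sketch of the metric, with placeholder numbers rather than measurements from this repo:

```python
def gpu_days(wall_clock_hours: float, num_gpus: int) -> float:
    """Crude compute footprint: wall-clock hours x number of GPUs, expressed in days."""
    return wall_clock_hours * num_gpus / 24.0

# Placeholder numbers, purely to illustrate the metric (not measurements from this repo):
print(gpu_days(wall_clock_hours=48, num_gpus=1))  # 2.0 GPU-days on a single GPU
print(gpu_days(wall_clock_hours=12, num_gpus=8))  # 4.0 GPU-days on an 8-GPU node
```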

But I have not found anywhere in the paper or in the issues what the approximate compute footprint of this pipeline was for:

As for STT models, I know from practice how long they take to train. Thank you in advance!

yl4579 commented 2 years ago

Sorry for the late reply. I was pretty busy at the end of my semester. I will make the F0 training code available by the end of this month. As for StarGANv2-VC training, it took approximately 2 days to train on a Tesla P100 for 150 epochs with a batch size of 5. You may increase the batch size to make the training faster.
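
In terms of the GPU-days metric above, that works out to roughly 2 GPU-days end to end. A quick back-of-the-envelope using only the figures in this thread, assuming a single P100 and counting only wall-clock training time:

```python
# Back-of-the-envelope from the figures above (one Tesla P100, ~2 days, 150 epochs):
wall_clock_hours = 2 * 24                            # ~48 hours of training
num_gpus = 1                                         # single Tesla P100
epochs = 150

gpu_days = wall_clock_hours * num_gpus / 24          # ~2 GPU-days in total
minutes_per_epoch = wall_clock_hours * 60 / epochs   # ~19 minutes per epoch at batch size 5
print(gpu_days, minutes_per_epoch)                   # 2.0 19.2
```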

snakers4 commented 2 years ago

Many thanks for the info!