yl4579 / StarGANv2-VC

StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion
MIT License

Regarding Non-speech Vocal data in a dataset #92

Closed by SoshyHayami 8 months ago

SoshyHayami commented 8 months ago

I was wondering whether I could include each speaker's laughing, sobbing and crying sounds in the dataset. Is it possible to clone these as well? Since I assume these sounds contain no phonemes, I'm worried they would affect the overall quality.

If it's possible, how much of this data do you think would be reasonable to include?


And sorry, while I'm here, let me ask another question I've had: should the training samples all be the same length (they seem to be 5 seconds long)? What happens if I have samples of varying lengths?

yl4579 commented 8 months ago

The training samples can be any length because we only use a clip from each one anyway; cutting the audio into 5-second segments is only for convenience. It also works for laughing, breathing and crying, although the pretrained ASR model needs to be trained on datasets that include these sounds too.
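
For anyone wondering what "only use a clip" could look like in practice, here is a minimal sketch. It is not the repository's actual data loader: the function name `random_clip`, the `pad_value` default, and the clip length of 192 frames in the usage lines are all assumptions made for illustration. The idea is that a sample longer than the clip length is randomly cropped, and a shorter one is padded, so every training example ends up the same size regardless of the original file length.

```python
import random


def random_clip(frames, clip_len, pad_value=0.0, rng=None):
    """Return exactly `clip_len` items taken from `frames`.

    `frames` is any sequence of per-frame values (hypothetical stand-in
    for spectrogram frames or waveform samples). Longer inputs are
    cropped at a random offset; shorter inputs are right-padded.
    """
    if rng is None:
        rng = random.Random()

    n = len(frames)
    if n <= clip_len:
        # Too short (or exact): pad on the right up to the clip length.
        # pad_value=0.0 is an assumption; a real pipeline may pad differently.
        return list(frames) + [pad_value] * (clip_len - n)

    # Long enough: pick a random start so the clip fits inside the input.
    # random.Random.randint is inclusive on both ends.
    start = rng.randint(0, n - clip_len)
    return list(frames[start:start + clip_len])


# Usage: files of different lengths all yield clips of the same size.
short_sample = [0.5] * 50              # shorter than the clip length
long_sample = list(range(1000))        # much longer than the clip length
print(len(random_clip(short_sample, 192)))
print(len(random_clip(long_sample, 192)))
```

With something like this, a different clip is drawn each time a long file is visited, so pre-cutting the audio to a fixed length mostly affects convenience rather than what the model sees.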