moaazassali closed this issue 1 year ago
Hi,
It works well as long as the audio file contains a single speaker's speech. In addition, I used random frame sampling (you can see the code in the dataset class), which samples windows of about 1 second. That said, I think it is better to increase the batch size for optimization.
Further, I think the data is too small to train from scratch and still get high performance. It is probably better to fine-tune with a lower learning rate.
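For reference, the random ~1-second frame sampling mentioned above could look like the sketch below. This is a minimal, hypothetical version (the function name, window length, and padding strategy are assumptions, not the repo's actual dataset code):

```python
import numpy as np

def sample_random_window(mel, win_len=80):
    """Sample a random contiguous window of mel frames.

    mel: (T, n_mels) array. win_len: frames per window (~1 s at a
    12.5 ms hop, an assumed value). Utterances shorter than the window
    are padded by repetition so every sample has the same shape.
    """
    T = mel.shape[0]
    if T < win_len:
        reps = int(np.ceil(win_len / T))
        return np.tile(mel, (reps, 1))[:win_len]
    start = np.random.randint(0, T - win_len + 1)  # inclusive of last valid start
    return mel[start:start + win_len]
```

Because every sampled window has a fixed length, windows from many utterances can be stacked into larger batches, which is what makes the batch-size suggestion practical.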
Thanks
Hello,
I tried fine-tuning with a 9:32 clip of Trump (from here: https://www.youtube.com/watch?v=6a1Mdq8-_wo), but I am getting bad results. I split the audio into 6:40 for training and 2:52 for validation. Here are the steps I took for preprocessing:
I modified the ExtractMelstats() function by removing the loop and working on a single wav_name instead, and I edited ProcessingTrainData() to set the speaker variable to be the same as wav_name. With those edits, I ran my own preprocess_custom.py code, shown below. I didn't call GetSpeakerInfo() or SplitDataset() since I only had two files and hard-coded the relevant details. I also did not call GetMetaResults(), because it deals with text and the training code did not appear to use it.
Overall, I think the code works as expected with the two wav files for training purposes.
```python
def main(cfg):
    seed_init()
    MakeDir(cfg.output_path)

    trump_train_wav = './base_data/trump/wav/trump_train.wav'
    trump_valid_wav = './base_data/trump/wav/trump_valid.wav'
    wn2info = {}

    print('---Feature extraction---')
    train_result = ProcessingTrainData(trump_train_wav, cfg)
    valid_result = ProcessingTrainData(trump_valid_wav, cfg)

    wav_name, mel, lf0, mel_len = train_result
    wn2info[wav_name] = [mel, lf0, mel_len, "trump_train"]
    mean, std = ExtractMelstats(wn2info, wav_name, cfg)  # only use train wav for normalizing stats

    print('---Write Train Features---')
    train_results = SaveFeatures(wav_name, wn2info[wav_name], 'train', cfg)

    wav_name, mel, lf0, mel_len = valid_result
    wn2info[wav_name] = [mel, lf0, mel_len, "trump_valid"]
    print('---Write Valid Features---')
    valid_results = SaveFeatures(wav_name, wn2info[wav_name], 'valid', cfg)

    print('---Write Infos---')
    Write_json([train_results], f'{cfg.output_path}/train.json')
    Write_json([valid_results], f'{cfg.output_path}/valid.json')
    print('---Done---')
```
With the preprocessing done, I started training. The only change I made was commenting out the Tester() part in train.py and main.py, since it uses eval data that I don't have. From what I understand, that has no effect on training, so commenting it out shouldn't impact performance. I fine-tuned the model with a learning rate of 1e-6 for 1000 epochs with the --resume=True option, starting from the provided model-mel-split.pth. The lowest validation loss was at epoch 1 and never improved after that. When I then run voice conversion with convert.py using the latest fine-tuned model, the audio is very bad and has a lot of 'static' noise. In fact, the more I increase the epochs, the less intelligible it becomes.
I am not sure if I am doing something wrong in the code or perhaps my preprocessing script is missing something. Have you tried fine-tuning the model with another voice (like any public figure with online videos for testing purposes)?
Thanks, and any help would be appreciated!
EDIT: Also, training on this audio clip with lr=1e-6 for 1000 epochs took ~20 mins. Not sure if that is relevant, but I thought it was a bit quick?
Hi,
I have not tried fine-tuning on a specific person, but there are some things to try.
First, check the vocoder performance: can the speech be reconstructed well from the mel features alone, without forwarding through the VC model?
Also, I checked the video, and I think it's better to split the whole recording into several segments (e.g., 10 seconds each).
That increases the batch size, which can help optimization. You may also need to adjust the learning rate.
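The splitting step suggested above could be sketched as follows. This is a minimal, hypothetical helper working on an in-memory array (the function name and the tail-dropping threshold are assumptions; writing each segment back to disk, e.g. with soundfile, is left out):

```python
import numpy as np

def split_audio(audio, sr, seg_sec=10.0, min_tail_sec=2.0):
    """Split a 1-D audio array into fixed-length segments.

    audio: 1-D sample array; sr: sample rate in Hz.
    A final segment shorter than min_tail_sec is dropped, since very
    short clips add little to training.
    """
    seg_len = int(seg_sec * sr)
    min_tail = int(min_tail_sec * sr)
    segments = []
    for start in range(0, len(audio), seg_len):
        chunk = audio[start:start + seg_len]
        if len(chunk) < min_tail:
            break
        segments.append(chunk)
    return segments
```

Each resulting segment is then preprocessed like a separate utterance, so several segments can be packed into one training batch.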
Thanks.
Additionally, it could be a problem with "resume".
What if you just load the previous weights and fine-tune them with a new optimizer?
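In PyTorch terms, the suggestion above could be sketched like this. The tiny stand-in model, file name, and checkpoint key `"model"` are assumptions for illustration; in practice you would load the real VC model and model-mel-split.pth:

```python
import torch
import torch.nn as nn

# Stand-in for the VC model, just so the sketch is self-contained.
model = nn.Linear(4, 4)
torch.save({"model": model.state_dict(), "optimizer": {}}, "ckpt.pth")

# Load ONLY the model weights from the checkpoint ...
ckpt = torch.load("ckpt.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest the weights under a key
model.load_state_dict(state)

# ... and attach a *fresh* optimizer, instead of resuming the saved
# optimizer state with --resume. The old Adam moment estimates and LR
# schedule are discarded on purpose.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```

The point is that a resumed optimizer carries momentum statistics and possibly a decayed learning rate from the original training run, which may fight against fine-tuning on a new, much smaller dataset.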
Also, if you use the pre-trained model, it may be better to keep using the statistics of the VCTK dataset (the mean and std used for normalization).
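Concretely, this means normalizing the new speaker's mel features with the precomputed VCTK statistics rather than statistics recomputed from a few minutes of new audio. A minimal sketch (the function name is hypothetical; mean and std would be loaded from the files shipped with the pre-trained model):

```python
import numpy as np

def normalize_mel(mel, mean, std, eps=1e-8):
    """Per-bin normalization with *precomputed* statistics.

    mel: (T, n_mels) features; mean/std: (n_mels,) arrays taken from
    the VCTK training set, so the fine-tuning inputs live in the same
    feature space the pre-trained model was optimized for.
    """
    return (mel - mean) / (std + eps)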
I was wondering if it is possible to just train on one long wav file of 10 mins+ and split it into 3 files with 60%, 20%, and 20% for train, validation, and test set like the paper mentions. Does that work right away or will I have to split the long audio file into separate audio files of single sentences like the VCTK dataset?