winddori2002 / TriAAN-VC

TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion

Can I train on one long audio file (10 mins+)? #4

Closed · moaazassali closed this issue 1 year ago

moaazassali commented 1 year ago

I was wondering if it is possible to train on just one long wav file of 10 mins+ and split it into 3 files with 60%, 20%, and 20% for the train, validation, and test sets, as the paper mentions. Does that work right away, or will I have to split the long audio file into separate single-sentence audio files like the VCTK dataset?
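
For concreteness, here is roughly what I mean (just a sketch; the input path and ratios are examples):

```python
# Rough sketch: cut one long wav into contiguous 60/20/20 train/valid/test parts.
import soundfile as sf

wav, sr = sf.read('./base_data/long_recording.wav')   # hypothetical 10 min+ file
n = len(wav)
parts = {
    'train': wav[: int(0.6 * n)],
    'valid': wav[int(0.6 * n): int(0.8 * n)],
    'test':  wav[int(0.8 * n):],
}
for name, chunk in parts.items():
    sf.write(f'./base_data/{name}.wav', chunk, sr)
```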

winddori2002 commented 1 year ago

Hi,

It works well as long as the audio file contains a single speaker's speech. In addition, I used random frame sampling (you can see the code in the dataset class), which samples segments of about 1 second. But I think it is better to increase the batch size for optimization.
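
The sampling looks roughly like this (a simplified sketch only; the function name and crop length here are not the actual code, see the dataset class for the real version):

```python
import numpy as np

def random_crop(mel: np.ndarray, crop_len: int = 128) -> np.ndarray:
    """Randomly crop about 1 second of frames from a (T, n_mels) mel spectrogram."""
    total = mel.shape[0]
    if total <= crop_len:
        # Pad short utterances so every training sample has the same length.
        return np.pad(mel, ((0, crop_len - total), (0, 0)), mode='constant')
    start = np.random.randint(0, total - crop_len)
    return mel[start:start + crop_len]
```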

Further, I think the data is too small to train from scratch and still get high performance. It is probably better to fine-tune with a lower learning rate.

Thanks

moaazassali commented 1 year ago

Hello,

I tried fine-tuning with a 9:32 clip of Trump (from https://www.youtube.com/watch?v=6a1Mdq8-_wo), but I am getting bad results. I split the audio into 6:40 for training and 2:52 for validation. Here are the steps I took for preprocessing:

With those edits, I ran my own preprocess_custom.py code, shown below. I didn't call GetSpeakerInfo() or SplitDataset(), since I only had two files and hard-coded the relevant details. I also did not call GetMetaResults(), since it deals with text and the training code didn't appear to use it.

Overall, I think the code works as expected with the two wav files for training purposes.

def main(cfg):
    seed_init()
    MakeDir(cfg.output_path)

    trump_train_wav = './base_data/trump/wav/trump_train.wav'
    trump_valid_wav = './base_data/trump/wav/trump_valid.wav'

    wn2info = {}

    print('---Feature extraction---')
    train_result = ProcessingTrainData(trump_train_wav, cfg)
    valid_result = ProcessingTrainData(trump_valid_wav, cfg)

    wav_name, mel, lf0, mel_len = train_result
    wn2info[wav_name] = [mel, lf0, mel_len, "trump_train"]

    mean, std = ExtractMelstats(wn2info, wav_name, cfg) # only use train wav for normalizing stats

    print('---Write Train Features---')
    train_results = SaveFeatures(wav_name, wn2info[wav_name], 'train', cfg)

    wav_name, mel, lf0, mel_len = valid_result
    wn2info[wav_name] = [mel, lf0, mel_len, "trump_valid"]

    print('---Write Valid Features---')
    valid_results = SaveFeatures(wav_name, wn2info[wav_name], 'valid', cfg)

    print('---Write Infos---')
    Write_json([train_results], f'{cfg.output_path}/train.json')
    Write_json([valid_results], f'{cfg.output_path}/valid.json')

    print('---Done---')

With the preprocessing done, I then started training. The only change I made was commenting out the Tester() part in train.py and main.py, which uses eval data that I don't have. From what I understand, that has no effect on training, so commenting it out shouldn't impact performance.

I fine-tuned the model with a learning rate of 1e-6 for 1000 epochs with the --resume=True option, starting from the provided model-mel-split.pth. The lowest valid loss was at epoch 1 and did not improve afterwards. When I do the voice conversion with convert.py using the latest fine-tuned model, the audio is very bad and has a lot of 'static' noise. In fact, the more I increase the epochs, the less intelligible it becomes.

I am not sure if I am doing something wrong in the code or perhaps my preprocessing script is missing something. Have you tried fine-tuning the model with another voice (like any public figure with online videos for testing purposes)?

Thanks, and any help would be appreciated!

EDIT: Also, training on this audio clip with lr=1e-6 and 1000 epochs took ~20 mins. Not sure if that is relevant, but I thought it was a bit quick?

winddori2002 commented 1 year ago

Hi,

I have not tried fine-tuning for a specific person, but I think there are a few things to try.

You can check the vocoder performance, i.e., whether the speech can be reconstructed well from your features without forwarding through the VC model.
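
For example, something like this (just a sketch; the paths, stats handling, and the ParallelWaveGAN-style interface are assumptions, so adjust it to whatever vocoder you actually use for conversion):

```python
# Sketch: resynthesize speech from the preprocessed mel with the vocoder only,
# skipping the VC model, to see whether the features/vocoder are the problem.
import numpy as np
import torch
import soundfile as sf
from parallel_wavegan.utils import load_model

mel = np.load('./preprocessed/mel/trump_valid.npy')   # (T, n_mels), placeholder path
vocoder = load_model('./vocoder/checkpoint.pkl')      # placeholder checkpoint path
vocoder.remove_weight_norm()
vocoder.eval()

with torch.no_grad():
    # If the saved mel is normalized, de-normalize it with the matching mean/std first.
    wav = vocoder.inference(torch.from_numpy(mel).float()).view(-1).cpu().numpy()

sf.write('vocoder_check.wav', wav, 24000)              # use your vocoder's sample rate
```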

I also checked the video, and I think it's better to split the whole recording into several segments (e.g., 10 seconds each).

This increases the batch size, which can help optimization. You may also need to adjust the learning rate.
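
For example (a minimal sketch; the paths and the 10-second length are just examples):

```python
# Sketch: split one long wav into ~10-second segments so each training file is short.
import os
import soundfile as sf

def split_wav(path: str, out_dir: str, seg_sec: float = 10.0):
    os.makedirs(out_dir, exist_ok=True)
    wav, sr = sf.read(path)
    seg_len = int(seg_sec * sr)
    for idx, start in enumerate(range(0, len(wav), seg_len)):
        chunk = wav[start:start + seg_len]
        if len(chunk) < sr:   # skip a trailing chunk shorter than 1 second
            continue
        sf.write(os.path.join(out_dir, f'segment_{idx:04d}.wav'), chunk, sr)

split_wav('./base_data/trump/wav/trump_train.wav', './base_data/trump/wav/segments')
```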

Thanks.

winddori2002 commented 1 year ago

Additionally, it could be a problem with "resume".

What if you just load the previous weights and fine-tune them with a new optimizer?
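
Something like this (a sketch only; the checkpoint key is an assumption, so check what model-mel-split.pth actually contains):

```python
# Sketch: load only the pretrained weights and build a fresh optimizer,
# instead of resuming the saved optimizer/scheduler state.
import torch

def load_for_finetune(model: torch.nn.Module, ckpt_path: str, lr: float = 1e-5):
    checkpoint = torch.load(ckpt_path, map_location='cpu')
    # The 'model' key is a guess; inspect the checkpoint for the real state-dict key.
    state = checkpoint['model'] if isinstance(checkpoint, dict) and 'model' in checkpoint else checkpoint
    model.load_state_dict(state)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # new optimizer, lower lr
    return model, optimizer
```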

Also, if you use the pre-trained model, it may be better to use the statistics of the VCTK dataset (mean and std for normalization).
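
For example (a sketch; the stats file name and shape are assumptions, it should be whatever the original preprocessing saved for VCTK):

```python
# Sketch: normalize the new speaker's mels with the original VCTK statistics
# instead of statistics computed from the small fine-tuning set.
import numpy as np

stats = np.load('./base_data/mel_stats.npy')   # assumed shape (2, n_mels): [mean, std]
mean, std = stats[0], stats[1]

def normalize(mel: np.ndarray) -> np.ndarray:
    return (mel - mean) / (std + 1e-8)
```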