Hi @yangdaowu, was the dataset you used the MUSDB18 sample dataset? Did you download it from this guide?
Yes, I am using the MUSDB18 dataset.
I mean, was it the MUSDB18 sample dataset or the full dataset?
It is the complete dataset.
This happens when I keep scrolling the command-line output, but it gets stuck here.
I suspected overfitting due to the small size of the sample training dataset, but it was the full dataset; I understand now.
I also sometimes encountered NaN when training LaSAFT-based models, especially the larger ones. They have quite sensitive hyperparameters to tune.
I checked that the script below works quite well, but it is for a larger model (Table 2 in our paper, the n_fft=4096 version; it needs 4 GPUs).
python main.py --problem_name conditioned_separation --mode train --musdb_root ../repos/musdb18_wav --n_blocks 9 --num_tdfs 6 --n_fft 4096 --hop_length 1024 --precision 16 --embedding_dim 64 --pin_memory True --save_top_k 3 --patience 10 --deterministic --model lasaft_net --gpus 4 --distributed_backend ddp --sync_batchnorm True --run_id lasaft_2021_als --batch_size 4 --seed 2021 --log wandb --lr 0.0001 --auto_lr_schedule True
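As a side note on the NaN issue above, here is a minimal, hypothetical sketch (plain PyTorch, not this repository's Lightning-based trainer) of the kind of guard one can add to a training step so that a single non-finite loss does not corrupt the weights; the model, loss function, and max_norm value are placeholders:

```python
import torch

def guarded_step(model, batch, optimizer):
    # Hypothetical training step; the real project trains through its own
    # Lightning Trainer, so this only illustrates the general NaN guard.
    inputs, targets = batch
    loss = torch.nn.functional.l1_loss(model(inputs), targets)

    if not torch.isfinite(loss):
        # Skip the update instead of back-propagating NaN/Inf gradients.
        optimizer.zero_grad()
        return None

    loss.backward()
    # Gradient clipping also tends to help keep half-precision (--precision 16)
    # runs numerically stable; 5.0 is an arbitrary example value.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```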
Let me share what I've tried to avoid the NaN issue:
- --batch_size 8 instead of 6
- --lr 0.0005
- --optimizer rmsprop was relatively robust for training LaSAFT (default: --optimizer adam)
- --deterministic True slows down training, but it helps NaN debugging (see the sketch after this list).
- The auto learning rate scheduling option (--auto_lr_schedule True) might also be helpful, but it could slow training down, or even lead to a saddle point problem.
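On the NaN-debugging point, here is a small sketch of PyTorch's built-in anomaly detection. It is separate from the --deterministic flag, but it is a common way to locate the operation whose backward pass first produces a NaN (it also slows training down):

```python
import torch

# Anomaly detection makes autograd raise an error, with a traceback pointing
# to the offending forward op, as soon as a backward pass produces NaN values.
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)
loss = model(x).pow(2).mean()
loss.backward()  # would raise a RuntimeError here if any gradient were NaN
```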
I have changed the factor of the scheduler to 0.5 (it was 0.1 before, which I found too severe). Please pull the master branch.
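For reference, a minimal sketch of what that factor means, assuming a standard ReduceLROnPlateau-style scheduler (the repository's --auto_lr_schedule option may be wired up differently):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# factor=0.5 halves the learning rate whenever the monitored metric stops
# improving; the previous factor=0.1 cut it to a tenth, which was too severe.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)

val_loss = 0.42  # placeholder validation loss for illustration
scheduler.step(val_loss)
print(optimizer.param_groups[0]["lr"])
```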
Thank you. I will reset it after this training.
Hello
The run has reached 115 epochs with loss=nan, but when I checked the checkpoints, the last saved ckpt is from the 79th epoch. I would like to know why.
I am using the example command you gave: python main.py --problem_name conditioned_separation --mode train --run_id lasaft_net --musdb_root etc/musdb18_dev_wav --gpus 1 --precision 16 --batch_size 6 --num_workers 0 --pin_memory True --save_top_k 3 --save_weights_only True --patience 10 --lr 0.001 --model CUNET_TFC_GPoCM_LaSAFT
I await your feedback, thank you very much.