Hi @yangdaowu, was the dataset you used the MUSDB18 sample dataset? Did you download it from this guide?
Yes, I am using the MUSDB18 dataset.
I mean, was it the MUSDB18 sample dataset or the full dataset?
It is the complete dataset.
This happens when I keep scrolling the command-line output, but it gets stuck here.
I suspected overfitting due to the small size of the sample training dataset, but it was the full dataset; I understand now.
I also sometimes encountered NaN when training LaSAFT-based models, especially the larger ones. They have quite sensitive hyperparameters to tune.
I checked that the script below works quite well, but it is for a larger model (Table 2 in our paper, the n_fft=4096 version; it needs 4 GPUs).
python main.py --problem_name conditioned_separation --mode train --musdb_root ../repos/musdb18_wav --n_blocks 9 --num_tdfs 6 --n_fft 4096 --hop_length 1024 --precision 16 --embedding_dim 64 --pin_memory True --save_top_k 3 --patience 10 --deterministic --model lasaft_net --gpus 4 --distributed_backend ddp --sync_batchnorm True --run_id lasaft_2021_als --batch_size 4 --seed 2021 --log wandb --lr 0.0001 --auto_lr_schedule True
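As a side note on the NaN issue above, here is a minimal, hypothetical sketch (plain PyTorch, not this repository's Lightning-based trainer) of the kind of guard one can add to a training step so that a single non-finite loss does not corrupt the weights; the model, loss function, and max_norm value are placeholders:

```python
import torch

def guarded_step(model, batch, optimizer):
    # Hypothetical training step; the real project trains through its own
    # Lightning Trainer, so this only illustrates the general NaN guard.
    inputs, targets = batch
    loss = torch.nn.functional.l1_loss(model(inputs), targets)

    if not torch.isfinite(loss):
        # Skip the update instead of back-propagating NaN/Inf gradients.
        optimizer.zero_grad()
        return None

    loss.backward()
    # Gradient clipping also tends to help keep half-precision (--precision 16)
    # runs numerically stable; 5.0 is an arbitrary example value.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```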
Let me share what I've tried to avoid the NaN issue:
- --batch_size 8 instead of 6
- --lr 0.0005
- --optimizer rmsprop was relatively robust for training LaSAFT (default: --optimizer adam)
- --deterministic True slows down training, but it helps NaN debugging (see the sketch after this list).
- The auto learning rate scheduling option (--auto_lr_schedule True) might also be helpful, but it could slow training down, or even lead to a saddle point problem.
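On the NaN-debugging point, here is a small sketch of PyTorch's built-in anomaly detection. It is separate from the --deterministic flag, but it is a common way to locate the operation whose backward pass first produces a NaN (it also slows training down):

```python
import torch

# Anomaly detection makes autograd raise an error, with a traceback pointing
# to the offending forward op, as soon as a backward pass produces NaN values.
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Linear(8, 1)
x = torch.randn(4, 8)
loss = model(x).pow(2).mean()
loss.backward()  # would raise a RuntimeError here if any gradient were NaN
```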
I have changed the factor of the scheduler to 0.5 (it was 0.1 before, which I found too severe). Please pull the master branch.
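For reference, a minimal sketch of what that factor means, assuming a standard ReduceLROnPlateau-style scheduler (the repository's --auto_lr_schedule option may be wired up differently):

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# factor=0.5 halves the learning rate whenever the monitored metric stops
# improving; the previous factor=0.1 cut it to a tenth, which was too severe.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)

val_loss = 0.42  # placeholder validation loss for illustration
scheduler.step(val_loss)
print(optimizer.param_groups[0]["lr"])
```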
Thank you. I will reset it after this training.
Hello
The run has reached 115 epochs with loss=nan, but when I checked the checkpoints, the last saved ckpt is from the 79th epoch. I would like to know why.
I am using the example command you gave: python main.py --problem_name conditioned_separation --mode train --run_id lasaft_net --musdb_root etc/musdb18_dev_wav --gpus 1 --precision 16 --batch_size 6 --num_workers 0 --pin_memory True --save_top_k 3 --save_weights_only True --patience 10 --lr 0.001 --model CUNET_TFC_GPoCM_LaSAFT
I await your feedback, thank you very much.