Marshall-yao opened this issue 5 years ago
Just python train.py -opt options/train/train_EDVR_M.yml, that is the command I am using for one GPU.
@yaolugithub young666 has provided the command for training with one GPU. Thanks, @young666.
If you do not change the batch size in the config file, each iteration takes longer when you use fewer GPUs, because each GPU must process more samples per iteration (for example, with a total batch size of 32, each GPU handles 4 samples on 8 GPUs but 16 samples on 2 GPUs). I think that is why the performance does not decrease when you use two GPUs compared with eight GPUs.
Thanks very much, @young666. Before you gave your answer, I used the command python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/train_EDVR_M.yml --launcher pytorch and changed n_workers to 0 (the initial value is 3) to train the network.
@xinntao Yes, I did not change the batch size when I trained the code on two Titan X GPUs. It takes about 12 days for 600k iterations.
Besides: 1) When I trained on one GPU, which command is proper, mine or young666's? 2) When I used the above command to train, I changed the batch size to 16 because of CUDA out-of-memory errors, and the learning rate from 4e-4 to 2e-4. Is that learning rate proper?
Thanks .
@yaolugithub If you aren't sure which is proper, you could open the code, run each command, and see how the model is trained in each case.
@yaolugithub 1) The command provided by young666 is better, I think. 2) The original setting is batch size 32 on eight GPUs, so each GPU has 4 samples. If the memory allows, you can set 8 or more samples per GPU. I think you can keep the learning rate, but it's better to run a comparison to see which one is better.
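A common heuristic when shrinking the total batch size is the linear scaling rule: scale the learning rate proportionally to the batch size. A minimal sketch, assuming the default EDVR-M setting of batch size 32 with learning rate 4e-4 (the helper name is hypothetical, not from the repository):

```python
# Linear scaling rule: the learning rate scales with the total batch size.
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    return base_lr * new_batch / base_batch

# EDVR-M default: total batch size 32, learning rate 4e-4.
print(scaled_lr(4e-4, 32, 16))  # -> 2e-4, matching the setting tried above
```

This is only a rule of thumb; as suggested above, a short comparison run at each learning rate is the safer check.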
Thanks, Xintao. I will have a try with the learning rate comparison.
@young666
Thanks very much.
You said that you used the command python train.py -opt options/train/train_EDVR_M.yml to train.
Could you tell me whether you also changed n_workers from 3 to 0?
@young666 I have used your training command to train. It trains more quickly with n_workers = 3 than with 0.
@yaolugithub You should read the PyTorch documentation to understand what num_workers is and how many workers are enough.
@yaolugithub n_workers means the number of data-loading workers for each GPU. Empirically, you do not need to change it when you use a different number of GPUs for training.
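For context, n_workers in the config maps to the num_workers argument of PyTorch's DataLoader, which controls how many subprocesses prefetch batches in parallel. A minimal sketch with a stand-in dataset (the random tensors are placeholders, not the EDVR training data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in EDVR this would be the REDS video dataset.
dataset = TensorDataset(torch.randn(100, 3, 64, 64))

# num_workers > 0 spawns that many worker processes to prefetch batches;
# num_workers = 0 loads every batch in the main process, which is slower.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=3)

for (batch,) in loader:
    pass  # a training step would go here
```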
@young666 Thanks very much. The reason I changed num_workers from 3 to 0 is that the code got stuck while running. So I googled the problem, and someone said that changing num_workers to 0 can avoid this situation.
@xinntao Thanks very much for your reply.
Situation:
I trained the code with the command python train.py -opt options/train/train_EDVR_M.yml or python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/train_EDVR_M.yml --launcher pytorch and kept num_workers = 3 (unchanged) on one GPU.
But both runs got stuck at certain iterations. I then consulted some documents and set num_workers = 0, which solved the problem.
However, after the change the training time is about 100 s, while it was about 60 s before.
Do you know how to improve the training speed?
@yaolugithub I think the bottleneck is the I/O speed if you set num_workers to 0. You'd better get the multi-process data loader working.
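One workaround that is often suggested for data loaders that hang with num_workers > 0, and that may apply here because the EDVR loaders decode frames with OpenCV, is to disable OpenCV's internal threading, which can deadlock inside forked worker processes. A hedged sketch, not confirmed as the fix for this exact hang:

```python
import cv2

# OpenCV's own thread pool can deadlock when the DataLoader forks
# worker processes; restricting it to the calling thread is a
# commonly suggested workaround for stuck workers.
cv2.setNumThreads(0)
cv2.ocl.setUseOpenCL(False)

# Alternatively, switching the multiprocessing start method from
# 'fork' to 'spawn' avoids inheriting locked state in the workers:
# import torch.multiprocessing as mp
# mp.set_start_method('spawn', force=True)
```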
Thanks, Xintao.
Yes, num_workers is the number of worker processes the data loader uses to load batches, so we'd better set it larger than zero.
But then the training process gets stuck. Maybe it can be solved by modifying the code, but I do not know how to modify it.
Hi, Xintao. Is the training command python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/train_EDVR_M.yml --launcher pytorch OK for training on one GPU using the pretrained model from train_EDVR_woTSA_M.yml?
I have used this command to train on two GPUs (nproc_per_node=2, everything else unchanged) and the performance is close to the results in the paper.
So I have this question. Looking forward to your reply.
Best regards.
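For reference, loading the woTSA model as an initialization is configured through the path section of the options YAML. A minimal sketch following the key names used in the EDVR option files (the model path itself is a placeholder):

```yaml
# Excerpt from options/train/train_EDVR_M.yml; the path is a placeholder.
path:
  pretrain_model_G: ../experiments/pretrained_models/EDVR_woTSA_M.pth
  strict_load: false  # the TSA module is newly added, so load non-strictly
  resume_state: ~
```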