Training loss goes into nan values

youngwanLEE / vovnet-detectron2

[CVPR 2020] VoVNet backbone networks for detectron2

Training loss goes into nan values #3

Closed Samjith888 closed 4 years ago

Samjith888 commented 4 years ago

I got NaN loss values when training with the default VoVNet config. I then tried reducing base_lr to 0.001 and to 0.00025. That solved the NaN issue, but the training loss is barely decreasing (it starts at 1.9 and has only reached 0.7), and the AP is 11 after 75,000 iterations.

Dataset: 57,000 images with one class. The images have different resolutions.

youngwanLEE commented 4 years ago

@Samjith888

The default hyperparameters of CenterMask are tuned for the COCO dataset using a batch size of 16 on 8 GPUs.

You have to find proper hyperparameters for your own dataset and environment.

How many GPUs and what batch size are you using?

When you change the batch size for training, you have to adjust base_lr as well.

I recommend adjusting the hyperparameters as follows:

* Reduce base_lr
* Increase WARMUP_ITERS
* Increase the batch size as much as possible
* Change the backbone to a lightweight model
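
As a rough illustration of adjusting base_lr together with the batch size, here is a minimal sketch of the commonly used linear scaling heuristic: the learning rate is scaled by the ratio of the new batch size to the reference batch size, and the warmup is lengthened by the inverse ratio. The function name `rescale_solver` and the reference values in the usage example are hypothetical and are not taken from this repository's configs. Treat the result as a starting point for tuning, not a guaranteed fix for NaN losses.

```python
def rescale_solver(ref_base_lr, ref_batch_size, new_batch_size, ref_warmup_iters):
    """Rescale learning rate and warmup length for a different batch size.

    Linear scaling heuristic (an assumption, not the maintainer's recipe):
      new_lr     = ref_lr * (new_batch / ref_batch)
      new_warmup = ref_warmup / (new_batch / ref_batch)
    """
    if ref_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")

    scale = new_batch_size / ref_batch_size
    return {
        # A smaller batch gives a proportionally smaller learning rate.
        "base_lr": ref_base_lr * scale,
        # A smaller batch gives a proportionally longer warmup.
        "warmup_iters": int(round(ref_warmup_iters / scale)),
    }


if __name__ == "__main__":
    # Hypothetical reference values: replace them with the ones from your config.
    settings = rescale_solver(
        ref_base_lr=0.01,      # placeholder learning rate for the reference setup
        ref_batch_size=16,     # the thread states the defaults assume batch size 16
        new_batch_size=2,      # e.g. what fits on a single GPU
        ref_warmup_iters=1000, # placeholder warmup length
    )
    print(settings)
```

In a detectron2 config, the resulting values would typically go into `SOLVER.BASE_LR`, `SOLVER.WARMUP_ITERS`, and `SOLVER.IMS_PER_BATCH`.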

Samjith888 commented 4 years ago

Dataset: 52,000 images, one class, 3.5 lakh (350,000) instances. Number of GPUs: 1. What batch size and base_lr should I use in that case?

youngwanLEE commented 4 years ago

I don't know which settings are best for every environment.

However, I recommend using as large a batch size as your GPUs can handle.