Some help needed in training

jarjuk commented 4 years ago

First, thank you for this nice and clean implementation!

I have created a small repo, https://github.com/jarjuk/yolov3-tf2-training, documenting how I Dockerized your implementation (marcus2002/yolov3-tf2-training:0) and used it to train on the VOC2012 image set on an Amazon g4dn.xlarge instance.

Training was interrupted twice by "early stopping", and the resulting model performed worse than the original darknet weights.

Could you please give me some advice on how to achieve better training results?

I have tried to document all the steps in Emacs Org documents (docker.org and aws.org) in the repository to make it easier to give advice.

BR, Jukka

Morgensol commented 4 years ago

Early stopping is there to keep the model from overfitting the data. If it stops too early for you, you can set the patience higher, or even comment it out and manually monitor when the validation loss converges.

For the second part, it all depends on the parameters you are using, such as the learning rate or the transfer mode you have set, so I think you need to specify those if you need help.
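To make the effect of the patience setting concrete, here is a minimal pure-Python sketch of a patience rule of the kind early stopping uses. It is an illustration only, not the code from train.py: the function name `early_stop_epoch` is hypothetical, and the validation losses are made-up numbers chosen to show a temporary plateau.

```python
from typing import Optional, Sequence


def early_stop_epoch(
    val_losses: Sequence[float], patience: int, min_delta: float = 0.0
) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: a short plateau, then further improvement.
losses = [30.1, 24.5, 22.0, 22.4, 22.3, 22.1, 21.5, 21.0]

print(early_stop_epoch(losses, patience=3))  # stops during the plateau
print(early_stop_epoch(losses, patience=5))  # survives the plateau
```

With a patience of 3 the run ends inside the plateau, while a patience of 5 lets it reach the later improvement, which is the trade-off to weigh when raising the value.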

jarjuk commented 4 years ago

I have trained VOC2012 in two sessions:

session 1 (as documented in yolov3-tf2/docs/training.voc section training/with transfer learning)
session 2 (continuing from the last checkpoint of session 1, with transfer: fine_tune, mode: fit)

Both sessions used the default learning rate (1e-3).

The command for session 1:

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/yolov3-cnv.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --mode fit \
        --transfer darknet \
        --batch_size 16 \
        --epochs 10 \
        --weights_num_classes 80

And for session 2:

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/cont_20.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --mode fit \
        --transfer fine_tune \
        --batch_size 16 \
        --epochs 10 \
        --weights_num_classes 20

The result works, but not as well as the original darknet weights. See https://github.com/jarjuk/yolov3-tf2-training#detection-results for a comparison.

I am a novice in DNNs and would like to understand the best strategy for training a DNN, and yolov3-tf2 in particular.

Morgensol commented 4 years ago

I notice a couple of things:

As far as I'm aware, "fine_tune" cuts off your last layer, which would set you back a significant amount. If you want to transfer learn like that and resume from a checkpoint, I'd recommend looking at #38 or #154.
I also assume that the weights you use in your first run are the official weights trained on ImageNet, and that the weights you use in your second run are the transferred weights from the first run.
Another thing: you don't have to specify weights_num_classes in the second run if you did the transfer learning correctly.
And lastly, you haven't specified the image size. Are you sure the images are the size you want?
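As a rough sketch of the last two points, the session 2 command could be assembled with weights_num_classes left out and the image size stated explicitly. The helper `build_train_command` is hypothetical, and both the `--size` flag and the value 416 are assumptions here; check train.py's flag definitions and your own data before relying on them.

```python
import shlex
from typing import Dict, List, Union


def build_train_command(options: Dict[str, Union[str, int]]) -> List[str]:
    """Turn a flag-to-value mapping into an argv list for train.py."""
    argv = ["python", "train.py"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv


# Session 2 options from the thread, minus weights_num_classes.
session2 = {
    "dataset": "./voc.data/voc2012_train.tfrecord",
    "val_dataset": "./voc.data/voc2012_val.tfrecord",
    "weights": "./voc.data/cont_20.tf",
    "classes": "./data/voc2012.names",
    "num_classes": 20,
    "mode": "fit",
    "transfer": "fine_tune",
    "batch_size": 16,
    "epochs": 10,
    "size": 416,  # assumption: flag name and value are not from the thread
}

cmd = build_train_command(session2)
print(shlex.join(cmd))  # shell-quoted command line, ready to copy
```

Keeping the options in one mapping makes it easy to see at a glance which settings differ between the two sessions.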

zzh8829 / yolov3-tf2

Some help needed in training #247