zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

Some help needed in training #247

Open jarjuk opened 4 years ago

jarjuk commented 4 years ago

First, thank you for this nice and clean implementation!

I have created a small repo, https://github.com/jarjuk/yolov3-tf2-training, documenting how I Dockerized your implementation (marcus2002/yolov3-tf2-training:0) and used it to train on the VOC2012 image set on an Amazon g4dn.xlarge instance.

Training was interrupted twice by "early stopping", and the resulting model performed worse than the original darknet weights.

Could you please give me some advice on how to achieve better training results?

I have tried to document all the steps in Emacs Org documents (docker.org and aws.org) in the repository to make it easier to give advice.

BR, Jukka

Morgensol commented 4 years ago

Early stopping is there to keep the model from overfitting the data. If it stops too early for you, you can set the patience higher, or even comment it out and manually monitor when the validation loss converges.
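
To illustrate how the patience setting changes when training stops, here is a minimal pure-Python sketch. It is a simplified model of patience-based early stopping, not the code used by Keras or by train.py; the function name `stopping_epoch` and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule (an assumption for illustration): stop once the
    validation loss has failed to improve on its best value for
    `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # never triggered within the given epochs


# Hypothetical validation losses with a temporary plateau.
losses = [10.0, 8.0, 8.5, 8.2, 8.1, 7.5, 7.0]

print(stopping_epoch(losses, patience=3))  # stops during the plateau
print(stopping_epoch(losses, patience=5))  # survives the plateau
```

With the smaller patience the run ends on the plateau, before the later improvement is ever seen; with the larger patience it continues through all epochs.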

For the second part, it all depends on the parameters you are using, such as the learning rate or the transfer mode you have set, so I think you need to specify those if you want help.

jarjuk commented 4 years ago

I trained VOC2012 in two sessions.

In both sessions the learning rate was left at its default value (1e-3).

The command for session 1:

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/yolov3-cnv.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --mode fit \
        --transfer darknet \
        --batch_size 16 \
        --epochs 10 \
        --weights_num_classes 80

And for session 2:

    python train.py \
        --dataset ./voc.data/voc2012_train.tfrecord \
        --val_dataset ./voc.data/voc2012_val.tfrecord \
        --weights ./voc.data/cont_20.tf \
        --classes ./data/voc2012.names \
        --num_classes 20 \
        --mode fit \
        --transfer fine_tune \
        --batch_size 16 \
        --epochs 10 \
        --weights_num_classes 20

The resulting model works, but not as well as the original darknet weights. See https://github.com/jarjuk/yolov3-tf2-training#detection-results for a comparison.

I am a novice in DNNs and would like to understand the best strategy for training a DNN, and yolov3-tf2 in particular.

Morgensol commented 4 years ago

I notice a couple of things: