wizyoung / YOLOv3_TensorFlow

Complete YOLO v3 TensorFlow implementation. Supports training on your own dataset.
MIT License

Can you give an example of the training config for the two-stage or the one-stage training strategy? #113

Open liminghuiv opened 5 years ago

liminghuiv commented 5 years ago

Hi,

Can you give an example of the training config for the following training strategies? (1) Applying the two-stage training strategy or the one-stage training strategy:

Two-stage training:

First stage: Restore the darknet53_body weights from the COCO checkpoint and train the yolov3_head with a big learning rate like 1e-3 until the loss drops to a low level.

Second stage: Restore the weights from the first stage, then train the whole model with a small learning rate like 1e-4 or smaller. At this stage, remember to restore the optimizer parameters if you use an optimizer like Adam.

One-stage training:

Just restore the whole weight file except the last three convolution layers (Conv_6, Conv_14, Conv_22). In this case, be careful about possible NaN loss values.
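(For context, here is a minimal sketch of what such args.py settings might look like. The parameter names restore_include, restore_exclude, update_part, and learning_rate_init are assumptions based on this repo's args.py and the misc/experiments_on_voc examples; verify the exact names and paths against your local copy.)

```python
# Hedged sketch of args.py settings for the strategies described above.
# All names and paths below are assumptions; check your local args.py.

### Two-stage training, first stage: restore only the backbone from the
### converted COCO checkpoint and train the head with a big learning rate.
restore_path = './data/darknet_weights/yolov3.ckpt'
restore_include = ['yolov3/darknet53_body']
restore_exclude = None
update_part = ['yolov3/yolov3_head']  # freeze darknet53_body
learning_rate_init = 1e-3

### Two-stage training, second stage: restore the stage-1 checkpoint
### (including optimizer slots if you use Adam) and train everything.
# restore_path = './checkpoint/model-stage1'   # hypothetical path
# restore_include = None   # restore all variables
# restore_exclude = None
# update_part = None       # update all variables
# learning_rate_init = 1e-4

### One-stage training: restore everything except the last three conv
### layers, whose channel count depends on your number of classes.
# restore_include = None
# restore_exclude = ['yolov3/yolov3_head/Conv_6',
#                    'yolov3/yolov3_head/Conv_14',
#                    'yolov3/yolov3_head/Conv_22']
# update_part = None
# learning_rate_init = 1e-4
```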

Thanks and best regards,

Liming

liminghuiv commented 5 years ago

I ran your sample VOC configuration; this is what I got:

======> Epoch: 99, global_step: 275899.0, lr: 0.0001 <======
EVAL: Class 0: Recall: 0.9509, Precision: 0.3687, AP: 0.9281
EVAL: Class 1: Recall: 0.9436, Precision: 0.4291, AP: 0.9059
EVAL: Class 2: Recall: 0.9129, Precision: 0.3833, AP: 0.8310
EVAL: Class 3: Recall: 0.8593, Precision: 0.2251, AP: 0.7126
EVAL: Class 4: Recall: 0.8955, Precision: 0.1733, AP: 0.7592
EVAL: Class 5: Recall: 0.9671, Precision: 0.3679, AP: 0.9396
EVAL: Class 6: Recall: 0.9692, Precision: 0.3767, AP: 0.9116
EVAL: Class 7: Recall: 0.9385, Precision: 0.4571, AP: 0.9120
EVAL: Class 8: Recall: 0.8889, Precision: 0.2113, AP: 0.7045
EVAL: Class 9: Recall: 0.9221, Precision: 0.3577, AP: 0.8469
EVAL: Class 10: Recall: 0.9417, Precision: 0.2172, AP: 0.7786
EVAL: Class 11: Recall: 0.9571, Precision: 0.4167, AP: 0.9119
EVAL: Class 12: Recall: 0.9310, Precision: 0.3608, AP: 0.9061
EVAL: Class 13: Recall: 0.9385, Precision: 0.3661, AP: 0.9073
EVAL: Class 14: Recall: 0.9302, Precision: 0.4981, AP: 0.8813
EVAL: Class 15: Recall: 0.7792, Precision: 0.1996, AP: 0.5394
EVAL: Class 16: Recall: 0.9215, Precision: 0.2217, AP: 0.8436
EVAL: Class 17: Recall: 0.9540, Precision: 0.2495, AP: 0.8292
EVAL: Class 18: Recall: 0.9362, Precision: 0.3900, AP: 0.8900
EVAL: Class 19: Recall: 0.8961, Precision: 0.2848, AP: 0.7887
EVAL: Recall: 0.9246, Precision: 0.3495, mAP: 0.8364
EVAL: loss: total: 4.76, xy: 0.33, wh: 0.16, conf: 3.39, class: 0.88

Is the above result reasonable? It differs from what you reported: "I got a 87.54% test mAP (not using the 07 metric)."

wizyoung commented 5 years ago

Results at epoch 99 are kind of overfitting. I got 87.54% mAP after 36 epochs. Here are my training logs for your reference: training.log

liminghuiv commented 5 years ago

Thanks for the quick reply. Please see the attachments for my train.py, args.py (with a .txt extension) and progress.log. Did I do anything wrong? progress.log

args_voc.txt train_voc.txt

lovepan1 commented 5 years ago

======> Epoch: 8, global_step: 33209.0, lr: 0.0001 <======
EVAL: Class 0: Recall: 0.9164, Precision: 0.0949, AP: 0.8745
EVAL: Class 1: Recall: 0.8766, Precision: 0.0888, AP: 0.7734
EVAL: Class 2: Recall: 0.8802, Precision: 0.0969, AP: 0.7795
EVAL: Class 3: Recall: 0.8397, Precision: 0.0487, AP: 0.6078
EVAL: Class 4: Recall: 0.8143, Precision: 0.0601, AP: 0.6424
EVAL: Class 5: Recall: 0.9252, Precision: 0.0613, AP: 0.8753
EVAL: Class 6: Recall: 0.9448, Precision: 0.1212, AP: 0.8627
EVAL: Class 7: Recall: 0.9378, Precision: 0.1001, AP: 0.9063
EVAL: Class 8: Recall: 0.8443, Precision: 0.1052, AP: 0.6356
EVAL: Class 9: Recall: 0.9331, Precision: 0.0577, AP: 0.8472
EVAL: Class 10: Recall: 0.8763, Precision: 0.0441, AP: 0.6926
EVAL: Class 11: Recall: 0.9528, Precision: 0.1576, AP: 0.8723
EVAL: Class 12: Recall: 0.9468, Precision: 0.0981, AP: 0.8533
EVAL: Class 13: Recall: 0.9079, Precision: 0.0610, AP: 0.8196
EVAL: Class 14: Recall: 0.9283, Precision: 0.1743, AP: 0.8330
EVAL: Class 15: Recall: 0.8159, Precision: 0.0503, AP: 0.5252
EVAL: Class 16: Recall: 0.8907, Precision: 0.0462, AP: 0.7814
EVAL: Class 17: Recall: 0.9268, Precision: 0.1195, AP: 0.7236
EVAL: Class 18: Recall: 0.9570, Precision: 0.0689, AP: 0.8745
EVAL: Class 19: Recall: 0.9446, Precision: 0.0538, AP: 0.8228
EVAL: Recall: 0.9073, Precision: 0.0978, mAP: 0.7801
EVAL: loss: total: 6.42, xy: 0.54, wh: 0.34, conf: 4.12, class: 1.42

This is my best mAP; I also ran into this behavior.

lovepan1 commented 5 years ago

First stage: I restore the YOLOv3 darknet weights and update yolov3_head. Second stage: I update darknet53 and yolov3_head. This is my training process.

liminghuiv commented 5 years ago

Thanks. Can you share the args.py files for your two stages?

lovepan1 commented 5 years ago

This is my args.py. First stage: use the darknet weights, restore darknet53, update yolov3_head. Second stage: use the weights trained in the first stage, restore darknet53 and yolov3_head, update darknet53 and yolov3_head. first_stage.txt second_stage.txt
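If it helps debugging, a quick way to confirm which variables a checkpoint actually contains before setting the restore lists is the standard TF 1.x checkpoint reader (the checkpoint path here is just an example):

```python
# List all variables stored in a checkpoint, so restore_include /
# restore_exclude can be matched against real variable names.
import tensorflow as tf

# Example path; point this at the darknet conversion or a stage-1 checkpoint.
ckpt = './data/darknet_weights/yolov3.ckpt'
for name, shape in tf.train.list_variables(ckpt):
    print(name, shape)
```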

liminghuiv commented 5 years ago

Hi, @wizyoung. Can you please review the one-stage and two-stage args.py files and give us some suggestions? Thanks.

wizyoung commented 5 years ago

@liminghuiv Are your training and test txt files correct? Here are my txt files: train.txt val.txt

I hope you make the effort to understand the YOLO v3 model and its parameters, and finetune the model yourself.

liminghuiv commented 5 years ago

@wizyoung, I used your misc/experiments_on_voc script and data. The training/val txt files are exactly the same. Thanks a lot.

liminghuiv commented 5 years ago

> This is my args.py. First stage: use the darknet weights, restore darknet53, update yolov3_head. Second stage: use the weights trained in the first stage, restore darknet53 and yolov3_head, update darknet53 and yolov3_head. first_stage.txt second_stage.txt

Hi @lovepan1, according to the README, it seems that you did not use a higher learning rate (e.g. 1e-3) in the first stage and a lower learning rate (<1e-4) in the second stage?

lovepan1 commented 5 years ago

> This is my args.py. First stage: use the darknet weights, restore darknet53, update yolov3_head. Second stage: use the weights trained in the first stage, restore darknet53 and yolov3_head, update darknet53 and yolov3_head. first_stage.txt second_stage.txt

> Hi @lovepan1, according to the README, it seems that you did not use a higher learning rate (e.g. 1e-3) in the first stage and a lower learning rate (<1e-4) in the second stage?

OK, I will use the appropriate learning rates to train my model. Thanks a lot.

liminghuiv commented 5 years ago

@lovepan1, hope it works. Can you share your results when the run finishes?

lovepan1 commented 5 years ago

> @lovepan1, hope it works. Can you share your results when the run finishes?

OK, this weekend I will retrain my model with the appropriate learning rates. Hope it works. Thank you.

zyc4me commented 4 years ago

@liminghuiv @wizyoung @lovepan1 Hi guys, I have met the same problem. I used misc/experiments_on_voc/args_voc.py and did not use two-stage training, just one stage. My log is similar to @liminghuiv's: after several epochs, the training conf_loss and class_loss are very small, like 0.02 and 0.35, but very different from @wizyoung's (1.5x, 2.2x...). Can you help me find the problem? @wizyoung

mew124 commented 4 years ago

@zyc4me Did you find a solution? I also used one-stage training and met the same problem.

bujianyiwang commented 4 years ago

I want to use YOLOv3 to count people in real time from an RTSP stream at a suitable interval. Does anyone have a working Python script?
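Not a complete answer, but a minimal skeleton for sampling frames from an RTSP stream at a fixed interval with OpenCV; detect_persons() is a hypothetical placeholder you would wire up to this repo's detection code (e.g. adapted from test_single_image.py):

```python
# Minimal RTSP frame-sampling loop. detect_persons() is a hypothetical
# placeholder for a YOLOv3 inference call returning person boxes.
import time
import cv2

def detect_persons(frame):
    """Hypothetical hook: run YOLOv3 on a frame, return person boxes."""
    raise NotImplementedError

cap = cv2.VideoCapture('rtsp://user:pass@camera-ip/stream')  # example URL
interval = 1.0   # seconds between detections
last = 0.0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    now = time.time()
    if now - last >= interval:
        last = now
        boxes = detect_persons(frame)
        print('persons in view:', len(boxes))
cap.release()
```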

aryuCoding commented 4 years ago

> Results at epoch 99 are kind of overfitting. I got 87.54% mAP after 36 epochs. Here are my training logs for your reference: training.log

Mon, 01 Jul 2019 08:50:21 INFO Epoch: 50, global_step: 140400 | loss: total: 4.13, xy: 0.32, wh: 0.23, conf: 1.48, class: 2.10 | Last batch: rec: 0.937, prec: 0.007 | lr: 0.0001
Mon, 01 Jul 2019 08:50:56 INFO Epoch: 50, global_step: 140500 | loss: total: 4.12, xy: 0.32, wh: 0.23, conf: 1.48, class: 2.10 | Last batch: rec: 0.857, prec: 0.014 | lr: 0.0001
Mon, 01 Jul 2019 08:51:39 INFO Epoch: 50, global_step: 140600 | loss: total: 4.14, xy: 0.32, wh: 0.23, conf: 1.48, class: 2.11 | Last batch: rec: 0.786, prec: 0.014 | lr: 0.0001
Mon, 01 Jul 2019 08:52:30 INFO Epoch: 50, global_step: 140700 | loss: total: 4.15, xy: 0.32, wh: 0.23, conf: 1.49, class: 2.11 | Last batch: rec: 0.867, prec: 0.012 | lr: 0.0001
Mon, 01 Jul 2019 08:56:02 INFO ======> Epoch: 50, global_step: 140708.0, lr: 0.0001 <======

According to your training logs, recall is very high but precision is low. Is that normal?