zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

End of sequence #138

Open paapu88 opened 4 years ago

paapu88 commented 4 years ago

When following https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md

after

python3 train.py --dataset ./data/voc2012_train.tfrecord --val_dataset ./data/voc2012_val.tfrecord --classes ./data/voc2012.names --num_classes 20 --mode fit --transfer darknet --batch_size 2 --epochs 10 --weights ./checkpoints/yolov3.tf --weights_num_classes 80

I get:

    yolo_output_1_loss: 12.8483 - yolo_output_2_loss: 28.8119
    2019-12-29 18:46:01.303069: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]

This problem is discussed here: https://github.com/tensorflow/tensorflow/issues/31509

Any suggestions? I have tried tensorflow-gpu==2.0.0 and tensorflow-gpu==2.1.0rc1 (installed with pip3 install --user).
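For context (my own note, not confirmed in the linked issue): this warning usually just means the tf.data input iterator ran out of batches before Keras expected the epoch to end. The usual remedy in tf.data is to call `dataset.repeat()` so the iterator never runs dry. A minimal pure-Python sketch of the same idea, with `itertools.cycle` standing in for `repeat()`:

```python
from itertools import cycle, islice

# A tiny "dataset" of 5 batches; suppose the training loop expects 8 steps per epoch.
batches = [f"batch{i}" for i in range(5)]

# Plain iteration exhausts after 5 steps -> analogous to "End of sequence".
consumed = list(iter(batches))
assert len(consumed) == 5

# With cycling (tf.data's dataset.repeat()), the iterator wraps around instead:
steps_per_epoch = 8
epoch = list(islice(cycle(batches), steps_per_epoch))
print(epoch)  # 8 batches, wrapping back to batch0 after batch4
```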

paapu88 commented 4 years ago

The same problem also occurs when using conda (which has tensorflow-gpu==2.1.0rc1). Maybe this has something to do with limited GPU memory?
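If limited GPU memory really is the culprit, one thing worth trying (my own suggestion, not from this thread) is letting TensorFlow allocate GPU memory incrementally instead of reserving it all at start-up, e.g. near the top of train.py:

```python
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front.
# This sometimes avoids out-of-memory aborts on small GPUs.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```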

paapu88 commented 4 years ago

OK,

Training from random weights (NOT RECOMMENDED)

seems to work, so I'm happy with that.

krxat commented 4 years ago

@paapu88 Hi, how did you solve the problem??

paapu88 commented 4 years ago

It crashes every now and then. I restart from the checkpoint with the lowest validation loss.

paapu88 commented 4 years ago

And: to load the old weights, one must add to train.py.

yolov3-tf2/train.py has been edited to:

    # Configure the model for transfer learning
    if FLAGS.transfer == 'none':
        try:
            model.load_weights(FLAGS.weights)
            print("LOADING OLD WEIGHTS FROM:", FLAGS.weights)
        except Exception:
            print("no weights loaded, starting from scratch")

I restart with python3 train.py --dataset ./data/hurricane_train.tfrecord --val_dataset ./data/hurricane_test.tfrecord --classes ./data/hurricane7.names --num_classes 1 --mode fit --transfer none --batch_size 1 --epochs 20 --size 416 --weights ./checkpoints/yolov3_train_1.tf

kindasweetgal commented 4 years ago

This is my first contact with object detection, and following your method successfully solved this problem. But when does the training end?

paapu88 commented 4 years ago

I just let it run and take the weights with the lowest validation error. This is not the most elegant solution, but it did work for me.
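A minimal sketch of that "keep the lowest validation loss" bookkeeping (my own illustration, not code from the repo; in Keras, `tf.keras.callbacks.ModelCheckpoint(monitor='val_loss', save_best_only=True)` automates this):

```python
import math

def best_checkpoint(val_losses):
    """Return (epoch, loss) of the checkpoint with the lowest validation loss."""
    best_epoch, best_loss = -1, math.inf
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss

# Example: training crashed after epoch 6; epoch 4 had the lowest val loss,
# so that is the checkpoint to restart from.
losses = [30.2, 21.7, 18.9, 15.1, 16.4, 17.0]
print(best_checkpoint(losses))  # -> (4, 15.1)
```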