Hsengiv2000 opened this issue 4 years ago
That can be several things, but it does sound like your box labels might be larger than 1 or smaller than 0, which is a problem I have run into a couple of times.
It could also be something with the number of classes. You say you are using a single class? Have you followed the transfer-learning guide so that you reduce the 80 classes the standard weights have (if you are not training from scratch) down to 1?
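For that out-of-range check, here is a minimal sketch, assuming the labels have already been parsed out of the TFRecord into normalized `(xmin, ymin, xmax, ymax)` tuples (the helper name is mine, not from the repo):

```python
def find_bad_boxes(boxes):
    """Return indices of boxes whose normalized coords fall outside [0, 1]
    or whose min/max are swapped."""
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if not (0.0 <= xmin <= xmax <= 1.0 and 0.0 <= ymin <= ymax <= 1.0):
            bad.append(i)
    return bad

boxes = [(0.10, 0.20, 0.50, 0.60),   # fine
         (0.30, -0.01, 0.70, 0.90),  # ymin below 0
         (0.80, 0.10, 1.02, 0.40)]   # xmax above 1
print(find_bad_boxes(boxes))  # -> [1, 2]
```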
Are there more steps than just changing `num_classes` to 1? If there are, could you send me a link? Thank you so much.
Yes, if you are using pretrained weights you also need to set `weights_num_classes` to 80. Here is a link: https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md (it's under "Transfer Learning").
What are the flags you are using right now?
I used the exact same command-line arguments as given in the doc, except that I changed `num_classes` to 1:
```
!python train.py \
  --dataset ./tfrecord/train.tfrecord \
  --val_dataset ./tfrecord/valid.tfrecord \
  --classes ./data/sat.names \
  --num_classes 1 \
  --mode fit --transfer darknet \
  --batch_size 16 \
  --epochs 3 \
  --weights ./checkpoints/yolov3.tf \
  --weights_num_classes 80
```
The only other change I made was in model.py, where I set the `yolo_max_boxes` flag to 4000:

```python
flags.DEFINE_integer('yolo_max_boxes', 4000, 'maximum number of boxes per image')
flags.DEFINE_float('yolo_iou_threshold', 0.5, 'iou threshold')
flags.DEFINE_float('yolo_score_threshold', 0.5, 'score threshold')

yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)],
                        np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

yolo_tiny_anchors = np.array([(10, 14), (23, 27), (37, 58), (81, 82),
                              (135, 169), (344, 319)], np.float32) / 416
yolo_tiny_anchor_masks = np.array([[3, 4, 5], [0, 1, 2]])
```
`--classes ./data/sat.names`
What is the content of your sat.names?
Just "car". It works fine with visualize_detection, so I assumed sat.names is fine.
Could you post a sample of one of your labels? Perhaps that could give me some insight. If I were you, I'd check all your labels to be 100% certain that under no circumstance does any value go outside the 0-1 range.
By labels, if you are referring to the annotation file: it is a .tfrecord file. I parsed and viewed the records; here is an example: https://drive.google.com/file/d/1sVfWMivktowMmDM91lsj4395rC7Hh8jw/view?usp=sharing
You need to scroll down a fair way to view the coords/labels etc.; the top of the txt file is just byte values.
Not that it's an issue, but it does seem from this label file that almost all of your labels are in the bottom-left corner of your image.
I don't see anything inherently wrong with the labels. What you can do is run the detect script in a loop and see if it crashes on one of the labels. The easiest way of checking, though, is to go to the source label files (usually XML files) and loop through all of them, running checks for values smaller than 0 or larger than the image size.
If this is not the problem, then I'm not sure what it could be.
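A rough sketch of that XML loop, assuming Pascal VOC-style annotations (the sample annotation and the helper name here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical 600x600 annotation with two deliberately broken boxes.
VOC_SAMPLE = """<annotation>
  <size><width>600</width><height>600</height></size>
  <object><name>car</name>
    <bndbox><xmin>-5</xmin><ymin>10</ymin><xmax>50</xmax><ymax>40</ymax></bndbox>
  </object>
  <object><name>car</name>
    <bndbox><xmin>100</xmin><ymin>100</ymin><xmax>200</xmax><ymax>650</ymax></bndbox>
  </object>
</annotation>"""

def check_annotation(xml_text):
    """Return indices of objects whose boxes fall outside the image
    or have min >= max."""
    root = ET.fromstring(xml_text)
    w = int(root.find("size/width").text)
    h = int(root.find("size/height").text)
    problems = []
    for i, obj in enumerate(root.iter("object")):
        bb = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (float(bb.find(t).text)
                                  for t in ("xmin", "ymin", "xmax", "ymax"))
        if xmin < 0 or ymin < 0 or xmax > w or ymax > h \
                or xmin >= xmax or ymin >= ymax:
            problems.append(i)
    return problems

print(check_annotation(VOC_SAMPLE))  # -> [0, 1]
```

Running this over every XML file in the dataset (e.g. with `glob`) should flag the offending images quickly.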
I went through all the images, both training and validation, and all worked perfectly. Do you think it's to do with me changing `max_boxes` to 4000?
Okay, I fixed it by dividing my 3000x3000 images into multiple smaller 300x300 images.
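For anyone wanting to do the same, here is a minimal sketch of that tiling step. The helper names are mine, and the actual image cropping (e.g. with PIL or OpenCV) is left out; this only computes the tile windows and remaps boxes into them:

```python
def tile_windows(img_w, img_h, tile):
    """Yield (x0, y0, x1, y1) pixel windows covering the image in
    tile-sized chunks (edge tiles may be smaller)."""
    for y0 in range(0, img_h, tile):
        for x0 in range(0, img_w, tile):
            yield (x0, y0, min(x0 + tile, img_w), min(y0 + tile, img_h))

def box_in_window(box, win):
    """Translate an absolute-pixel box into window-local coordinates,
    clipping it to the window; return None if it falls entirely outside."""
    bx0, by0, bx1, by1 = box
    wx0, wy0, wx1, wy1 = win
    x0, y0 = max(bx0, wx0), max(by0, wy0)
    x1, y1 = min(bx1, wx1), min(by1, wy1)
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0 - wx0, y0 - wy0, x1 - wx0, y1 - wy0)

windows = list(tile_windows(3000, 3000, 300))
print(len(windows))  # -> 100 tiles for a 3000x3000 image
print(box_in_window((290, 10, 310, 30), (0, 0, 300, 300)))  # -> (290, 10, 300, 30)
```

Note that boxes straddling a tile boundary get clipped into both neighboring tiles, which is usually acceptable for tiny objects like these.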
Hi, I have a very similar problem. When I train with 60 training images and 15 validation images, it works, but when I train with 400 training images and 100 validation images I get roughly the same error as you do. The images are 600x600 pixels and get resized to 416.
Is this some kind of memory issue? Does anyone have a solution?
Never mind, some of my bounding box coordinates were outside the image. After fixing my annotations, everything is fine now.
I was working on the SVHN dataset and faced the same problem. When I looked at the annotations, I found that the bounding boxes do indeed stretch outside the image dimensions for some images.
Hi, I find this problem is attributable to the data. If anyone runs into this, check your annotations carefully: xmax must be bigger than xmin, ymax must be bigger than ymin, and xmax, ymax, xmin, ymin must all lie within the image.
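One way to repair such annotations, rather than just detecting them, is to clip each box to the image bounds and drop anything that becomes degenerate. A minimal sketch (the helper name is mine, and it assumes pixel-coordinate boxes):

```python
def clip_box(box, img_w, img_h):
    """Clip a (xmin, ymin, xmax, ymax) pixel box to the image bounds;
    return None if nothing of the box remains inside the image."""
    xmin, ymin, xmax, ymax = box
    xmin, ymin = max(0, xmin), max(0, ymin)
    xmax, ymax = min(img_w, xmax), min(img_h, ymax)
    if xmax <= xmin or ymax <= ymin:
        return None  # box lies entirely outside the image
    return (xmin, ymin, xmax, ymax)

print(clip_box((-10, 5, 610, 300), 600, 600))  # -> (0, 5, 600, 300)
print(clip_box((650, 10, 700, 50), 600, 600))  # -> None
```

Whether clipping or dropping is the right fix depends on the dataset; for boxes that barely overhang the edge, clipping usually preserves the object.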
I am currently training this on the xView dataset for just 1 class. Note that each image is around 3000x3000 pixels and each object in the image is tiny, around 10x10; some images even contain 3900 objects. The only change I made was in the model.py file, where I changed `max_boxes` to 4000 (instead of 100, as that kept giving a "padding can't be negative" error).
visualize_dataset works perfectly, which suggests my dataset is fine.
While training, I'm running into this error:

```
Invalid argument: indices[3824] = [10, 1, 52, 0] does not index into shape [16,52,52,3,6]
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 188, in main
    validation_data=val_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: [Derived] indices[3824] = [10, 1, 52, 0] does not index into shape [16,52,52,3,6]
	 [[{{node TensorScatterUpdate}}]]
	 [[PartitionedCall_2]]
	 [[IteratorGetNext]] [Op:__inference_train_function_44222]

Function call stack:
train_function -> train_function -> train_function -> train_function
```
```
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8
W0519 05:54:04.077612 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-8
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9
W0519 05:54:04.077891 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-9
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10
W0519 05:54:04.077988 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-10
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11
W0519 05:54:04.078065 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-11
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0519 05:54:04.078219 140431398807424 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
```
I don't know why it crashes on the first epoch, which does not even run to completion. Any answers would be helpful! Thanks.