zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

Invalid argument: indices[3824] = [10, 1, 52, 0] does not index into shape [16,52,52,3,6] while training on xView Dataset 1 class #274

Open Hsengiv2000 opened 4 years ago

Hsengiv2000 commented 4 years ago

I am currently training this on the xView dataset with just 1 class. Note that each image is around 3000x3000 pixels and each object is tiny, around 10x10; some images even contain 3900 objects. The only change I made was in model.py, where I set max_boxes to 4000 (instead of 100, which kept raising a "padding can't be negative" error).

The visualize_dataset script works perfectly, which suggests my dataset is fine.

While training, I'm running into this error:

```
Invalid argument: indices[3824] = [10, 1, 52, 0] does not index into shape [16,52,52,3,6]
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 188, in main
    validation_data=val_dataset)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: [Derived] indices[3824] = [10, 1, 52, 0] does not index into shape [16,52,52,3,6]
	 [[{{node TensorScatterUpdate}}]]
	 [[PartitionedCall_2]]
	 [[IteratorGetNext]] [Op:__inference_train_function_44222]

Function call stack:
train_function -> train_function -> train_function -> train_function
```

It also prints these checkpoint warnings:

```
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8
W0519 05:54:04.077612 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-8
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9
W0519 05:54:04.077891 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-9
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10
W0519 05:54:04.077988 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-10
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11
W0519 05:54:04.078065 140431398807424 util.py:144] Unresolved object in checkpoint: (root).layer-11
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0519 05:54:04.078219 140431398807424 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
```

I don't know why it crashes during the first epoch, which never even completes. Any answers would be helpful! Thanks.

Morgensol commented 4 years ago

That can be several things, but it really sounds like some of your box labels might be larger than 1 or smaller than 0, a problem I have run into a couple of times myself.

It could also be something with the number of classes: you say you are using a single class? Have you followed the transfer-learning guide, so that the 80 classes of the standard weights are reduced down to 1 (if you are not training from scratch)?
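For context on why out-of-range labels produce this exact error: the target-encoding step maps each normalized box center to a grid cell, and a 52x52 output grid only has valid indices 0..51. Here is a minimal sketch of that mapping; it is modeled on `transform_targets_for_output` in this repo's dataset.py, and the exact expression used there should be treated as an assumption:

```python
def grid_index(coord, grid_size=52):
    """Map a normalized [0, 1] coordinate to an integer grid cell index."""
    return int(coord * grid_size)

print(grid_index(0.999))  # 51 -- last valid cell in a 52x52 grid
print(grid_index(1.0))    # 52 -- out of range for shape [16,52,52,3,6]
```

A coordinate of exactly 1.0 (or anything above it) lands in cell 52, which matches the failing index `[10, 1, 52, 0]` in the error message.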

Hsengiv2000 commented 4 years ago

> That can be several things, but it really sounds like some of your box labels might be larger than 1 or smaller than 0, a problem I have run into a couple of times myself.
>
> It could also be something with the number of classes: you say you are using a single class? Have you followed the transfer-learning guide, so that the 80 classes of the standard weights are reduced down to 1 (if you are not training from scratch)?

Are there more steps than just changing "num_classes" to 1? If there are, could you send me a link? Thank you so much.

Morgensol commented 4 years ago

> Are there more steps than just changing "num_classes" to 1? If there are, could you send me a link? Thank you so much.

Yes, if you are starting from pretrained weights you need to set weights_num_classes to 80. Here is a link: https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md (it's under "Transfer Learning").

What are the flags that you are using right now?

Hsengiv2000 commented 4 years ago

> Are there more steps than just changing "num_classes" to 1? If there are, could you send me a link? Thank you so much.
>
> Yes, if you are starting from pretrained weights you need to set weights_num_classes to 80. Here is a link: https://github.com/zzh8829/yolov3-tf2/blob/master/docs/training_voc.md (it's under "Transfer Learning").
>
> What are the flags that you are using right now?

I have used the exact same command-line arguments as given in the doc, except I changed num_classes to 1:

```
!python train.py \
  --dataset ./tfrecord/train.tfrecord \
  --val_dataset ./tfrecord/valid.tfrecord \
  --classes ./data/sat.names \
  --num_classes 1 \
  --mode fit --transfer darknet \
  --batch_size 16 \
  --epochs 3 \
  --weights ./checkpoints/yolov3.tf \
  --weights_num_classes 80
```

The only other change I made was setting yolo_max_boxes to 4000 in model.py:

```python
flags.DEFINE_integer('yolo_max_boxes', 4000, 'maximum number of boxes per image')
flags.DEFINE_float('yolo_iou_threshold', 0.5, 'iou threshold')
flags.DEFINE_float('yolo_score_threshold', 0.5, 'score threshold')

yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)],
                        np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

yolo_tiny_anchors = np.array([(10, 14), (23, 27), (37, 58), (81, 82),
                              (135, 169), (344, 319)], np.float32) / 416
yolo_tiny_anchor_masks = np.array([[3, 4, 5], [0, 1, 2]])
```

Morgensol commented 4 years ago

> --classes ./data/sat.names

What is the content of your sat.names?

Hsengiv2000 commented 4 years ago

> --classes ./data/sat.names
>
> What is the content of your sat.names?

Just "car". It works fine with visualize_detection, so I assumed sat.names is fine.

Morgensol commented 4 years ago

Could you post a sample of one of your labels? Perhaps that could give me some insight. If I were you, I'd check all your labels to be 100% certain that under no circumstance do any of the values go outside the 0-1 range.
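A quick way to run that check, assuming each label is a normalized (xmin, ymin, xmax, ymax, class) tuple (the tuple layout here is an assumption about the parsed tfrecord, not taken from the repo); the upper bound is strict because a coordinate of exactly 1.0 maps to grid cell 52 in a 52x52 grid and triggers exactly this error:

```python
def find_bad_boxes(labels):
    """Return boxes whose normalized coords are inverted or outside [0, 1)."""
    bad = []
    for box in labels:
        xmin, ymin, xmax, ymax = box[:4]
        if not (0 <= xmin < xmax < 1 and 0 <= ymin < ymax < 1):
            bad.append(box)
    return bad

# The second box touches 1.0 exactly, so it is flagged.
print(find_bad_boxes([(0.1, 0.1, 0.3, 0.3, 0),
                      (0.5, 0.2, 1.0, 0.4, 0)]))  # [(0.5, 0.2, 1.0, 0.4, 0)]
```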

Hsengiv2000 commented 4 years ago

> Could you post a sample of one of your labels? Perhaps that could give me some insight. If I were you, I'd check all your labels to be 100% certain that under no circumstance do any of the values go outside the 0-1 range.

By labels, if you are referring to the annotation file, it is a .tfrecord file. I parsed and viewed the records; below is an example: https://drive.google.com/file/d/1sVfWMivktowMmDM91lsj4395rC7Hh8jw/view?usp=sharing

You need to scroll down a fair way to see the coords/labels etc.; the top of the txt file is just byte values.

Morgensol commented 4 years ago

Not that it's necessarily an issue, but it does seem in this label file that almost all of your labels are in the bottom-left corner of the image.

I don't see anything inherently wrong with the labels. What you can do is run the detect script in a loop and see if it crashes on one of the images. The easiest way to check, though, is to go back to the source label files (usually XML) and loop through all of them, checking for values smaller than 0 or larger than 1.

If this is not the problem, then I'm not sure what it could be.

Hsengiv2000 commented 4 years ago

> Not that it's necessarily an issue, but it does seem in this label file that almost all of your labels are in the bottom-left corner of the image.
>
> I don't see anything inherently wrong with the labels. What you can do is run the detect script in a loop and see if it crashes on one of the images. The easiest way to check, though, is to go back to the source label files (usually XML) and loop through all of them, checking for values smaller than 0 or larger than 1.
>
> If this is not the problem, then I'm not sure what it could be.

I went through all the images, both training and validation, and all worked perfectly. Do you think it has to do with me changing max_boxes to 4000?

Hsengiv2000 commented 4 years ago

Okay, I fixed it by dividing my 3000x3000 images into multiple smaller 300x300 images.
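A minimal sketch of that tiling, using NumPy slicing. The 300x300 tile size is from the comment above; clipping and re-normalizing the box labels per tile, which is also required, is omitted here:

```python
import numpy as np

def tile_image(img, tile=300):
    """Split an image into non-overlapping tile x tile patches."""
    h, w = img.shape[:2]
    return [img[y:y + tile, x:x + tile]
            for y in range(0, h, tile)
            for x in range(0, w, tile)]

patches = tile_image(np.zeros((3000, 3000, 3), dtype=np.uint8))
print(len(patches))  # 100 patches, each 300x300x3
```

A side benefit of tiling is that the 10x10 objects become proportionally much larger after the resize to 416, which also helps the anchor matching.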

NorbertDorbert commented 4 years ago

Hi, I have a very similar problem. When I train with 60 training images and 15 validation images it works, but when I train with 400 training images and 100 validation images I get roughly the same error as you do. The images are 600x600 pixels and get resized to 416.

Is this some kind of memory issue? Does anyone have a solution?

NorbertDorbert commented 4 years ago

> Hi, I have a very similar problem. When I train with 60 training images and 15 validation images it works, but when I train with 400 training images and 100 validation images I get roughly the same error as you do. The images are 600x600 pixels and get resized to 416.
>
> Is this some kind of memory issue? Does anyone have a solution?

Never mind, some of my bounding box coordinates were outside the image. After fixing my annotations, everything is fine now.

abhishek-verma-github commented 4 years ago

> Hi, I have a very similar problem. When I train with 60 training images and 15 validation images it works, but when I train with 400 training images and 100 validation images I get roughly the same error as you do. The images are 600x600 pixels and get resized to 416. Is this some kind of memory issue? Does anyone have a solution?
>
> Never mind, some of my bounding box coordinates were outside the image. After fixing my annotations, everything is fine now.

I was working on the SVHN dataset and faced the same problem. When I looked at the annotations, I found that the bounding boxes do indeed stretch beyond the image dimensions for some images.

Anhaoxu commented 4 years ago

Hi, I found this problem is attributable to the data. If anyone else hits it, check your annotations carefully: xmax must be bigger than xmin, ymax must be bigger than ymin, and xmax, ymax, xmin, ymin must all lie inside the image.
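Those conditions can be expressed as one check, assuming pixel-space annotations plus the image's width and height (the function name is illustrative, not from the repo):

```python
def annotation_ok(xmin, ymin, xmax, ymax, width, height):
    """True when the box is well-ordered and fully inside the image."""
    return 0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height

print(annotation_ok(10, 20, 50, 60, 416, 416))    # True
print(annotation_ok(10, 20, 10, 60, 416, 416))    # False: xmax not > xmin
print(annotation_ok(400, 20, 430, 60, 416, 416))  # False: box leaves the image
```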