zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

Problem during training: #208

Open Eve66666 opened 4 years ago

Eve66666 commented 4 years ago

Command: python train.py --dataset ./data/Face/face_train.tfrecord --val_dataset ./data/Face/face_val.tfrecord --classes ./data/Face/face_label.names --num_classes 2 --mode fit --transfer darknet --batch_size 2 --epochs 10 --weights ./checkpoints/yolov3.tf --weights_num_classes 80

I converted my dataset to tfrecord format with voc_2012.py, and visual_dataset.py can pick a random image from it and draw the targets accurately. But whenever I train, the following error always appears at step 420. Does anyone know where the problem is?

2020-03-17 22:47:23.177447: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2020-03-17 22:47:23.956383: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2020-03-17 22:47:23.990373: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cupti64_100.dll'; dlerror: cupti64_100.dll not found
2020-03-17 22:47:23.999044: W tensorflow/core/profiler/lib/profiler_session.cc:192] Encountered error while starting profiler: Unavailable: CUPTI error: CUPTI could not be loaded or symbol could not be found.
1/Unknown - 31s 31s/step - loss: 8142.8374 - yolo_output_0_loss: 372.0213 - yolo_output_1_loss: 1816.4917 - yolo_output_2_loss: 5943.7417
2020-03-17 22:47:25.453912: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 0 kernel records, 0 memcpy records.
2020-03-17 22:47:25.525874: E tensorflow/core/platform/default/device_tracer.cc:70] CUPTI error: CUPTI could not be loaded or symbol could not be found.
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.707839). Check your callbacks.
W0317 22:47:25.640211 22200 callbacks.py:244] Method (on_train_batch_end) is slow compared to the batch update (0.707839). Check your callbacks.
420/Unknown - 120s 287ms/step - loss: 279.3950 - yolo_output_0_loss: 11.9883 - yolo_output_1_loss: 55.4188 - yolo_output_2_loss: 201.0598
2020-03-17 22:48:53.685180: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Invalid argument: {{function_node __inference_Datasetmap_14255}} Paddings must be non-negative: 0 -128 [[{{node Pad}}]] [[IteratorGetNext]] [[Shape/_10]]
2020-03-17 22:48:53.685178: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Invalid argument: {{function_node __inference_Datasetmap_14255}} Paddings must be non-negative: 0 -128 [[{{node Pad}}]] [[IteratorGetNext]]
421/Unknown - 120s 286ms/step - loss: 279.3950 - yolo_output_0_loss: 11.9883 - yolo_output_1_loss: 55.4188 - yolo_output_2_loss: 201.0598
WARNING:tensorflow:Reduce LR on plateau conditioned on metric val_loss which is not available. Available metrics are: loss,yolo_output_0_loss,yolo_output_1_loss,yolo_output_2_loss,lr
W0317 22:48:53.721623 22200 callbacks.py:1824] Reduce LR on plateau conditioned on metric val_loss which is not available. Available metrics are: loss,yolo_output_0_loss,yolo_output_1_loss,yolo_output_2_loss,lr
WARNING:tensorflow:Early stopping conditioned on metric val_loss which is not available. Available metrics are: loss,yolo_output_0_loss,yolo_output_1_loss,yolo_output_2_loss,lr
W0317 22:48:53.722620 22200 callbacks.py:1250] Early stopping conditioned on metric val_loss which is not available. Available metrics are: loss,yolo_output_0_loss,yolo_output_1_loss,yolo_output_2_loss,lr

Epoch 00001: saving model to checkpoints/yolov3_train_1.tf
421/Unknown - 127s 301ms/step - loss: 279.3950 - yolo_output_0_loss: 11.9883 - yolo_output_1_loss: 55.4188 - yolo_output_2_loss: 201.0598
Traceback (most recent call last):
  File "train.py", line 193, in <module>
    app.run(main)
  File "E:\anaconda\envs\yolov3\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "E:\anaconda\envs\yolov3\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 188, in main
    validation_data=val_dataset)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 728, in fit
    use_multiprocessing=use_multiprocessing)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 324, in fit
    total_epochs=epochs)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 123, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 86, in execution_function
    distributed_function(input_fn))
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 487, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\function.py", line 511, in call
    ctx=ctx)
  File "E:\anaconda\envs\yolov3\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: {{function_node __inference_Datasetmap_14255}} Paddings must be non-negative: 0 -128 [[{{node Pad}}]] [[IteratorGetNext]] [[Shape/_10]]
  (1) Invalid argument: {{function_node __inference_Datasetmap_14255}} Paddings must be non-negative: 0 -128 [[{{node Pad}}]] [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_50530]

Function call stack: distributed_function -> distributed_function

WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8
W0317 22:49:04.145999 22200 util.py:144] Unresolved object in checkpoint: (root).layer-8
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9
W0317 22:49:04.148992 22200 util.py:144] Unresolved object in checkpoint: (root).layer-9
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10
W0317 22:49:04.148992 22200 util.py:144] Unresolved object in checkpoint: (root).layer-10
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11
W0317 22:49:04.149989 22200 util.py:144] Unresolved object in checkpoint: (root).layer-11
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8.arguments
W0317 22:49:04.149989 22200 util.py:144] Unresolved object in checkpoint: (root).layer-8.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8._variable_dict
W0317 22:49:04.149989 22200 util.py:144] Unresolved object in checkpoint: (root).layer-8._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8._trainable_weights
W0317 22:49:04.149989 22200 util.py:144] Unresolved object in checkpoint: (root).layer-8._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-8._non_trainable_weights
W0317 22:49:04.150986 22200 util.py:144] Unresolved object in checkpoint: (root).layer-8._non_trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9.arguments
W0317 22:49:04.150986 22200 util.py:144] Unresolved object in checkpoint: (root).layer-9.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9._variable_dict
W0317 22:49:04.150986 22200 util.py:144] Unresolved object in checkpoint: (root).layer-9._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9._trainable_weights
W0317 22:49:04.150986 22200 util.py:144] Unresolved object in checkpoint: (root).layer-9._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-9._non_trainable_weights
W0317 22:49:04.151983 22200 util.py:144] Unresolved object in checkpoint: (root).layer-9._non_trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10.arguments
W0317 22:49:04.151983 22200 util.py:144] Unresolved object in checkpoint: (root).layer-10.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10._variable_dict
W0317 22:49:04.151983 22200 util.py:144] Unresolved object in checkpoint: (root).layer-10._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10._trainable_weights
W0317 22:49:04.152985 22200 util.py:144] Unresolved object in checkpoint: (root).layer-10._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-10._non_trainable_weights
W0317 22:49:04.152985 22200 util.py:144] Unresolved object in checkpoint: (root).layer-10._non_trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11.arguments
W0317 22:49:04.152985 22200 util.py:144] Unresolved object in checkpoint: (root).layer-11.arguments
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11._variable_dict
W0317 22:49:04.153978 22200 util.py:144] Unresolved object in checkpoint: (root).layer-11._variable_dict
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11._trainable_weights
W0317 22:49:04.153978 22200 util.py:144] Unresolved object in checkpoint: (root).layer-11._trainable_weights
WARNING:tensorflow:Unresolved object in checkpoint: (root).layer-11._non_trainable_weights
W0317 22:49:04.153978 22200 util.py:144] Unresolved object in checkpoint: (root).layer-11._non_trainable_weights
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/alpha/guide/checkpoints#loading_mechanics for details.
W0317 22:49:04.154975 22200 util.py:152] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/alpha/guide/checkpoints#loading_mechanics for details.

chen-chien-lung commented 4 years ago

I'm not sure this is the cause, but check whether you divided xmin, xmax, ymin, and ymax by the width or height of the original image when you created the tfrecord file.
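
A minimal sketch of the check suggested above, in plain Python. The helper names `normalize_box` and `find_bad_boxes` are hypothetical and not part of yolov3-tf2, and the (xmin, ymin, xmax, ymax) tuple layout is an assumption made for illustration.

```python
def normalize_box(box, width, height):
    """Convert a pixel-space (xmin, ymin, xmax, ymax) box to normalized form."""
    xmin, ymin, xmax, ymax = box
    return (xmin / width, ymin / height, xmax / width, ymax / height)


def find_bad_boxes(boxes):
    """Return (index, reason) pairs for boxes that are not valid normalized boxes.

    Each box is assumed to be (xmin, ymin, xmax, ymax), already divided by
    the image width and height so that every value lies in [0, 1].
    """
    problems = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        if not all(0.0 <= v <= 1.0 for v in (xmin, ymin, xmax, ymax)):
            problems.append((i, "coordinate outside [0, 1]"))
        elif xmin >= xmax or ymin >= ymax:
            problems.append((i, "empty or inverted box"))
    return problems


if __name__ == "__main__":
    pixel_box = (50, 100, 150, 200)  # made-up example values
    print(find_bad_boxes([pixel_box]))                            # flagged: not normalized
    print(find_bad_boxes([normalize_box(pixel_box, 500, 400)]))   # no problems reported
```

Running a check like this over every annotation before writing the tfrecord file would catch boxes that were left in pixel units or have swapped corners.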

Eve66666 commented 4 years ago

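> I'm not sure this is the cause, but check whether you divided xmin, xmax, ymin, and ymax by the width or height of the original image when you created the tfrecord file.

Hello, I generated the tfrecord file with the voc2012.py script from the source code, and that script already divides xmin, xmax, ymin, and ymax by the corresponding width and height. I also added some code to confirm that they all lie between 0 and 1 and that height and width are both present, but the error above still occurs. Does checking the dataset involve anything else? When generating the tfrecord file, I see that it records a lot of data (for example difficult, format, filename, and so on). Can these fields be left out?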

chen-chien-lung commented 4 years ago

OK, I ran into the same error logs as you. You could also check whether you modified yolo_anchors and yolo_anchor_masks in models.py for your own dataset.
You don't need to provide all of the fields (difficult, format, filename, ...) when you create the tfrecord file. Look at IMAGE_FEATURE_MAP in datasets.py: many of its entries have been commented out.

gaofssvm commented 4 years ago

I ran into the same error. Have you solved it?

gaofssvm commented 4 years ago

@Eve66666, I ran into the same error. Have you solved it?

funkydude755 commented 4 years ago

I solved this by removing the smallest bounding boxes from my dataset.

cyy90 commented 4 years ago

> I solved this by removing the smallest bounding boxes from my dataset.

Thanks very much. Yes, I also worked around this problem by not appending the small boxes to the example in voc2012.py.

gaofssvm commented 4 years ago

@funkydude755, thanks. Do you know why?

funkydude755 commented 4 years ago

I believe it's an issue with box resizing, since the images are resized by default in this yolov3 version.

krxat commented 4 years ago

> I solved this by removing the smallest bounding boxes from my dataset.

What do you mean by the smallest bounding box in the dataset? Can you explain? I am facing the same problem.

cyy90 commented 4 years ago

Actually, I removed not only the smallest box but also every bounding box whose area is less than 9/10000 of the whole image. You can try this threshold.
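
The threshold above can be sketched as a filter over normalized boxes. This is only an illustration: `drop_small_boxes` is a hypothetical helper, not a function in voc2012.py, and it assumes each box is an (xmin, ymin, xmax, ymax) tuple already scaled to [0, 1], so that box area is directly a fraction of the image area.

```python
MIN_AREA_FRACTION = 9 / 10000  # threshold quoted in the comment above


def drop_small_boxes(boxes, min_area_fraction=MIN_AREA_FRACTION):
    """Keep only boxes whose area is at least min_area_fraction of the image.

    Boxes are assumed to be normalized (xmin, ymin, xmax, ymax) tuples.
    """
    kept = []
    for xmin, ymin, xmax, ymax in boxes:
        area = (xmax - xmin) * (ymax - ymin)
        if area >= min_area_fraction:
            kept.append((xmin, ymin, xmax, ymax))
    return kept


if __name__ == "__main__":
    boxes = [
        (0.0, 0.0, 0.5, 0.5),    # a quarter of the image: kept
        (0.2, 0.2, 0.21, 0.21),  # about 0.01% of the image: dropped
    ]
    print(drop_small_boxes(boxes))
```

For scale, 9/10000 of the image corresponds to a box roughly 3% of the image wide and 3% tall.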

funkydude755 commented 4 years ago

@cypherix the boxes with the smallest area.

krxat commented 4 years ago

@funkydude755 the box with the smallest area in the whole dataset? Do you know how that might affect this issue? Thanks

funkydude755 commented 4 years ago

@cypherix remove the smallest boxes (not a single box) until it works. I believe it's an issue with box and image resizing (which happens by default). I sorted my boxes by area and kept removing the smallest until it worked.
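
The "sort by area and remove the smallest" approach can be sketched as follows. One possible reading of the logged message "Paddings must be non-negative: 0 -128" is that the input pipeline pads each image's box list up to a fixed maximum, so an image with more boxes than that maximum yields a negative pad length. That reading is an assumption and is not confirmed in this thread; the cap of 100 below is likewise assumed, and `pad_rows_needed`, `box_area`, and `keep_largest_boxes` are hypothetical helpers. Under that assumption, removing the smallest boxes would help because it lowers the number of boxes per image.

```python
def pad_rows_needed(num_boxes, max_boxes=100):
    """Rows of padding needed to reach max_boxes (assumed cap).

    A negative result means the image has more boxes than the cap.
    """
    return max_boxes - num_boxes


def box_area(box):
    """Area of a normalized (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)


def keep_largest_boxes(boxes, max_boxes=100):
    """Keep at most max_boxes boxes, discarding the smallest by area first."""
    if len(boxes) <= max_boxes:
        return list(boxes)
    return sorted(boxes, key=box_area, reverse=True)[:max_boxes]


if __name__ == "__main__":
    # With an assumed cap of 100, 228 boxes would give the -128 seen in the log.
    print(pad_rows_needed(228))
    boxes = [(0.0, 0.0, k / 10, k / 10) for k in range(1, 6)]
    print(keep_largest_boxes(boxes, max_boxes=2))
```

If this reading is right, counting the boxes per image in the dataset and comparing the largest count against the pipeline's cap would be a quicker test than removing boxes by trial and error.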