zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

Small training examples cause Padding error #328

Closed FicekD closed 3 years ago

FicekD commented 3 years ago

I've come across many comments saying that small training examples cause a padding error (Paddings must be non-negative: 0 -1) during training (in the fit call). I ran into this issue myself, and removing a class that contained a lot of small objects solved it. Has anyone tested what the minimal example size is, or found out how to fix this issue?

Full error for reference:

Traceback (most recent call last):
  File "d:\ALL-IN\Programming\python\gestures\train.py", line 214, in <module>
    main()
  File "d:\ALL-IN\Programming\python\gestures\train.py", line 206, in main
    history = model.fit(train_data, epochs=EPOCHS, callbacks=callbacks, validation_data=val_data)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 632, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Users\dfice\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Paddings must be non-negative: 0 -1
     [[{{node Pad}}]]
     [[IteratorGetNext]]
     [[loss/yolo_output_2_loss/Shape_1/_16]]
  (1) Invalid argument:  Paddings must be non-negative: 0 -1
     [[{{node Pad}}]]
     [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_function_48556]

Function call stack:
distributed_function -> distributed_function
FicekD commented 3 years ago

Since I've seen multiple people run into this problem, I'll post my solution.

# dataset.py
def parse_tfrecord(tfrecord, class_table, size):
    # ...
    # If an example contains more than FLAGS.yolo_max_boxes boxes,
    # the padding amount below goes negative and tf.pad fails with
    # "Paddings must be non-negative".
    paddings = [[0, FLAGS.yolo_max_boxes - tf.shape(y_train)[0]], [0, 0]]
    y_train = tf.pad(y_train, paddings)

    return x_train, y_train

As far as I can tell, the only source of the padding error is the ground-truth label padding shown above: if an example contains more boxes than FLAGS.yolo_max_boxes, the padding amount becomes negative. In my case, I had replaced the FLAGS.yolo_max_boxes flag with a constant back when I used the network on data with few objects per image, forgot about it, and copied the code...
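
If you would rather not raise the flag, one possible guard (just a sketch based on the parse_tfrecord shown above, not a change that exists in the repo) is to truncate the label tensor to yolo_max_boxes before padding, so the padding amount can never go negative:

# sketch only: drop boxes beyond FLAGS.yolo_max_boxes before padding
y_train = y_train[:FLAGS.yolo_max_boxes]
paddings = [[0, FLAGS.yolo_max_boxes - tf.shape(y_train)[0]], [0, 0]]
y_train = tf.pad(y_train, paddings)

Keep in mind that truncation silently discards ground-truth boxes beyond the limit, so raising yolo_max_boxes is usually the better fix if your images really contain that many objects.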

I've seen some posts saying that removing small training examples fixed this issue, but that probably only worked because it reduced the number of objects in a single training example below the yolo_max_boxes limit.
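
If you are not sure how large yolo_max_boxes needs to be, a quick sketch like the one below can report the largest number of boxes in any example. The feature key 'image/object/bbox/xmin' matches the VOC-style TFRecords this repo generates; adjust it (and the file path, which is just an example here) for your own data:

import tensorflow as tf

def max_boxes_in_tfrecord(path):
    # Count boxes via one of the variable-length bbox coordinate features.
    feature_map = {'image/object/bbox/xmin': tf.io.VarLenFeature(tf.float32)}
    max_boxes = 0
    for record in tf.data.TFRecordDataset(path):
        example = tf.io.parse_single_example(record, feature_map)
        n = int(tf.shape(example['image/object/bbox/xmin'].values)[0])
        max_boxes = max(max_boxes, n)
    return max_boxes

# e.g. make sure --yolo_max_boxes >= max_boxes_in_tfrecord('./data/train.tfrecord')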

jiangxinufo commented 3 years ago

2021-04-24 14:19:34.278458: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Invalid argument: Paddings must be non-negative: 0 -12
     [[{{node Pad}}]]
     [[IteratorGetNext]]
2021-04-24 14:19:34.290370: I tensorflow/core/profiler/lib/profiler_session.cc:225] Profiler session started.
1/Unknown - 13s 13s/stepWARNING:tensorflow:Reduce LR on plateau conditioned on metric val_loss which is not available. Available metrics are: lr
W0424 14:19:34.282967 1632 callbacks.py:1934] Reduce LR on plateau conditioned on metric val_loss which is not available. Available metrics are: lr

WARNING:tensorflow:Early stopping conditioned on metric val_loss which is not available. Available metrics are:
W0424 14:19:34.282967 1632 callbacks.py:1286] Early stopping conditioned on metric val_loss which is not available. Available metrics are:

Epoch 00001: saving model to checkpoints/yolov3_train_1.tf
1/Unknown - 17s 17s/stepTraceback (most recent call last):
  File "train.py", line 196, in <module>
    app.run(main)