zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

differences vs. original paper #102

Open soldierofhell opened 5 years ago

soldierofhell commented 5 years ago

Hi @zzh8829, thanks for your work. I must admit I never dug into the original darknet implementation, but after going back to the original paper I noticed two inconsistencies (?):

  1. box loss: in your code the loss is applied to the x,y values after the sigmoid (screenshot omitted), whereas the paper defines the box loss on the raw t_x, t_y predictions (screenshot omitted).

  2. scales of anchors: in your code all anchors are assigned on the same 13x13 grid, but the model has 3 output scales. Shouldn't we use more granular grids for the smaller anchors (similar to RetinaNet)?


Regards,

zzh8829 commented 4 years ago

Hi, great to see you went deep into the paper.

  1. The implementation here is true to the paper author's original code. You can see it here: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L141 He applies a sigmoid to the x,y coordinates, and the loss function then uses the same post-sigmoid x,y: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L98

I guess you can interpret this as: x,y are the result of a logistic regression on the center coordinate offset. In section 4 of the paper, "Things We Tried That Didn't Work", he mentions that "Linear x, y predictions instead of logistic" decreased accuracy. That is the reason we apply the SSE loss to the logistic (sigmoid) outputs instead.
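For illustration, here is a minimal sketch of that idea (not the repo's exact code; `raw_xy`, `true_xy` and `obj_mask` are assumed names): the SSE box-centre loss is computed on the sigmoid of the raw t_x, t_y outputs rather than on the raw values themselves.

```python
import tensorflow as tf

def xy_loss_on_sigmoid(raw_xy, true_xy, obj_mask):
    """Sketch only. raw_xy: (..., 2) raw network outputs t_x, t_y;
    true_xy: ground-truth cell offsets in [0, 1]; obj_mask: 1 where an
    anchor is responsible for an object, else 0."""
    pred_xy = tf.sigmoid(raw_xy)                  # logistic regression on the offset
    sq_err = tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
    return tf.reduce_sum(obj_mask * sq_err)       # sum-squared error, objects only
```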

  2. There are indeed 3 different scales in my code. The grid sizes are determined automatically from each conv layer's output size. You can see it here: https://github.com/zzh8829/yolov3-tf2/blob/78ff113b5581937c61772f42378bb9caab14f13c/yolov3_tf2/dataset.py#L68 Each of the 3 scales in fact has a different grid size.

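For a 416x416 input the relationship looks roughly like this (an illustrative sketch with assumed constants, not the repo's code; the anchor masks follow the usual YOLOv3 convention of giving the largest anchors to the coarsest grid):

```python
IMAGE_SIZE = 416
STRIDES = (32, 16, 8)                        # downsampling factor of each output layer
ANCHOR_MASKS = ((6, 7, 8), (3, 4, 5), (0, 1, 2))

for stride, mask in zip(STRIDES, ANCHOR_MASKS):
    grid_size = IMAGE_SIZE // stride         # 13, 26, 52
    print(f"grid {grid_size}x{grid_size} uses anchors {mask}")
```
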
nicolefinnie commented 4 years ago

@zzh8829 Speaking of which, two questions

Thanks! :)

zzh8829 commented 4 years ago

  1. Mathematically speaking they are the same formula, but as this post points out, https://datascience.stackexchange.com/a/41923, when the categories are mutually exclusive sparse cross-entropy is much more performant.
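As a rough illustration (toy values, not from the repo): the sparse variant takes integer class indices directly, while the regular variant needs one-hot labels, and both yield the same loss value.

```python
import tensorflow as tf

# Integer class indices -> sparse categorical cross-entropy (no one-hot needed)
y_true_sparse = tf.constant([2, 0])
# One-hot labels -> regular categorical cross-entropy
y_true_onehot = tf.one_hot(y_true_sparse, depth=3)

y_pred = tf.constant([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1]])

sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(y_true_sparse, y_pred)
dense_loss = tf.keras.losses.CategoricalCrossentropy()(y_true_onehot, y_pred)
# Same value either way; the sparse form just skips building the one-hot tensor.
```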

  2. The encoding order of targets doesn't really matter. The end result of training is the same, and the dataset is saved as a tfrecord, which is indexed by name rather than array order. The reason I didn't compute t_x, t_y in the dataset transform is that t_* depends on the network output size and anchor size, so it's more convenient to calculate them in the loss function.
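As a minimal sketch of what that in-loss computation could look like (assumed function and argument names, not the repo's exact code): once the grid size and anchors are known, the ground-truth boxes can be inverted into grid-relative offsets and log-scale sizes.

```python
import tensorflow as tf

def invert_targets(true_xy, true_wh, grid_size, anchors):
    """Sketch only. true_xy, true_wh: ground-truth centres/sizes normalized to
    [0, 1], shaped (..., num_anchors, 2); anchors: (num_anchors, 2) normalized."""
    # cell offset in [0, 1): subtract the integer cell coordinate
    scaled_xy = true_xy * tf.cast(grid_size, tf.float32)
    offset_xy = scaled_xy - tf.floor(scaled_xy)
    # t_w, t_h: log of the ratio between the box size and its anchor
    # (empty cells should be masked out before the log to avoid -inf)
    t_wh = tf.math.log(true_wh / anchors)
    return offset_xy, t_wh
```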