zzh8829 / yolov3-tf2

YoloV3 Implemented in Tensorflow 2.0
MIT License

differences vs. original paper #102

Open soldierofhell opened 5 years ago

soldierofhell commented 5 years ago

Hi @zzh8829, thanks for your work. I must admit I never dug into the original darknet implementation, but after going back to the original paper I noticed two inconsistencies (?):

  1. box loss: in your code the loss is applied to the x,y values after the sigmoid (screenshot omitted), whereas the paper defines the box loss on the raw t_x, t_y predictions (screenshot omitted).

  2. scales of anchors: in your code all anchors are assigned on the same 13x13 grid, but the model has 3 output scales. Shouldn't we use more granular grids for the smaller anchors (similar to RetinaNet)?


Regards,

zzh8829 commented 4 years ago

Hi, great to see you went deep into the paper.

  1. The implementation here is true to the paper author's original code. You can see it here: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L141 He applies a sigmoid to the x,y coordinates, and the loss function then uses the same post-sigmoid x,y: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L98

I guess you can interpret this as: x,y are the result of a logistic regression on the center coordinate offset. In section 4 of the paper, "Things We Tried That Didn't Work", he mentions that "Linear x, y predictions instead of logistic" decreased accuracy. That is the reason we apply the SSE loss to the logistic (sigmoid) outputs instead.
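For illustration, here is a minimal sketch of that idea (not the repo's exact code; `raw_xy`, `true_xy` and `obj_mask` are assumed names): the SSE box-centre loss is computed on the sigmoid of the raw t_x, t_y outputs rather than on the raw values themselves.

```python
import tensorflow as tf

def xy_loss_on_sigmoid(raw_xy, true_xy, obj_mask):
    """Sketch only. raw_xy: (..., 2) raw network outputs t_x, t_y;
    true_xy: ground-truth cell offsets in [0, 1]; obj_mask: 1 where an
    anchor is responsible for an object, else 0."""
    pred_xy = tf.sigmoid(raw_xy)                  # logistic regression on the offset
    sq_err = tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
    return tf.reduce_sum(obj_mask * sq_err)       # sum-squared error, objects only
```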

  2. There are indeed 3 different scales in my code. The grid sizes are determined automatically from each conv layer's output size. You can see it here: https://github.com/zzh8829/yolov3-tf2/blob/78ff113b5581937c61772f42378bb9caab14f13c/yolov3_tf2/dataset.py#L68 Each of the 3 scales in fact has a different grid size.

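For a 416x416 input the relationship looks roughly like this (an illustrative sketch with assumed constants, not the repo's code; the anchor masks follow the usual YOLOv3 convention of giving the largest anchors to the coarsest grid):

```python
IMAGE_SIZE = 416
STRIDES = (32, 16, 8)                        # downsampling factor of each output layer
ANCHOR_MASKS = ((6, 7, 8), (3, 4, 5), (0, 1, 2))

for stride, mask in zip(STRIDES, ANCHOR_MASKS):
    grid_size = IMAGE_SIZE // stride         # 13, 26, 52
    print(f"grid {grid_size}x{grid_size} uses anchors {mask}")
```
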
nicolefinnie commented 4 years ago

@zzh8829 Speaking of which, two questions

Thanks! :)

zzh8829 commented 4 years ago

  1. Mathematically speaking they are the same formula, but as this post points out, https://datascience.stackexchange.com/a/41923, when the categories are mutually exclusive sparse cross-entropy is much more performant.
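As a rough illustration (toy values, not from the repo): the sparse variant takes integer class indices directly, while the regular variant needs one-hot labels, and both yield the same loss value.

```python
import tensorflow as tf

# Integer class indices -> sparse categorical cross-entropy (no one-hot needed)
y_true_sparse = tf.constant([2, 0])
# One-hot labels -> regular categorical cross-entropy
y_true_onehot = tf.one_hot(y_true_sparse, depth=3)

y_pred = tf.constant([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1]])

sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(y_true_sparse, y_pred)
dense_loss = tf.keras.losses.CategoricalCrossentropy()(y_true_onehot, y_pred)
# Same value either way; the sparse form just skips building the one-hot tensor.
```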

  2. The encoding order of targets doesn't really matter. The end result of training is the same, and the dataset is saved as a tfrecord, which is indexed by name rather than array order. The reason I didn't compute t_x, t_y in the dataset transform is that t_* depends on the network output size and anchor size, so it's more convenient to calculate them in the loss function.
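As a minimal sketch of what that in-loss computation could look like (assumed function and argument names, not the repo's exact code): once the grid size and anchors are known, the ground-truth boxes can be inverted into grid-relative offsets and log-scale sizes.

```python
import tensorflow as tf

def invert_targets(true_xy, true_wh, grid_size, anchors):
    """Sketch only. true_xy, true_wh: ground-truth centres/sizes normalized to
    [0, 1], shaped (..., num_anchors, 2); anchors: (num_anchors, 2) normalized."""
    # cell offset in [0, 1): subtract the integer cell coordinate
    scaled_xy = true_xy * tf.cast(grid_size, tf.float32)
    offset_xy = scaled_xy - tf.floor(scaled_xy)
    # t_w, t_h: log of the ratio between the box size and its anchor
    # (empty cells should be masked out before the log to avoid -inf)
    t_wh = tf.math.log(true_wh / anchors)
    return offset_xy, t_wh
```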