soldierofhell opened 5 years ago
Hi, great to see you went deep into the paper.
I guess you can interpret this as x, y being the result of a logistic regression on the center-coordinate offset. In section 4 of the paper, "Things We Tried That Didn't Work", the author mentions that "Linear x, y predictions instead of logistic" decreased accuracy. That is the reason we use SSE loss on the logits instead.
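Roughly, the decode and loss look like this (a minimal sketch, assuming t_xy is the raw network output for one cell; xy_loss is a hypothetical helper, not this repo's actual function):

```python
import tensorflow as tf

# Minimal sketch: the network emits a raw t_x, t_y per cell; the paper
# decodes the center as b_x = sigmoid(t_x) + c_x, so x, y act like a
# logistic regression on the offset within the grid cell.
def xy_loss(t_xy, true_offset):
    # true_offset: ground-truth center offset inside the cell, in [0, 1]
    pred_offset = tf.sigmoid(t_xy)  # logistic activation, per the paper
    return tf.reduce_sum(tf.square(true_offset - pred_offset), axis=-1)  # SSE
```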
@zzh8829 Speaking of which, two questions:
Is there a reason why you encoded the target as a single class index instead of one-hot, which is the format of the network output? To save memory? It only matters when calculating the loss anyway, and either encoding seems workable (sparse_categorical_crossentropy() for a class index, categorical_crossentropy() for one-hot).
Also, I would encode the box target in the t_x, t_y, t_w, t_h format, as the paper suggests, the same as the network output, so there would be less code in the loss function. Is there a reason why you encoded the target in the original bbox format xmin, ymin, xmax, ymax and converted it in the loss function? That also differs from how the TensorFlow API orders boxes: ymin, xmin, ymax, xmax.
Thanks! :)
Mathematically speaking they are the same formula, but as this post points out (https://datascience.stackexchange.com/a/41923), when the categories are mutually exclusive, sparse cross-entropy is much more performant.
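A quick sketch of the equivalence (illustrative values, not from the repo):

```python
import tensorflow as tf

# With mutually exclusive classes, an integer label with
# sparse_categorical_crossentropy gives the same loss as a one-hot label
# with categorical_crossentropy, but skips materializing the one-hot tensor.
logits = tf.constant([[2.0, 0.5, -1.0]])
sparse_label = tf.constant([0])                   # single class index
onehot_label = tf.one_hot(sparse_label, depth=3)  # same label, one-hot

sparse = tf.keras.losses.sparse_categorical_crossentropy(
    sparse_label, logits, from_logits=True)
dense = tf.keras.losses.categorical_crossentropy(
    onehot_label, logits, from_logits=True)
# sparse and dense agree up to floating-point error
```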
The encoding order of targets doesn't really matter. The end result of training is the same, and the dataset is saved in tfrecord, which is indexed by name, not array order. The reason I didn't compute t_x, t_y in the dataset transform is that t_* depends on the network output size and the anchor sizes, so it's more convenient to calculate them in the loss function.
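For illustration, a hedged sketch of that conversion (to_txywh is a hypothetical helper, simplified; the real loss also handles masking, best-anchor matching, etc.):

```python
import tensorflow as tf

# Turn (xmin, ymin, xmax, ymax) targets, normalized to [0, 1], into the
# paper's t_x, t_y, t_w, t_h inside the loss, where grid_size and the
# per-scale anchors are known.
def to_txywh(true_box, anchors, grid_size):
    true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2  # box center
    true_wh = true_box[..., 2:4] - true_box[..., 0:2]        # width, height

    scaled = true_xy * tf.cast(grid_size, tf.float32)
    t_xy = scaled - tf.floor(scaled)       # center offset inside its cell

    t_wh = tf.math.log(true_wh / anchors)  # log-space size relative to anchor
    return t_xy, t_wh
```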
Hi @zzh8829, thanks for your work. I must admit I never dug into the original darknet implementation, but after going back to the original paper I noticed two possible inconsistencies:
box loss: the box-loss formula in your code seems to differ from the equation in the paper.
scales of anchors: in your code all anchors are applied on the same 13x13 grid, but the model has 3 output scales. Shouldn't the smaller anchors use the more granular scales (similar to RetinaNet)?
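For reference, the darknet cfg splits the nine anchors across the three scales, with the smallest anchors on the finest grid (standard YOLOv3 values, quoted here as a sketch rather than this repo's exact constants):

```python
# Standard YOLOv3 anchors (pixels at 416x416 input), three per output scale.
yolo_anchors = [(10, 13), (16, 30), (33, 23),       # finest grid, 52x52
                (30, 61), (62, 45), (59, 119),      # middle grid, 26x26
                (116, 90), (156, 198), (373, 326)]  # coarsest grid, 13x13
anchor_masks = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]    # anchor indices per scale
```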
Regards,