melgor closed this issue 7 years ago.
Hi, thank you for your interest in our work. I am happy to answer your questions. :)
A-Softmax normalizes the weights and zeros out the biases in the final FC layer, which makes the loss penalize only the angles. In contrast, L-Softmax does not necessarily normalize the weights or zero out the biases, so it does not necessarily penalize only the angles, although it did in the toy examples. The main difference is clearly described in the SphereFace paper.
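(For reference: with $\|W_j\| = 1$ and $b_j = 0$, the logit for class $j$ reduces to $\|x\|\cos\theta_j$, so the A-Softmax loss from the SphereFace paper depends on the features only through their norms and the angles:

$$ L_i = -\log \frac{e^{\|x_i\|\,\psi(\theta_{y_i,i})}}{e^{\|x_i\|\,\psi(\theta_{y_i,i})} + \sum_{j \neq y_i} e^{\|x_i\|\cos(\theta_{j,i})}} $$

where $\psi$ applies the angular margin $m$ only to the target class.)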
It is natural to preserve the $[0,\frac{\pi}{m}]$ part of the function $\cos(m\theta)$. So all we need to do is to design the $[\frac{\pi}{m},\pi]$ part. In fact, the design of this part is not very crucial, as long as it is monotonically decreasing.
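Concretely, the paper's choice is the piecewise function

$$ \psi(\theta) = (-1)^k \cos(m\theta) - 2k, \qquad \theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right], \quad k \in \{0, 1, \dots, m-1\}, $$

which agrees with $\cos(m\theta)$ on $[0, \frac{\pi}{m}]$ and then continues it as a continuous, monotonically decreasing curve down to $-(2m-1)$ at $\theta = \pi$.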
It is simply a decomposition of $\cos(m\theta)$. When $m$ changes, the decomposition changes too.
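The decompositions in question are the standard multiple-angle identities that express $\cos(m\theta)$ as a polynomial in $\cos\theta$, which is what the layer actually evaluates:

$$ \cos(2\theta) = 2\cos^2\theta - 1, \qquad \cos(3\theta) = 4\cos^3\theta - 3\cos\theta, \qquad \cos(4\theta) = 8\cos^4\theta - 8\cos^2\theta + 1. $$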
Of course there would be a different interpretation (in a somewhat matrix form). However, as you mentioned, this nonlinearity may be difficult to model.
I kind of agree. Softmax without biases may do the same job as center loss, although the back-prop dynamics may be different. Thus adding center loss may help a lot at the beginning, but will improve things less and less as training goes on. However, combining center loss and softmax loss still makes sense to me.
I have a question about your nice implementation of MarginInnerProductLayer. It is very efficient, much more so than directly using the formulas from the paper.
I almost understand the idea behind it, but I still cannot understand how you found the formulas for sign_1 and the others. It is a very interesting way of replacing any for/while loop for finding the value of k. Could you explain how you found such formulas, or maybe point out what field I should study to get the intuition behind them?
Could you also explain how you got the approximation for this equation?
Hi melgor. I am not sure I have understood exactly what you are asking. I guess you are confused by the implementation: why didn't we follow the equations in the paper exactly when implementing the layer? The answer is efficiency. It is an alternative implementation, and there is no approximation in our code. sign_1 and the others are intermediate variables, which are designed to avoid repeated computation. It may not be the optimal way, but it is a trade-off between speed and memory.
Sorry for missing your question @melgor. As ydwen mentioned, our implementation is efficient in the sense that we store some of the intermediate computation results for subsequent reuse (similar to the idea of dynamic programming). It basically trades memory for speed. Most importantly, this implementation is exactly equivalent to the original formulation in the paper (no approximation happens).
Thanks for the answer. I was just trying to derive your equations from the original ones in the paper and could not get exactly the same answers. (I'm doing it as an exercise, since your implementation is much faster than the naive one.)
@wy1iu @melgor Thanks for your discussion. The implementation of sign_3 and sign_4 (with m = 4) is impressive and elegant: it gets rid of computing theta via arccos and avoids repeated computation. How did you derive the formulas?

sign_3 = sign_0 * sign(2 * cos_theta_quadratic - 1)
sign_4 = 2 * sign_0 + sign_3 - 3

Is there any explanation of them?
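Since this question keeps coming up, here is a minimal NumPy sketch (a reconstruction for illustration, not the authors' code) of why the trick works for m = 4: on each interval $[\frac{k\pi}{4}, \frac{(k+1)\pi}{4}]$, sign_0 = sign(cos θ) identifies which half of $[0, \pi]$ the angle is in, and sign(2cos²θ − 1) = sign(cos 2θ) identifies the quarter within that half; their product sign_3 works out to $(-1)^k$ and sign_4 to $-2k$, so $\psi(\theta)$ can be evaluated from $\cos\theta$ alone, with no arccos and no loop over k:

```python
import numpy as np

def psi_m4(cos_theta):
    """Evaluate psi(theta) = (-1)^k * cos(4*theta) - 2k for
    theta in [k*pi/4, (k+1)*pi/4], given only cos(theta).
    A reconstruction of the sign_0/sign_3/sign_4 trick discussed above."""
    c2 = cos_theta ** 2
    cos_4theta = 8 * c2 ** 2 - 8 * c2 + 1     # cos(4t) = 8cos^4(t) - 8cos^2(t) + 1
    sign_0 = np.sign(cos_theta)               # +1 on [0, pi/2), -1 on (pi/2, pi]
    sign_3 = sign_0 * np.sign(2 * c2 - 1)     # sign(cos t) * sign(cos 2t) = (-1)^k
    sign_4 = 2 * sign_0 + sign_3 - 3          # evaluates to -2k on each interval
    return sign_3 * cos_4theta + sign_4

# Sanity check against the direct piecewise definition of psi.
theta = np.linspace(0.0, np.pi, 1001)
k = np.minimum(np.floor(theta / (np.pi / 4)), 3).astype(int)
direct = (-1.0) ** k * np.cos(4 * theta) - 2 * k
assert np.allclose(psi_m4(np.cos(theta)), direct)
```

Tabulating the four cases makes the constants transparent: for k = 0, 1, 2, 3 the pair (sign_0, sign(2cos²θ − 1)) is (+1, +1), (+1, −1), (−1, −1), (−1, +1), giving sign_3 = +1, −1, +1, −1 and sign_4 = 0, −2, −4, −6 respectively.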
Can someone please explain why the psi function has to be monotonically decreasing? @wy1iu @melgor
I am really grateful that @wy1iu released the code. You and @ydwen are really pushing face verification forward.
I have some questions regarding the paper and the code: