wy1iu / sphereface

Implementation of "SphereFace: Deep Hypersphere Embedding for Face Recognition" (CVPR'17).
MIT License

Question regarding paper #1

Closed · melgor closed this issue 7 years ago

melgor commented 7 years ago

I'm really grateful that @wy1iu released the code. You and @ydwen are really pushing face verification forward.

I have some questions regarding the paper and the code:

  1. What is the major change between L-Softmax and A-Softmax? From the equations, it looks like L-Softmax keeps the norms of the weights, while A-Softmax replaces the weights with their normalized versions, right? If this is true, was the main motivation Section 3.3 of "Large-Margin Softmax Loss for Convolutional Neural Networks"?
  2. Could you explain how you chose the function ψ (which replaces cos(θ))?
  3. In both papers you expand cos(mθ) as a polynomial in cos(θ) (Eq. 7 in the Large-Margin paper), right? What was the idea behind using a different degree for each margin value? Why not use the same degree for all margins?
  4. Here is my intuition about both papers: in effect, we just scale the output of the linear layer by a matrix of ones with different values (< 1) at the target classes. The two papers propose different ways of computing this scaling (with theoretical justification). I think it might be possible to write an implementation that just uses such a scale matrix; I need to think about it, as there are many non-linear operations involved.
  5. I was thinking about using center loss with cosine similarity, but then I realized that it is equivalent to a softmax layer without bias (and softmax also compares a feature to the other class centers, not only the target one, so it shapes the features even better). Do you agree with my interpretation?
wy1iu commented 7 years ago

Hi, thank you for your interest in our work. I am happy to answer your questions. :)

  1. A-Softmax normalizes the weights and zeros out the biases in the final FC layer, which makes the loss penalize only the angles. In contrast, L-Softmax does not necessarily normalize the weights or zero out the biases, so it does not necessarily penalize the angles, although it did in the toy examples. The main difference is described in detail in the SphereFace paper.

  2. It is natural to keep the $[0,\frac{\pi}{m}]$ part of the function $\cos(m\theta)$. So all we need to do is design the $[\frac{\pi}{m},\pi]$ part. In fact, the design of this part is not crucial, as long as it is monotonically decreasing (see the sketch after this list).

  3. It is simply a decomposition of $\cos(m\theta)$ into a polynomial in $\cos(\theta)$. When $m$ changes, the decomposition changes too.

  4. Of course, there can be a different interpretation (in a somewhat matrix form). However, as you mentioned, the nonlinearity may be difficult to model that way.

  5. I kind of agree. Softmax without biases may do the same job as center loss, although the back-prop dynamics may be different. Thus adding center loss may help a lot at the beginning, but it will improve things less and less as training goes on. However, combining center loss and softmax loss still makes sense to me.
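
For concreteness, here is a minimal NumPy sketch of the ideas in answers 1–3: weights normalized and biases zeroed so the logit depends only on the angle, and the piecewise function $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in [\frac{k\pi}{m}, \frac{(k+1)\pi}{m}]$ from the paper. The function names here are illustrative, not the repo's actual API:

```python
import numpy as np

def psi(theta, m=4):
    """SphereFace's piecewise target function:
    psi(theta) = (-1)^k * cos(m*theta) - 2k
    for theta in [k*pi/m, (k+1)*pi/m], k = 0, ..., m-1.
    It agrees with cos(m*theta) on [0, pi/m] and is
    monotonically decreasing on all of [0, pi]."""
    k = int(np.floor(theta * m / np.pi))
    return (-1) ** k * np.cos(m * theta) - 2 * k

def a_softmax_target_logit(x, W, target, m=4):
    """Angle-only logit for the target class: ||x|| * psi(theta).
    The weight column is normalized and there is no bias term, so
    only the angle between x and W[:, target] matters."""
    w = W[:, target] / np.linalg.norm(W[:, target])
    cos_theta = x @ w / np.linalg.norm(x)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return np.linalg.norm(x) * psi(theta, m)
```

This also answers the degree question in item 3 above: the decomposition is the multiple-angle identity, e.g. $\cos(4\theta) = 8\cos^4\theta - 8\cos^2\theta + 1$ for $m = 4$, and the degree of that polynomial is exactly $m$, so it necessarily changes with the margin.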

melgor commented 7 years ago

I have a question about your nice implementation of MarginInnerProductLayer. It is very efficient, much more so than a direct implementation of the formulas from the paper.

I almost understand the idea behind it, but I still cannot see how you found the formulas for sign_1 and the others. It is a very interesting way of replacing the for/while loop that would otherwise be needed to find the value of k. Could you explain how you found such formulas, or maybe point me to what field I should study to get an intuition for them?

melgor commented 7 years ago

Could you explain how you arrived at the approximation for this equation?

ydwen commented 7 years ago

Hi melgor, I am not sure I have understood exactly what you are asking. I guess you are confused by the implementation: why didn't we follow the equations in the paper exactly when implementing the layer? The answer is efficiency. It is an alternative implementation, and there is no approximation in our code. sign_1 and the others are intermediate variables, designed to avoid repeated computation. It may not be the optimal way, but it is a trade-off between speed and memory.

wy1iu commented 7 years ago

Sorry for missing your question @melgor. As ydwen mentioned, our implementation is efficient in the sense that we store some intermediate computation results for subsequent reuse (similar to the idea of dynamic programming). It basically trades memory for speed. Most importantly, this implementation is exactly equivalent to the original formulation in the paper (no approximation happens).
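
To illustrate the memory-for-speed idea (a toy sketch under my own naming, not the layer's actual code): the powers of cos θ can be computed once in the forward pass and reused in the backward pass instead of being recomputed.

```python
import numpy as np

def forward_m4(cos_theta):
    """cos(4*theta) = 8*cos^4 - 8*cos^2 + 1, caching the powers of
    cos(theta) so that backward_m4 can reuse them."""
    cos2 = cos_theta ** 2
    cos_m_theta = 8 * cos2 ** 2 - 8 * cos2 + 1
    cache = (cos_theta, cos2)  # stored once, reused later
    return cos_m_theta, cache

def backward_m4(grad_out, cache):
    """d cos(4t) / d cos(t) = 32*cos^3(t) - 16*cos(t), built
    entirely from the cached forward quantities."""
    cos_theta, cos2 = cache
    return grad_out * (32 * cos_theta * cos2 - 16 * cos_theta)
```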

melgor commented 7 years ago

Thanks for the answer. I was trying to derive your equations from the original ones in the paper and could not get exactly the same results. (I'm doing it as an exercise, since your implementation is much faster than a naive one.)

nyyznyyz1991 commented 6 years ago

@wy1iu @melgor Thanks for your discussion. The implementation of sign_3 and sign_4 (with m = 4) is impressive and elegant: it gets rid of computing theta via arccos and avoids repeated computation. How did you deduce the formulas?

    sign_3 = sign_0 * sign(2 * cos_theta_quadratic - 1)
    sign_4 = 2 * sign_0 + sign_3 - 3

Is there any explanation for them?
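
For anyone puzzling over the same formulas: one can check numerically that for m = 4 these sign variables reproduce $(-1)^k$ and $-2k$ on each segment $[\frac{k\pi}{4}, \frac{(k+1)\pi}{4}]$, without ever computing theta or k. Here is a standalone check (my own sketch, not code from the repo):

```python
import numpy as np

# Sample theta in (0, pi), away from the segment boundaries k*pi/4.
theta = np.linspace(0.01, np.pi - 0.01, 997)
cos_t = np.cos(theta)
cos_t2 = cos_t ** 2

# sign(cos(theta)) flips at pi/2, while
# sign(2*cos^2(theta) - 1) = sign(cos(2*theta)) flips at pi/4 and 3*pi/4.
sign_0 = np.sign(cos_t)
sign_3 = sign_0 * np.sign(2 * cos_t2 - 1)  # alternates +1,-1,+1,-1 = (-1)^k
sign_4 = 2 * sign_0 + sign_3 - 3           # enumerates  0,-2,-4,-6 = -2k

k = np.floor(theta * 4 / np.pi)
assert np.allclose(sign_3, (-1.0) ** k)
assert np.allclose(sign_4, -2.0 * k)

# Loop-free psi(theta) = (-1)^k * cos(4*theta) - 2k, using only cos(theta):
psi = sign_3 * (8 * cos_t2 ** 2 - 8 * cos_t2 + 1) + sign_4
assert np.allclose(psi, (-1.0) ** k * np.cos(4 * theta) - 2 * k)
```

In other words, the signs of cos θ and cos 2θ together encode which segment θ falls in, which is what removes the need for arccos or any loop over k.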

amirhfarzaneh commented 6 years ago

Can someone please explain why the psi function has to be monotonically decreasing? @wy1iu, @melgor