wy1iu / sphereface

Implementation for "SphereFace: Deep Hypersphere Embedding for Face Recognition" (CVPR'17).
MIT License

Is lambda equivalent to smaller m? #11

Closed: happynear closed this issue 7 years ago

happynear commented 7 years ago

From the form of the loss function, I think adding lambda can be seen as using a smaller m. I have plotted the curve for lambda=5, m=4 and found that it is approximately equivalent to m=1.5.

[image: plot of the curves]

Is my understanding correct?

chichan01 commented 7 years ago

Hi, could you tell me how you plotted the above curve?

happynear commented 7 years ago

Sure. I drew the figure using the following MATLAB code.

% theta in degrees
theta = 0:1:180;
figure(1);
% Plain softmax corresponds to m = 1: the target logit is simply cos(theta).
plot(theta, cosd(theta), 'LineWidth', 2);
hold on;
% Piecewise margin function of L-Softmax / A-Softmax:
% psi(theta) = (-1)^k * cos(m*theta) - 2k  for theta in [k*180/m, (k+1)*180/m].
Fai = zeros(4, length(theta));
for m = [2 4]
    for k = 0:m-1
        idx = theta >= k*180/m & theta <= (k+1)*180/m;
        Fai(m, idx) = (-1)^k * cosd(m*theta(idx)) - 2*k;
    end
    plot(theta, Fai(m,:), 'LineWidth', 2);
end
% Annealed version: lambda-weighted average of cos(theta) and psi(theta) with m = 4.
lambda = 5;
plot(theta, (cosd(theta)*lambda + Fai(4,:)) / (1 + lambda), 'LineWidth', 2);
hold off;
legend('softmax (m=1)', 'large margin softmax (m=2, \lambda=0)', ...
    'large margin softmax (m=4, \lambda=0)', ...
    ['large margin softmax (m=4, \lambda=' num2str(lambda) ')']);
wy1iu commented 7 years ago

Yes, you are right in some sense, although the curves are not exactly the same. Experimentally, they also behave differently. We have thought about this before, and it remains a big challenge to optimize the network with a much smaller lambda_min, or even lambda_min = 0.

However, this actually shows the huge potential of our A-Softmax loss. Just imagine what would happen if lambda_min could be exactly zero. For example, if we use a much deeper network to increase the fitting ability, we might be able to set lambda_min to a smaller value. I believe it could largely increase the performance. In that sense, I would say A-Softmax can make the best use of the learning ability of much deeper networks, since the softmax loss will saturate in face recognition as networks get much deeper.
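To make the annealing concrete, here is a rough MATLAB sketch of a lambda schedule that decays towards lambda_min; the functional form and the parameter values below are illustrative assumptions, not necessarily the exact settings used in this repo:

% Illustrative lambda annealing: start large (loss close to plain softmax)
% and decay towards lambda_min as training progresses.
base = 1000; gamma = 0.12; power = 1; lambda_min = 5;   % assumed values, for illustration only
iter = 0:100:20000;
lambda = max(base * (1 + gamma * iter) .^ (-power), lambda_min);
figure(2);
semilogy(iter, lambda, 'LineWidth', 2);
xlabel('iteration'); ylabel('\lambda');
title('\lambda annealed towards \lambda_{min}');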

happynear commented 7 years ago

I also think softmax with a large-margin strategy has huge potential for very deep neural networks. However, the mathematical formulations of both large-margin Softmax and A-Softmax are too complex, and the hyperparameters are difficult to tune. I drew the curves to suggest that maybe we don't need such a complex piecewise function, and don't need two hyperparameters (m and lambda) to tune the margin.

bkj commented 7 years ago

I definitely agree with @happynear. I've had reasonable success using

m * cos(theta) - m + 1

instead of the piecewise function proposed in SphereFace (which seemed odd to me -- not sure if I'm missing some reason why it needs to be like that).

You lose the angular margin interpretation but the loss is simpler to implement and doesn't have any saddle points, unlike the L-Softmax/A-Softmax versions (I would assume saddle points are bad).
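For what it's worth, here is a quick MATLAB sketch comparing the two; it reuses the psi(theta) from the plotting code above, and the choice of m = 4 is just an example for illustration:

theta = 0:1:180;
m = 4;
% Piecewise psi(theta) as in the plotting code earlier in this thread.
psi = zeros(1, length(theta));
for k = 0:m-1
    idx = theta >= k*180/m & theta <= (k+1)*180/m;
    psi(idx) = (-1)^k * cosd(m*theta(idx)) - 2*k;
end
figure(3);
plot(theta, psi, 'LineWidth', 2);
hold on;
plot(theta, m*cosd(theta) - m + 1, 'LineWidth', 2);   % the simpler linear margin
hold off;
legend('piecewise \psi(\theta), m=4', 'm\cdotcos(\theta) - m + 1');
xlabel('\theta (degrees)');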

wy1iu commented 7 years ago

I definitely encourage you to find a way to avoid the lambda decay strategy. Any strategy that makes the network converge more easily is welcome here. :)

happynear commented 7 years ago

@bkj, I have also tried m * cos(theta) - m + 1. The implementation is much easier, but I still haven't gotten results comparable to A-Softmax. I will keep trying.

bkj commented 7 years ago

I've never been able to get results as good as those reported in the SphereFace paper. It may have to do with the detection/preprocessing pipeline -- I've trained the same model on CASIA faces extracted with MTCNN (from https://github.com/davidsandberg/facenet) and with dlib, and got 99.1% and 89.3% accuracy, respectively. That's a bigger difference than I would've expected.

Regardless, in my experiments I've tried training with regular softmax for ~10 epochs, then linearly increasing m by some small amount (e.g. 0.05) for ~20 epochs. That schedule could probably be tuned, but it was sort of intuitively reasonable, and the final value of m yields a loss that's roughly the same magnitude as SphereFace with the QUADRUPLE setting and min_lambda=5.
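In MATLAB, that schedule looks roughly like this (a sketch of what I described above, not a tuned recipe; the 40-epoch horizon is just for plotting):

warmup_epochs = 10;    % plain softmax (m = 1) for about 10 epochs
ramp_epochs   = 20;    % then increase m linearly for about 20 epochs
step          = 0.05;  % increment per epoch
epochs = 1:40;
m_sched = 1 + step * max(0, min(epochs - warmup_epochs, ramp_epochs));
plot(epochs, m_sched, 'LineWidth', 2);
xlabel('epoch'); ylabel('m');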

Differing performance by detection/alignment algorithm is interesting -- if someone is able to provide me with the actual dataset of face chips they've used to successfully train SphereFace to >99% accuracy using the default settings, I'd love to run an experiment to compare how this affects model performance. I could write a blog post about it or something... I don't have MATLAB, so I can't run the original MTCNN myself.

happynear commented 7 years ago

Here is Python code that provides the same functionality as the MATLAB version: https://github.com/walkoncross/prepare-faces-zyf/blob/master/align-faces-by-mtcnn/fx_warp_and_crop_face.py

wy1iu commented 7 years ago

You could reopen the issue or start a new one if you have any further questions regarding this topic.

Erdos001 commented 6 years ago

I think it can be treated as a linear interpolation of the two cosine functions.
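Concretely, reading off the plotting code above (not the paper itself), the annealed curve is the convex combination

    f(theta) = (lambda * cos(theta) + psi(theta)) / (1 + lambda)
             = lambda/(1+lambda) * cos(theta) + 1/(1+lambda) * psi(theta)

so with lambda = 5 the piecewise psi(theta) only gets weight 1/6, which is why the combined curve behaves like a much smaller effective m.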