wy1iu / sphereface

Implementation for <SphereFace: Deep Hypersphere Embedding for Face Recognition> in CVPR'17.
MIT License

Need help to verify a very bad result produced by your released model, to be published in a paper #98

Open yxchng opened 6 years ago

yxchng commented 6 years ago

This issue is not meant to disparage your work. I may have done something wrong in my test, and I need some help verifying that my test procedure is right. I have checked my code many times and could not find any error.

The dataset is: http://www.robots.ox.ac.uk/~vgg/data/vgg_face2/. I used your face_detect_demo.m and face_align_demo.m to detect and align faces in vggface2_train (8631 people) and vggface2_test (500 people).

Following the MegaFace procedure with a slight modification, I use vggface2_train as the test set and vggface2_test as the distractor set.

For each identity in the test set, I compute the max intraclass distance between one photo (the reference) and the rest of that identity's photos. Then I compute the min interclass distance between this reference and the rest of the photos in both the test set and the distractor set, excluding photos of the identity being tested.

A good face recognition algorithm should have min interclass distance > max intraclass distance. Each image that has this property is considered correct.

Let the number of correct photos be C. Then, if there are N total photos in the dataset, the accuracy is C/N.
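The protocol described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the gist: the function name `strict_margin_accuracy`, the optional `distractors` argument, and the choice of cosine distance on L2-normalised features are all assumptions made for the example. Images that have no other photo of the same identity are counted as incorrect, which is also an assumption.

```python
import numpy as np


def strict_margin_accuracy(features, labels, distractors=None):
    """Return C/N: the fraction of labelled images whose farthest
    same-identity image is still closer than the nearest image of
    any other identity (distractors included)."""
    feats = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # Assumption: cosine distance on L2-normalised feature vectors.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    dis = None
    if distractors is not None:
        dis = np.asarray(distractors, dtype=np.float64)
        dis = dis / np.linalg.norm(dis, axis=1, keepdims=True)

    correct = 0
    for i in range(len(labels)):
        # Distances from reference image i to every labelled image,
        # computed one row at a time to avoid an N x N matrix.
        dist = 1.0 - feats @ feats[i]
        same = labels == labels[i]
        same[i] = False  # exclude the reference itself
        other = labels != labels[i]
        if not same.any():
            continue  # no intraclass pair: counted as incorrect
        inter = dist[other]
        if dis is not None:
            # Distractors only ever act as other-identity images.
            inter = np.concatenate([inter, 1.0 - dis @ feats[i]])
        if inter.size == 0:
            continue  # no interclass pair: counted as incorrect
        if dist[same].max() < inter.min():
            correct += 1
    return correct / len(labels)
```

Note how a single near-duplicate from another identity, or a single outlier photo within an identity, is enough to mark a reference image as incorrect.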

Although nobody has done such a test for face recognition, I believe this is the most rigorous, stringent, and robust testing method.

The code for my test is here: https://gist.github.com/yxchng/dec1ec6fe306082684af70d85f2590e2. I save each feature vector to a text file and read the files back in the code. You may do it your own way if you want to run the test. Hopefully, you can check that there is no logic error in my code.

The result I get for sphereface is a mind-blowing 0%, which means that for every identity there are always cases the model misrecognizes. Do you think this result is expected given my test setting? What am I doing wrong here? caffeface also gives the same result. If there is nothing wrong with my test, it is amazing that face recognition fails even with just 9131 people.

wy1iu commented 6 years ago

SphereFace can work well with VGGFace2. We have previously trained SphereFace on VGG2 successfully. Besides that, the results of InsightFace have also validated this. Your testing procedure may be buggy somewhere. BTW, the preprocessing steps are crucial, and you should make sure the preprocessing is consistent between your training and testing sets.

yxchng commented 6 years ago

@wy1iu Training on VGG2 and testing on it is different from training on CASIA-WebFace and testing on VGG2. And InsightFace has not validated this.

And as I said, I use your preprocessing code and model.

Do you need me to write in Chinese? I can write the issue in Chinese, because you do not seem to have read and understood my post.

yxchng commented 6 years ago

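This is not a conventional test protocol. Nobody has tested this way before, perhaps because the result would be 0%.

The test code is here: https://gist.github.com/yxchng/dec1ec6fe306082684af70d85f2590e2, and you can verify it. I do not think there is anything wrong with it.

To make sure the result is not affected by Python, I first save the features to txt files: https://gist.github.com/yxchng/e46674b00f60e6e8d7696344e2dd2a1c, which is a small modification of your code to save the features.

Under this protocol, an image counts as correct only if its farthest intraclass distance is smaller than its nearest interclass distance. In vggface2, each image has on average about 300 intraclass images and about 3 million interclass images.

If there are N images in total and C of them are correct, the accuracy is C/N.

The result shows that not a single image satisfies this condition:

max_intraclass_margin (farthest intraclass distance) < min_interclass_margin (nearest interclass distance)

All the preprocessing I use comes from your code; even if there are differences, they are tiny: https://github.com/wy1iu/sphereface/issues/93, and they will not affect the conclusion.

Why test this way? Because high-security applications cannot tolerate any misrecognition. This test is the most rigorous. Relying only on the commonly used protocols, on which everyone scores above 99%, may give a misleading impression of the true accuracy of face recognition.

I have not tested insightface yet. I will test it next, but they have definitely not tested this way either.

I hope you can help verify this. Thanks.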

wy1iu commented 6 years ago

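@yxchng If I understand correctly, you took a model trained on CASIA and tested it directly on VGG2 data, using the protocol you described. I have not checked your test code (assuming your preprocessing, testing, and so on are fine), but I think you can still do a sanity check: run the same test with models other than sphereface, such as center loss or a plain softmax loss, and see what the results are. I think it may well be possible that a model trained on CASIA (such as the sphereface pretrained model we provide) cannot achieve the "max intraclass distance smaller than min interclass distance" property you describe on VGG2. After all, sphereface only requires producing as large an angular margin as possible on the training set; whether that generalizes fully to a completely different test set is genuinely hard to say.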

happynear commented 6 years ago

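@yxchng Your test method is similar to measuring TPR at FAR=0. Nobody uses an evaluation criterion this strict. vggface2 has about 3M images, I think, so a metric like FAR=1e-6 or 1e-7 would be more appropriate.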
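For comparison, the alternative raised in the comment above (TPR at a small fixed FAR instead of at FAR=0) could be sketched as follows. This is only an illustration under stated assumptions: the function name `tpr_at_far` is hypothetical, scores are taken to be similarities where higher means more alike, and a pair is accepted only when its score is strictly above the threshold.

```python
import numpy as np


def tpr_at_far(genuine, impostor, far):
    """Return (tpr, threshold) at a target false accept rate.

    genuine:  similarity scores of same-identity pairs
    impostor: similarity scores of different-identity pairs
    far:      target false accept rate, e.g. 1e-6
    """
    genuine = np.asarray(genuine, dtype=np.float64)
    # Impostor scores sorted from highest to lowest.
    impostor = np.sort(np.asarray(impostor, dtype=np.float64))[::-1]
    # Number of impostor pairs that may be accepted at this FAR.
    k = int(np.floor(far * impostor.size))
    if k < impostor.size:
        # Accepting only scores strictly above the (k+1)-th highest
        # impostor score admits at most k impostor pairs.
        threshold = impostor[k]
    else:
        threshold = -np.inf  # every pair is accepted
    tpr = float(np.mean(genuine > threshold))
    return tpr, threshold
```

With `far=0` the threshold becomes the highest impostor score, so the function reports the fraction of genuine pairs that beat every impostor pair. That is a pooled, pair-level relative of the strict per-image criterion discussed in this thread, not the same quantity.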