wy1iu / sphereface

Implementation for "SphereFace: Deep Hypersphere Embedding for Face Recognition" (CVPR 2017).
MIT License

Experiments on larger datasets #14

Open Zhongdao opened 7 years ago

Zhongdao commented 7 years ago

Hi guys, I tested the A-Softmax loss on the CASIA dataset and it does really well, reaching ~99% on LFW without careful tuning. But when I switch to the MS-Celeb-1M dataset for training (roughly cleaned; 98.8% on LFW with Lighten CNN), neither the 28-layer net nor the 64-layer net seems to converge. The 64-layer net is the same as the one in your paper. Have you tried A-Softmax on such a large dataset? I am still trying to tune my net but have no idea why it doesn't converge.

wy1iu commented 7 years ago

Yes, we have trained the A-Softmax loss on a much larger dataset (the MS dataset) and it definitely works.

You should consider modifying the lambda schedule or using the fine-tuning trick on a pre-trained SphereFace network.
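
For reference, the margin layer in this repo anneals lambda roughly as

    lambda = max(lambda_min, base * (1 + gamma * iteration)^(-power))

so "modifying the lambda schedule" means tuning base, gamma, power, and lambda_min in margin_inner_product_param; a larger lambda keeps the combined loss closer to the original softmax.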

wuqiangch commented 7 years ago

@wy1iu How do you train the A-Softmax loss on the MS-Celeb-1M dataset? Can you share the details of training the net? I have tried it and it doesn't converge either. Thanks!

KaleidoZhouYN commented 7 years ago

@wuqiangch I have tried the A-Softmax loss on the MS-Celeb-1M dataset.

First train a model with type "SINGLE" instead of "QUADRUPLE", then fine-tune this model after changing the type back to "QUADRUPLE". Take care that fine-tuning automatically resets the parameter "iteration" to 0, so you also need to adjust "iteration" or "lambda" yourself.
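
Roughly, the only prototxt change for the fine-tuning stage is in margin_inner_product_param; the values below are just an illustration, not a verified recipe:

margin_inner_product_param {
  num_output: 80000    # set to the number of identities in your cleaned training list
  type: QUADRUPLE      # was SINGLE during pre-training
  base: 1000
  gamma: 0.12
  power: 1
  lambda_min: 10       # keeps lambda from decaying too far on a problem with this many classes
  iteration: 20000     # fine-tuning restarts this counter at 0, so the lambda schedule restarts
                       # from "base" unless you set this (or adjust base/gamma/lambda_min) yourself
}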

I think the reason your training didn't converge is that the QUADRUPLE constraint is too strong for a classification problem with this many classes.

Good Luck!

Zhongdao commented 7 years ago

@wy1iu @KaleidoZhouYN Thanks a lot! The pre-trained model indeed helps convergence, and a proper lambda value is important. I am still tuning those hyper-parameters to get more satisfying results. By the way, have you tested your models trained on MS-1M on MegaFace?

hardegg commented 7 years ago

@Zhongdao Can you explain how you used the pre-trained model? I followed @KaleidoZhouYN and used "SINGLE" to get the pre-trained model, then used it for fine-tuning but ended up with failure. I changed "iteration" to the number of iterations used during pre-training, and/or "power" to a bigger value (e.g., 100000), but neither led to convergence.

Zhongdao commented 7 years ago

@hardegg I just use the "SINGLE" method to get the pre-trained model. Please note that sometimes the A-Softmax loss seems not to converge but the model is actually getting better; I think it might be a property of A-Softmax. In my experiment, I set lambda_min = 10 and gamma = 0.5, reaching 99.42% on LFW.

hardegg commented 7 years ago

@Zhongdao Thanks for the reply. So you mean you did not use "QUADRUPLE"?

Zhongdao commented 7 years ago

@hardegg No. First I train a model with type "SINGLE", then fine-tune this model after changing the type to "QUADRUPLE".

wuqiangch commented 7 years ago

@Zhongdao, what is the loss of your final model? When you train the "SINGLE" model, what are lambda_min and gamma? And when you fine-tune using "QUADRUPLE", what are lambda_min and gamma?

hardegg commented 7 years ago

@Zhongdao So I guess your fine-tuning loss never visibly converged in the end? Could you paste your log? I did exactly the same thing as you (pre-trained with SINGLE, then set lambda_min=10, gamma=0.5), but the softmax_loss stays at 87.3365 even after lots of iterations.

hardegg commented 7 years ago

@wy1iu I also tried training from scratch on the MS-Celeb dataset. After some failures, I tried to keep it closer to the original softmax at the beginning by making "base" bigger and "gamma" smaller (e.g., base=10000 and gamma=0.01). It did converge at first, but after a number of iterations (20k, with lambda around 40) the overall loss rose and it still diverged (with softmax_loss stuck at 87.3365). I believe SphereFace can definitely work on a larger dataset. Could you give more details (a log file would be even better)?

KaleidoZhouYN commented 7 years ago

@Zhongdao Fantastic! The model I trained on MS-Celeb-1M only got around 99.17% accuracy on LFW, so I thought there might be some problem with the margin_inner_product code and turned to the BN method instead (actually the code is not exactly the same as what the paper describes).

It seems I need to train A-Softmax on the MS dataset again. Can you share more training details, like the learning rate, weight_decay, and number of iterations for the pre-trained model? Thanks a lot.

By the way, I'll set up the MegaFace benchmark this week and hope to get a good result.

Zhongdao commented 7 years ago

@hardegg Please refer to issue #7 for details on how to pre-train with SINGLE (cosine loss). Sometimes I also saw rapid loss divergence, and I observed the same phenomenon when training with center loss. It happens randomly, so I tried many times to get a converged model. @KaleidoZhouYN Here is my solver:

net: "sphereface_model.prototxt" base_lr: 0.01 lr_policy: "multistep" gamma: 0.1 stepvalue: 160000 stepvalue: 240000 stepvalue: 280000 max_iter: 280000 display: 20 momentum: 0.9 weight_decay: 0.0005 snapshot: 2000 snapshot_prefix: "weights/ms_res64_lambda10" solver_mode: GPU

Batch_size is set to 256.

KaleidoZhouYN commented 7 years ago

@hardegg A direct way to set the learning rate is to look at the backward diff of the MarginInnerProduct layer, since the norm of its parameters is always around 1. The norm of the feature (the output of the fc5 layer) is also very important, because training only converges when the feature norm is quite large (the paper on L2-constrained feature normalization is helpful here).

BeginnerW commented 7 years ago

@wy1iu I got about the same LFW accuracy, around 99.2%, on both the large dataset and the small dataset. What might be the problem? What accuracy did you get on the large (MS) dataset? Thanks!

wy1iu commented 7 years ago

@BeginnerW Large datasets such as MS-1M have a lot of overlapping labels with LFW, so you should not directly train on MS and test on LFW. FYI, if you train on MS-1M directly without removing the overlapping identities, you can get incredibly high accuracy like 99.7.

hardegg commented 7 years ago

@wy1iu Yes, that's also what I expect to see. By training center-face on MS-1M, I got 99.63% accuracy on LFW with the original 27-layer network. For A-Softmax, I expect an even better result, but right now I am still stuck with a network that cannot converge. Any instructions? Could you explain how you trained on MS-1M? Thanks.

BeginnerW commented 7 years ago

@wy1iu I trained on the cleaned list with fewer than 4 million samples released by LightenCNN's author. I don't know why training on the large dataset cannot achieve higher accuracy than on the small one. Could using the caffemodel trained on the small dataset as the initial model for the large-dataset training be the problem? I hope you can give me some suggestions.

wy1iu commented 7 years ago

@hardegg The original softmax loss trained on MS-1M can also easily give you very high accuracy. For A-Softmax, you can consider three things: 1) train for more iterations, 2) try a smaller lambda (but not too small), and 3) use the fine-tuning trick (i.e., first train your model on MS-1M with the original softmax loss and then fine-tune with A-Softmax, as sketched below).
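
A minimal sketch of 3), assuming the prototxt layout used in this repo (num_output and the exact values are placeholders):

# Stage 1: pre-train on MS-1M with an ordinary softmax head, i.e. a plain
# InnerProduct "fc6" followed by SoftmaxWithLoss (no margin layer).
# Stage 2: initialize from the stage-1 caffemodel, swap fc6 for the
# MarginInnerProduct layer from the train prototxt, and fine-tune with e.g.:
margin_inner_product_param {
  num_output: 80000   # identities in your cleaned MS-1M list
  type: QUADRUPLE
  base: 1000          # a large base keeps early fine-tuning close to the original softmax
  gamma: 0.12
  power: 1
  lambda_min: 10      # "smaller lambda (but not too small)": the floor that lambda decays to
}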

@BeginnerW If you use the large dataset to fine-tune, you should first pre-train your model on that same dataset rather than the small one, then fine-tune with A-Softmax (training for enough iterations).

hardegg commented 7 years ago

@wy1iu Yes, correct. Directly using softmax + MS-1M can already reach good accuracy on LFW. However, when you test the model in real applications, the accuracy is not that good any more. But if you train models with center loss, better accuracy can be reached on LFW, plus you get much better accuracy in real applications. So I think SphereFace should give a better result on a larger training dataset.

Thanks for your suggestions. For 2), how small can lambda be, say 50 or 100? Do you mind sharing the training log? In fact I've tried both training from scratch and fine-tuning. For fine-tuning, I used SINGLE first and it converged very easily, but the fine-tuning itself fell into softmax_loss=87.35 after a while. For training from scratch, it cannot converge if lambda is small; if lambda is very big, say 10000, it converges but behaves much more like the original softmax.

KaleidoZhouYN commented 7 years ago

@wy1iu Hi, below is my test on LFW. The cosine-distance peak for same-person pairs is around 0.7, while a model trained with center_loss can reach 0.8 (see the attached LFW plot). I want to know whether this corresponds to your 99.7%-accuracy model?

ctgushiwei commented 7 years ago

@KaleidoZhouYN How do you draw this picture for the LFW test?

KaleidoZhouYN commented 6 years ago

I'm back and very happy to say that A-Softmax with a 28-layer ResFace network can reach acc=99.63% and TAR=99.1% @ FAR=0.1% on LFW, and a MegaFace result similar to what the paper reports, when training on the MS-Celeb-1M dataset.

KaleidoZhouYN commented 6 years ago

@ctgushiwei Save the similarities of each pair and plot them in Matlab.

vzhangmeng726 commented 6 years ago

@KaleidoZhouYN Great! Can you share more details on how to train the 28-layer ResFace network, like the learning rate, weight_decay, and number of iterations for the pre-trained model? Can you also share your model.prototxt and solver.prototxt? Thanks a lot.

ysc703 commented 6 years ago

@KaleidoZhouYN Can you show more details and your prototxts? Thanks a lot.

nyyznyyz1991 commented 6 years ago

@KaleidoZhouYN Amazing! Can you share the 28-layer model prototxt and solver.prototxt? I have been struggling to train the 64-layer ResNet with large datasets. Also, what batch_size did you use for the 28-layer ResNet: 256 on 2 GPUs, or 128 on 4 GPUs? Thanks a lot.

KaleidoZhouYN commented 6 years ago

@vzhangmeng726 @ysc703 @nyyznyyz1991
Please look through this website: https://github.com/KaleidoZhouYN/Details-on-Face-Recognition

Zhongdao commented 6 years ago

@KaleidoZhouYN Great job! I'd be glad to discuss face recognition further on WeChat, if you want. Here is my account: 13051902595.

HaoLiuHust commented 6 years ago

@KaleidoZhouYN Have you cleaned the MS-Celeb-1M dataset? It seems the dataset has some overlap with LFW.

XWalways commented 6 years ago

@KaleidoZhouYN I trained the model on the MS-Celeb dataset with m=1, base=1000, gamma=0.000025, power=35, lambda_min=0, iteration=0 and got loss=0.871851, accuracy=0.8525. Then I fine-tuned with m=4 and lr_mult=10, decay_mult=10 (in fc6); moreover, I renamed "fc6" to "fc7" (which means I did not reuse those parameters from the caffemodel), but it failed in the end. Why?

KaleidoZhouYN commented 6 years ago

@XWalways A-Softmax is good for generalization, but that doesn't mean you can do whatever you want, such as setting m=4 with lambda_min=0.

KaleidoZhouYN commented 6 years ago

@HaoLiuHust No,we didn't.

HaoLiuHust commented 6 years ago

@KaleidoZhouYN Then the reported accuracy may be higher than it really is.

XWalways commented 6 years ago

@KaleidoZhouYN m=4 means type: "QUADRUPLE" and m=1 means type: "SINGLE". You said we should train with m=1 and fine-tune with m=4; I want to know how to change the parameters when fine-tuning. Thanks.

KaleidoZhouYN commented 6 years ago

@XWalways Setting lambda_min = 0 is a bad choice.

XWalways commented 6 years ago

@KaleidoZhouYN But how should I modify the parameters when fine-tuning? I have tried many times, but all attempts failed. Thanks a lot.

HaoLiuHust commented 6 years ago

@KaleidoZhouYN Thanks. In your training, is the alignment method https://github.com/happynear/FaceVerification/dataset/CK/align_CK.py, or did you develop a new one?

johnnysclai commented 6 years ago

@KaleidoZhouYN Great work. Did you try training on the CASIA-WebFace dataset with center loss? I am wondering what the accuracy would be.

KaleidoZhouYN commented 6 years ago

@HaoLiuHust Well, on MS-Celeb-1M we use MTCNN, but on our own dataset the landmarks are different and we do not use MTCNN. If you are concerned about alignment, please see https://github.com/sciencefans/RSA-for-object-detection by SenseTime; the result is fantastic and much better than ours.

HaoLiuHust commented 6 years ago

@KaleidoZhouYN Thank you for your warm reply. Could you point out where the alignment part is? Is it in get_rect_from_pts.m?

HaoLiuHust commented 6 years ago

@KaleidoZhouYN found it, thanks

ctgushiwei commented 6 years ago

@Zhongdao @KaleidoZhouYN @wy1iu @wuqiangch Have you trained a ResNet-20 with A-Softmax, and what accuracy did you get on LFW? I trained a model that only achieves 99.1% with m=2 and lambda_min=3.

JoyLuo commented 6 years ago

@KaleidoZhouYN Do you change the iteration value manually when fine-tuning the net with QUADRUPLE? For example, if the pre-trained SINGLE model was trained for 28000 iterations, should the iteration value be set to 28000 when fine-tuning with QUADRUPLE?

MengWangTHU commented 6 years ago

@KaleidoZhouYN You said "Take care that fine-tuning will automatically set the parameter 'iteration' to 0". Do you mean that when fine-tuning, the parameter "iteration" (whose default value is 0) should also be changed? I know what the other parameters like gamma and lambda mean, but I do not know what this parameter means.

wangce888 commented 6 years ago

@KaleidoZhouYN How should the parameters be set when fine-tuning? I always get a bad result.

yxchng commented 6 years ago

@Zhongdao @KaleidoZhouYN Hi, I see that your discussion seems to suggest using lambda_min=10 with m=4. Is that true? And will lambda_min=5 not work?

shineway14 commented 6 years ago

@KaleidoZhouYN How should "iteration" or "lambda" be changed when fine-tuning? Thanks.