wy1iu / sphereface

Implementation for "SphereFace: Deep Hypersphere Embedding for Face Recognition" in CVPR'17.
MIT License

Reproducing the performance on LFW? #97

Open zeakey opened 6 years ago

zeakey commented 6 years ago

I retrained SpherefaceNet-20 from scratch, but the performance on LFW cannot reach 99.30%. The evaluation log is below:

| fold | ACC    |
|------|--------|
| 1    | 98.83% |
| 2    | 98.83% |
| 3    | 98.50% |
| 4    | 99.17% |
| 5    | 99.00% |
| 6    | 99.50% |
| 7    | 99.17% |
| 8    | 99.50% |
| 9    | 99.83% |
| 10   | 99.33% |
| AVE  | 99.17% |
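
For reference, the 99.17% average is just the mean of the ten fold accuracies; a minimal sketch, with the values copied from the table above:

```python
# Minimal sketch: reproduce the 10-fold average from the table above.
fold_acc = [98.83, 98.83, 98.50, 99.17, 99.00, 99.50,
            99.17, 99.50, 99.83, 99.33]
print(f"AVE {sum(fold_acc) / len(fold_acc):.2f}%")  # AVE 99.17%
```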

My training environment is listed in #93.

All other settings are kept the same as your default prototxt configuration files; the only thing that may be unclear is the number of GPUs.

There are two suspicious factors that may cause the failure:

  1. The actual batch size, since I use 2 GPUs with a batch size of 256 each (effective batch size 512).
  2. The training data, since I downloaded the cropped images from the web.

Here is my training log: http://data.kaiz.xyz/log/retrain_sphereface_June2-2018.log


Additionally, with the released pretrained model sphereface_model.caffemodel I only obtain an average accuracy of 99.27%. This may be a minor problem, and it has been mentioned in https://github.com/wy1iu/sphereface/issues/93.

wy1iu commented 6 years ago

It is most likely the second factor you mentioned: there may be problems in your downloaded cropped dataset, or mismatches between the cropped dataset and the original LFW dataset. You need to follow exactly the same pipeline as ours.
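
For example, a quick way to check for such mismatches (a rough sketch, not part of our pipeline; the two directory paths are placeholders for wherever the original and cropped copies live):

```python
# Sketch: compare the image lists of the original LFW and a downloaded
# cropped copy; any asymmetric difference hints at a dataset mismatch.
# The two root paths below are hypothetical placeholders.
import os

def image_index(root):
    """Collect the relative paths of all .jpg files under root."""
    index = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".jpg"):
                index.add(os.path.relpath(os.path.join(dirpath, name), root))
    return index

original = image_index("lfw_original")  # placeholder path
cropped = image_index("lfw_cropped")    # placeholder path
print("only in original:", len(original - cropped))
print("only in cropped:", len(cropped - original))
```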

Besides that, you might also want to retrain the network multiple times to see whether the gap is simply bad luck.

As for the 99.27% with the pretrained model, I have explained it in #93. We can successfully obtain 99.30%.

zeakey commented 6 years ago

@wy1iu Thanks for your reply. How many GPUs do you use in your training? This is related to the effective batch size. Following your guidance (`./code/sphereface_train.sh 0,1`) and the default hyper-parameters, I use two GPUs with a batch size of 256 each, so the effective batch size is 256 × 2 = 512.
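
Spelled out (a sketch of the arithmetic; the per-GPU value of 256 is the batch size from the default train prototxt, and Caffe's multi-GPU data parallelism runs that batch on every listed device):

```python
# Sketch: effective batch size when Caffe replicates the prototxt
# batch_size on each GPU passed to sphereface_train.sh.
batch_size_per_gpu = 256  # batch_size in the train prototxt
num_gpus = 2              # ./code/sphereface_train.sh 0,1
print(batch_size_per_gpu * num_gpus)  # 512
```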


I've posted my system environment in #93, which may help us figure out the performance gap.

wy1iu commented 6 years ago

The detailed settings are available in the training log we released. The provided models were trained with the exact same settings as in the repository. As for other unpredictable factors such as the versions of Caffe, CUDA, or cuDNN, I am not sure how they affect training.
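
One way to pin those down is to record the exact toolchain versions alongside the training log; a sketch using common CLI tools (the cudnn.h path is a typical default and may differ per install):

```python
# Sketch: capture toolchain versions that could explain training
# differences. Commands and paths are common defaults, not guaranteed.
import subprocess

def run(cmd):
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return (out.stdout or out.stderr).strip()

print(run("nvcc --version"))
print(run("grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h"))
print(run("nvidia-smi --query-gpu=driver_version --format=csv,noheader"))
```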