loss cannot decrease when training and some bugs in train_multi_gpu.py

yule-li / CosFace

Tensorflow implementation for paper CosFace: Large Margin Cosine Loss for Deep Face Recognition


loss cannot decrease when training and some bugs in train_multi_gpu.py #9

Open LCorleone opened 5 years ago

LCorleone commented 5 years ago

Hi, when I train on WebFace, the loss does not decrease. My network is the sphere network and the loss is softmax. Can anyone tell me the loss value at convergence and how many epochs you trained? Thanks!

LCorleone commented 5 years ago

https://github.com/yule-li/CosFace/blob/42648490c882c0b85718861b3e3bf03917ec745b/train/train_multi_gpu.py#L244 Is this line missing a tf.add_n? https://github.com/yule-li/CosFace/blob/42648490c882c0b85718861b3e3bf03917ec745b/train/train_multi_gpu.py#L241 Why should regularization_losses be multiplied by the weight decay when using the sphere network? Also, are the parameters in train.sh the default settings you used to train your model? I cannot get training to converge even with softmax loss and the sphere network. Could you please give me some details, such as the loss value or the number of epochs? I would really appreciate it!

yule-li commented 5 years ago

The reg_loss differs between sphere_network and the networks in tf.slim because tf.slim already multiplies each weight regularization term by args.weight_decay. Our own network implementation therefore has to do this multiplication itself.
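As a rough illustration of this point (plain Python, not code from the repository), the sketch below compares the two conventions. The weights and the decay value are made up for the example, and `l2_loss` is a stand-in that follows the definition of `tf.nn.l2_loss` (half the sum of squares). Summing the list of per-variable terms plays the role that `tf.add_n` would play on tensors.

```python
# Illustrative sketch only; names and values here are hypothetical.
# l2_loss follows the definition of tf.nn.l2_loss: sum(w ** 2) / 2.

def l2_loss(weights):
    """Half the sum of squared entries of a flat list of weights."""
    return sum(w * w for w in weights) / 2.0


def slim_style_reg_terms(variables, weight_decay):
    """Per-variable terms that are already scaled by the weight decay,
    as a slim-style L2 regularizer would produce them."""
    return [weight_decay * l2_loss(v) for v in variables]


def raw_reg_terms(variables):
    """Unscaled per-variable terms, as a hand-written network might collect them."""
    return [l2_loss(v) for v in variables]


# Hypothetical weights for two layers and a hypothetical decay value.
variables = [[1.0, -2.0], [3.0]]
weight_decay = 5e-4

# Summing the list reduces it to one scalar (the job of tf.add_n on tensors).
slim_total = sum(slim_style_reg_terms(variables, weight_decay))

# With unscaled terms, the weight decay has to be applied by hand after summing.
manual_total = weight_decay * sum(raw_reg_terms(variables))
```

Both totals agree up to floating-point rounding, so scaling by the weight decay exactly once, either inside each term or after the sum, gives the same regularization loss.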

The parameters in train.sh were chosen only for CosFace loss with a 1024-dimensional feature embedding. If you use softmax loss, you may need to set a different learning rate, like lr_coco.txt. The `softmax loss` value may be about 0.2 after 60000 iterations.

LCorleone commented 5 years ago

@yule-li Okay. Thanks very much, I will do more experiments and check my code. Thanks again!

d12306 commented 5 years ago

Hello @yule-li, first, thanks for implementing the algorithm. When I used your code to train on the CASIA dataset, the cos loss did not decrease much after 100 iterations. Is there something wrong with the learning rate (it seems a little large in your txt file), or is it something else? I hope you can help me with this issue. Thanks.