PhyscalX opened this issue 7 years ago
Hi @PhyscalX,
Here I just ignore the sign, because it cancels when multiplied by -1 * p_i * (p_i - 1) (or -1 * p_j * p_i), which is the derivative of the normal cross-entropy loss with softmax. You can see lines 141 and 144 in the .cu file.
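The cancellation being discussed rests on the standard softmax cross-entropy gradient, dL/dz_j = p_j - [j == t]. As a quick sanity check (a standalone Python sketch, not the .cu code itself), finite differences agree with that expression:

```python
import math

def softmax(z):
    # numerically stable softmax
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_loss(z, t):
    # plain softmax cross-entropy: -log(p_t)
    return -math.log(softmax(z)[t])

z, t, h = [0.5, -1.2, 2.0], 0, 1e-6
p = softmax(z)
for j in range(len(z)):
    zp, zm = list(z), list(z)
    zp[j] += h
    zm[j] -= h
    numeric = (ce_loss(zp, t) - ce_loss(zm, t)) / (2 * h)
    analytic = p[j] - (1.0 if j == t else 0.0)  # p_j - [j == t]
    assert abs(numeric - analytic) < 1e-5
```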
Your mathematical reasoning is right; the sign cancels. But "(power_prob_data[ind_i] / (1 - prob_data[ind_i]))" multiplied by "(prob_data[ind_i] - 1)" also yields "-power_prob_data[ind_i]", when it should actually be "power_prob_data[ind_i]".
Besides, calling log(p) directly is dangerous, because the lower bound of the softmax outputs can be very small. Prefer to add an eps (e.g. 1e-10) first.
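That suggestion amounts to flooring the probability before the log. A minimal sketch (the eps value follows the 1e-10 suggested here; the helper name is my own):

```python
import math

EPS = 1e-10  # floor suggested in the discussion

def safe_log(p, eps=EPS):
    # clamp the softmax output away from zero before taking the log;
    # math.log(0.0) raises ValueError, and the log of a denormal
    # probability is a huge negative number
    return math.log(p + eps)
```

With this guard, safe_log(0.0) is about -23.03 (i.e. log(1e-10)) instead of raising.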
Besides, your loss is wrong too.
In the code:
"loss[index] = -log(max(power_prob_data[ind] * log_prob_data[ind], Dtype(FLT_MIN)));"
However, it should be:
"loss[index] = -power_prob_data[ind] * log_prob_data[ind];"
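Written out per sample, the corrected line computes -(1 - p_t)^gamma * log(p_t), where power_prob corresponds to (1 - p_t)^gamma and log_prob to log(p_t). A scalar Python sketch of that formula (with the eps guard suggested earlier folded in):

```python
import math

def focal_loss(p_t, gamma=2.0, eps=1e-10):
    # per-sample focal loss: -(1 - p_t)^gamma * log(p_t)
    power_prob = (1.0 - p_t) ** gamma   # the modulating factor
    log_prob = math.log(p_t + eps)      # eps guards against p_t == 0
    return -power_prob * log_prob
```

With gamma = 0 this reduces to ordinary cross-entropy, and well-classified examples (p_t near 1) are down-weighted, which is the point of the focal term.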
Hi @PhyscalX,
You are right, the loss was computed wrongly, and thanks for the reminder about the log operation. I will update my code tomorrow and run some tests.
For the first term you mentioned, I need to double-check.
Thanks again.
I have verified my idea on cifar10-quick; it is right, and got similar validation accuracy to the original loss at (alpha = 1.0/0.75/0.5/0.25, gamma = 2.0).
eps is very important in focal loss: all the divisions in your code are dangerous. When alpha > 0.25, it tends to encounter NaN on cifar10-quick at the end of convergence.
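The danger comes from factors like power_prob / (1 - prob) in the backward pass: as prob approaches 1 the denominator vanishes, and at exactly 1 it is a division by zero. A small illustration of the eps guard (the function name is mine, mirroring the quantities in the thread):

```python
def grad_factor(p, gamma=2.0, eps=1e-10):
    # (1 - p)^gamma / (1 - p): finite for all p in [0, 1] thanks to eps
    return (1.0 - p) ** gamma / (1.0 - p + eps)

# Without eps, p == 1.0 gives 0.0 / 0.0 -> ZeroDivisionError in Python
# (and a NaN in CUDA, which then poisons every later update).
```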
I preset eps as 1e-10 in my framework Dragon. If you are interested in it, check the following code:

- op_kernel.h, line 336: the declaration of kernel::SparseSoftmaxFocalLoss
- op_kernel.cc, line 777: the CPU implementation of kernel::SparseSoftmaxFocalLoss
- op_kernel.cu, line 1417: the CUDA implementation of kernel::SparseSoftmaxFocalLoss
- sparse_softmax_focal_loss_op.h: the declaration of SparseSoftmaxFocalLossOp
- sparse_softmax_focal_loss_op.cc: the implementation of SparseSoftmaxFocalLossOp
Hi @PhyscalX,
Thanks a lot. I have fixed the problems you pointed out. For the gradient, I forgot to differentiate the (1 - p_t) term, which is why the sign was dropped; now I have added it back.
Thanks again.
Hi @PhyscalX,
You're right, eps is very important; I added it to solve the NaN problem. It now runs normally, you can have a look.
Thanks for pointing out my errors and giving such useful suggestions.
And lastly, I recommend you multiply "grad" by prob_data[ind_i] instead of dividing by it; dividing directly may still lead to numerical issues. Formulate in ONE way, and implement in ANOTHER. There are numerous tricks in programming mathematical formulations.
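One way to read this advice: the 1/p from d(-log p)/dp cancels against the p in dp/dz = p * (1 - p), so multiply through on paper before implementing instead of dividing at runtime. A sketch on the scalar target-class gradient (my own derivation of the generic focal-loss formula, not the exact Dragon code):

```python
import math

def grad_formulated(p, gamma=2.0, eps=1e-10):
    # chain rule written literally: dL/dp contains a 1/p term,
    # then dp/dz = p * (1 - p) multiplies it back
    dL_dp = gamma * (1.0 - p) ** (gamma - 1.0) * math.log(p + eps) \
            - (1.0 - p) ** gamma / (p + eps)
    return dL_dp * p * (1.0 - p)

def grad_implemented(p, gamma=2.0, eps=1e-10):
    # same quantity after cancelling the 1/p against the p factor:
    # gamma * p * (1-p)^gamma * log(p) - (1-p)^(gamma+1); no division left
    return gamma * p * (1.0 - p) ** gamma * math.log(p + eps) \
           - (1.0 - p) ** (gamma + 1.0)
```

Both agree numerically, but the second form has no division at all, so nothing blows up as p approaches 0 or 1.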
Hi @PhyscalX,
I have updated it. Thanks a lot.
In the code: "gamma * (power_prob_data[ind_i] / (1 - prob_data[ind_i])) * log_prob_data[ind_i]". However, if (i == j), the factor (prob_data[ind_i] - 1) should make it "-gamma * (power_prob_data[ind_i] / (1 - prob_data[ind_i])) * log_prob_data[ind_i]"; otherwise it becomes gradient ascent optimization.
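The sign can be checked numerically. Differentiating FL = -(1 - p_t)^gamma * log(p_t) through the softmax with respect to the target logit (the i == j case) gives gamma * p_t * (1 - p_t)^gamma * log(p_t) - (1 - p_t)^(gamma + 1). A finite-difference sketch in Python (standalone, not the CUDA kernel):

```python
import math

def softmax(z):
    # numerically stable softmax
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def focal_loss(z, t, gamma=2.0):
    p = softmax(z)[t]
    return -((1.0 - p) ** gamma) * math.log(p)

def target_grad(z, t, gamma=2.0):
    # analytic gradient w.r.t. the target logit (i == j):
    # gamma * p * (1-p)^gamma * log(p) - (1-p)^(gamma+1)
    p = softmax(z)[t]
    return gamma * p * (1.0 - p) ** gamma * math.log(p) \
           - (1.0 - p) ** (gamma + 1)

z, t, h = [0.3, 1.1, -0.7], 1, 1e-6
zp, zm = list(z), list(z)
zp[t] += h
zm[t] -= h
numeric = (focal_loss(zp, t) - focal_loss(zm, t)) / (2 * h)
assert abs(numeric - target_grad(z, t)) < 1e-5
```

Flipping the sign of the first term, which is the bug discussed here, makes this check fail: the update would ascend the loss rather than descend it.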