PhyscalX opened this issue 7 years ago
Hi @PhyscalX,
Here I just ignore the sign, because it cancels when multiplied by -1 * p_i * (p_i - 1) (or -1 * p_j * p_i), which is the derivative of the normal cross-entropy loss with softmax. You can see lines 141 and 144 in the .cu file.
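The cancellation being discussed rests on the standard softmax cross-entropy gradient, dL/dz_j = p_j - [j == t]. As a quick sanity check (a standalone Python sketch, not the .cu code itself), finite differences agree with that expression:

```python
import math

def softmax(z):
    # numerically stable softmax
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_loss(z, t):
    # plain softmax cross-entropy: -log(p_t)
    return -math.log(softmax(z)[t])

z, t, h = [0.5, -1.2, 2.0], 0, 1e-6
p = softmax(z)
for j in range(len(z)):
    zp, zm = list(z), list(z)
    zp[j] += h
    zm[j] -= h
    numeric = (ce_loss(zp, t) - ce_loss(zm, t)) / (2 * h)
    analytic = p[j] - (1.0 if j == t else 0.0)  # p_j - [j == t]
    assert abs(numeric - analytic) < 1e-5
```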
Your mathematical reasoning is right; the sign cancels. But "(power_prob_data[ind_i] / (1 - prob_data[ind_i]))" multiplied by "(prob_data[ind_i] - 1)" also yields "-power_prob_data[ind_i]", when it should actually be "power_prob_data[ind_i]".
Besides, calling log(p) directly is dangerous, because the lower bound of the softmax outputs can be very small. Prefer to add an eps (e.g. 1e-10) first.
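That suggestion amounts to flooring the probability before the log. A minimal sketch (the eps value follows the 1e-10 suggested here; the helper name is my own):

```python
import math

EPS = 1e-10  # floor suggested in the discussion

def safe_log(p, eps=EPS):
    # clamp the softmax output away from zero before taking the log;
    # math.log(0.0) raises ValueError, and the log of a denormal
    # probability is a huge negative number
    return math.log(p + eps)
```

With this guard, safe_log(0.0) is about -23.03 (i.e. log(1e-10)) instead of raising.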
Besides, your loss is wrong too.
In the code:
"loss[index] = -log(max(power_prob_data[ind] * log_prob_data[ind], Dtype(FLT_MIN)));"
However, it should be:
"loss[index] = -power_prob_data[ind] * log_prob_data[ind];"
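Written out per sample, the corrected line computes -(1 - p_t)^gamma * log(p_t), where power_prob corresponds to (1 - p_t)^gamma and log_prob to log(p_t). A scalar Python sketch of that formula (with the eps guard suggested earlier folded in):

```python
import math

def focal_loss(p_t, gamma=2.0, eps=1e-10):
    # per-sample focal loss: -(1 - p_t)^gamma * log(p_t)
    power_prob = (1.0 - p_t) ** gamma   # the modulating factor
    log_prob = math.log(p_t + eps)      # eps guards against p_t == 0
    return -power_prob * log_prob
```

With gamma = 0 this reduces to ordinary cross-entropy, and well-classified examples (p_t near 1) are down-weighted, which is the point of the focal term.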
Hi @PhyscalX,
You are right, the loss was computed wrongly, and thanks for the reminder about the log operation. I will update my code tomorrow and run some tests.
For the first term you mentioned, I need to double-check.
Thanks again.
I have verified my idea on cifar10-quick; it is right, and got similar validation accuracy to the original loss at (alpha = 1.0/0.75/0.5/0.25, gamma = 2.0).
eps is very important in focal loss: all the divisions in your code are dangerous. When alpha > 0.25, it tends to encounter NaN on cifar10-quick at the end of convergence.
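The danger comes from factors like power_prob / (1 - prob) in the backward pass: as prob approaches 1 the denominator vanishes, and at exactly 1 it is a division by zero. A small illustration of the eps guard (the function name is mine, mirroring the quantities in the thread):

```python
def grad_factor(p, gamma=2.0, eps=1e-10):
    # (1 - p)^gamma / (1 - p): finite for all p in [0, 1] thanks to eps
    return (1.0 - p) ** gamma / (1.0 - p + eps)

# Without eps, p == 1.0 gives 0.0 / 0.0 -> ZeroDivisionError in Python
# (and a NaN in CUDA, which then poisons every later update).
```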
I preset eps as 1e-10 in my framework Dragon. If you are interested in it, check the following code:

- op_kernel.h, line 336: the declaration of kernel::SparseSoftmaxFocalLoss
- op_kernel.cc, line 777: the CPU implementation of kernel::SparseSoftmaxFocalLoss
- op_kernel.cu, line 1417: the CUDA implementation of kernel::SparseSoftmaxFocalLoss
- sparse_softmax_focal_loss_op.h: the declaration of SparseSoftmaxFocalLossOp
- sparse_softmax_focal_loss_op.cc: the implementation of SparseSoftmaxFocalLossOp
Hi @PhyscalX,
Thanks a lot. I have fixed the problems you pointed out. For the gradient, I forgot to differentiate the (1 - p_t) term, which is why the sign was dropped; now I have added it back.
Thanks again.
Hi @PhyscalX,
You're right, eps is very important; I added it to solve the NaN problem. It now runs normally, you can have a look.
Thanks for pointing out my errors and giving such useful suggestions.
And lastly, I recommend you multiply "grad" by prob_data[ind_i] instead of dividing by it; dividing directly may still lead to numerical issues. Formulate in ONE way, and implement in ANOTHER. There are numerous tricks in programming mathematical formulations.
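One way to read this advice: the 1/p from d(-log p)/dp cancels against the p in dp/dz = p * (1 - p), so multiply through on paper before implementing instead of dividing at runtime. A sketch on the scalar target-class gradient (my own derivation of the generic focal-loss formula, not the exact Dragon code):

```python
import math

def grad_formulated(p, gamma=2.0, eps=1e-10):
    # chain rule written literally: dL/dp contains a 1/p term,
    # then dp/dz = p * (1 - p) multiplies it back
    dL_dp = gamma * (1.0 - p) ** (gamma - 1.0) * math.log(p + eps) \
            - (1.0 - p) ** gamma / (p + eps)
    return dL_dp * p * (1.0 - p)

def grad_implemented(p, gamma=2.0, eps=1e-10):
    # same quantity after cancelling the 1/p against the p factor:
    # gamma * p * (1-p)^gamma * log(p) - (1-p)^(gamma+1); no division left
    return gamma * p * (1.0 - p) ** gamma * math.log(p + eps) \
           - (1.0 - p) ** (gamma + 1.0)
```

Both agree numerically, but the second form has no division at all, so nothing blows up as p approaches 0 or 1.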
Hi @PhyscalX,
I have updated it. Thanks a lot.
In the code: "gamma * (power_prob_data[ind_i] / (1 - prob_data[ind_i])) * log_prob_data[ind_i]". However, if (i == j), the factor (prob_data[ind_i] - 1) should make it "-gamma * (power_prob_data[ind_i] / (1 - prob_data[ind_i])) * log_prob_data[ind_i]"; otherwise it becomes gradient ascent optimization.
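The sign can be checked numerically. Differentiating FL = -(1 - p_t)^gamma * log(p_t) through the softmax with respect to the target logit (the i == j case) gives gamma * p_t * (1 - p_t)^gamma * log(p_t) - (1 - p_t)^(gamma + 1). A finite-difference sketch in Python (standalone, not the CUDA kernel):

```python
import math

def softmax(z):
    # numerically stable softmax
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def focal_loss(z, t, gamma=2.0):
    p = softmax(z)[t]
    return -((1.0 - p) ** gamma) * math.log(p)

def target_grad(z, t, gamma=2.0):
    # analytic gradient w.r.t. the target logit (i == j):
    # gamma * p * (1-p)^gamma * log(p) - (1-p)^(gamma+1)
    p = softmax(z)[t]
    return gamma * p * (1.0 - p) ** gamma * math.log(p) \
           - (1.0 - p) ** (gamma + 1)

z, t, h = [0.3, 1.1, -0.7], 1, 1e-6
zp, zm = list(z), list(z)
zp[t] += h
zm[t] -= h
numeric = (focal_loss(zp, t) - focal_loss(zm, t)) / (2 * h)
assert abs(numeric - target_grad(z, t)) < 1e-5
```

Flipping the sign of the first term, which is the bug discussed here, makes this check fail: the update would ascend the loss rather than descend it.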