zhen8838 / Circle-Loss

Tensorflow2 implementation of CircleLoss. Support class-level, sparse class-level, pair-wise labels
MIT License

Confusion on comments 'y_pred must be cos similarity' #3

Open seusofthd opened 4 years ago

seusofthd commented 4 years ago

In every CircleLoss implementation there is a comment saying 'y_pred must be cos similarity', which confuses me a little. For image classification, shouldn't it also accept logits as y_pred?

Another question I have: for both CircleLoss and SparseCircleLoss, the calculation is only correct if there is exactly one positive pair per sample (K = 1 for s_p). Is that correct?

seusofthd commented 4 years ago

And for implementing the Sparse version, it should be equivalent to converting y_true to one-hot labels and then directly applying the CircleLoss class's call, is that correct?

zhen8838 commented 4 years ago
  1. For image classification, I use the following code to ensure y_pred is a cosine similarity:

      kl.Lambda(lambda x: tf.nn.l2_normalize(x, 1), name='emmbeding'),
      kl.Dense(10, use_bias=False, kernel_constraint=k.constraints.unit_norm())
  2. Yes. But if K > 1, SparseCircleLoss is actually equivalent to PairCircleLoss.

  3. yes.
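The two layers above can be checked in isolation: an l2-normalized embedding multiplied by bias-free, unit-norm class weights yields outputs that are exactly cos(theta) between the embedding and each class weight, which is the y_pred CircleLoss expects. The sketch below (dimensions are assumptions, not from the repo) applies the unit-norm constraint once by hand, since Keras only enforces constraints after optimizer steps:

```python
import tensorflow as tf

# Assumed dims for illustration: 64-d embedding, 10 classes.
emb_dim, n_classes = 64, 10
dense = tf.keras.layers.Dense(
    n_classes, use_bias=False,
    kernel_constraint=tf.keras.constraints.unit_norm())
dense.build((None, emb_dim))
# Constraints are applied after each optimizer step, not at init;
# apply once here so the cosine property holds before any training.
dense.kernel.assign(dense.kernel_constraint(dense.kernel))

x = tf.random.normal([4, emb_dim])
emb = tf.nn.l2_normalize(x, axis=1)  # the 'emmbeding' Lambda layer
y_pred = dense(emb)                  # each entry is a cosine similarity
max_abs = float(tf.reduce_max(tf.abs(y_pred)))
```

Because both the embedding rows and the kernel columns have unit norm, every logit is a dot product of two unit vectors and therefore lies in [-1, 1].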

seusofthd commented 4 years ago

Thanks for your quick reply!

For 2, if K > 1 I still have some questions. The positive side becomes a sum of exp(-gamma * ... * s_p ...) terms, which means you cannot treat the sum of positive exps as a single denominator term. So you cannot just use -r_sp_m * self.gamma + logZ as the denominator inside the log. Concretely, with two positive pairs, r_sp_m would have shape [batch_size, 2], which cannot be combined with logZ (shape [batch_size, 1]).
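One way to sketch the K > 1 case: fold each side with logsumexp first, so the positive term becomes [batch_size] again before it is combined with the negative-side logZ. This follows the unified circle loss formula L = softplus(logsumexp_j(logit_n) + logsumexp_i(logit_p)); the shapes and margin values below are illustrative assumptions, not the repo's code:

```python
import tensorflow as tf

# Hypothetical shapes: K = 2 positives and 5 negatives per sample.
gamma, margin = 64.0, 0.25
sp = tf.constant([[0.9, 0.8], [0.7, 0.95]])   # [batch, K=2] positive sims
sn = tf.random.uniform([2, 5], -1.0, 0.5)     # [batch, 5] negative sims

# Self-paced weights and margins from the circle loss formulation.
ap = tf.nn.relu(1.0 + margin - sp)            # alpha_p, Op = 1 + m
an = tf.nn.relu(sn + margin)                  # alpha_n, On = -m
logit_p = -gamma * ap * (sp - (1.0 - margin)) # [batch, K]
logit_n = gamma * an * (sn - margin)          # [batch, 5]

# L = log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
#   = softplus(logsumexp(logit_n) + logsumexp(logit_p)), per sample.
loss = tf.math.softplus(
    tf.reduce_logsumexp(logit_n, axis=1)
    + tf.reduce_logsumexp(logit_p, axis=1))   # shape [batch]
```

The two logsumexp reductions are what resolve the [batch_size, 2] vs [batch_size, 1] mismatch described above.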

And for training on CIFAR, how did you schedule the learning rate? Did you just use the default lr in Adam? I am trying to use this for ResNet50 training on ImageNet, and it seems very sensitive to the learning rate.

zhen8838 commented 4 years ago

I think the optimization goal set by circle loss cannot be fully achieved, so its loss gradient will be quite large. At the same time, ap and an are very large during the first few training epochs because of the large gap.

But overall, circle loss is still relatively robust. When starting from a pretrained model, I suggest reducing the learning rate by a factor of 10.
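The advice above can be sketched as follows. Cutting Adam's default 1e-3 by 10x, optionally with a further plateau-based decay, is one concrete reading of it; the exact values are assumptions, not the author's recipe:

```python
import tensorflow as tf

# Assumed starting point: Adam's default 1e-3 reduced by a factor of 10.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)

# Optional: decay further when validation loss plateaus (values assumed).
lr_cb = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.1, patience=3, min_lr=1e-6)

lr_value = float(optimizer.learning_rate.numpy())
```

The callback would be passed via `callbacks=[lr_cb]` in `model.fit(...)`.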