Closed: yjh576 closed this issue 4 years ago
Hi,
The code you mentioned is exactly consistent with the loss function Eq. (8) in our paper (https://openreview.net/pdf?id=rJlnOhVYPS). I guess you have mistaken Eq. (7) for our loss function, but in fact Eq. (7) serves Eq. (8). Please check it again.
Our loss function is a binary cross-entropy loss with soft labels. For example, the conventional binary cross-entropy loss with hard labels is -qlogp-(1-q)log(1-p), where q is either 0 or 1 and p is within [0,1]. In our loss function, q and p are both within [0,1].
Your modification could be seen as -qlogp, losing half of the regularization.
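For example, a minimal sketch of the difference (illustration only, not code from our repo):
import torch

p = torch.tensor([0.7, 0.2, 0.9])        # predictions in [0, 1]
q_hard = torch.tensor([1.0, 0.0, 1.0])   # hard labels: q is either 0 or 1
q_soft = torch.tensor([0.8, 0.3, 0.6])   # soft labels: q is anywhere in [0, 1]

def bce(q, p, eps=1e-12):
    # -q*log(p) - (1-q)*log(1-p), averaged over the batch
    return (-q * (p + eps).log() - (1 - q) * (1 - p + eps).log()).mean()

print(bce(q_hard, p))  # conventional BCE with hard labels
print(bce(q_soft, p))  # soft-label BCE; dropping the (1-q)log(1-p) term
                       # would lose half of the regularization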
Thank you for your quick reply. Sorry, I made a mistake. It is indeed correct that the conventional binary cross-entropy loss with hard labels is -qlogp-(1-q)log(1-p). I understand it now. Thank you.
Your work is good. I also have a question. I ran your code and found that the following code is very important for performance:
model_1.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_2.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_1_ema.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_2_ema.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
Why? What does the above code achieve? Removing it leads to worse performance than keeping it.
The loss function Eq. (7) is -qlogp-(1-q)log(1-p), where q is either 0 or 1 and p is within [0,1]. In fact, it reduces to -logp because q = 1. But I find that the code is
self.criterion_tri = SoftTripletLoss(margin=0.0).cuda()
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean()
I think the above code is -log(1-p), which is not -logp. I can't understand it.
I'm looking forward to your reply!
Regarding the classifier-weight initialization, please refer to https://github.com/yxgeee/MMT/issues/16
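As a rough illustration of the pattern (a hypothetical minimal sketch, not our actual training script):
import torch
import torch.nn.functional as F

num_clusters, feat_dim = 500, 2048
classifier = torch.nn.Linear(feat_dim, num_clusters, bias=False)

# cluster_centers would normally come from k-means on the extracted features;
# random values are used here only to keep the sketch self-contained.
cluster_centers = torch.randn(num_clusters, feat_dim)

with torch.no_grad():
    classifier.weight.data[:num_clusters].copy_(
        F.normalize(cluster_centers, dim=1))

# If the input features are also L2-normalized, each logit is then the cosine
# similarity between a feature and the corresponding cluster centroid, so the
# classifier starts out consistent with the clustering-based pseudo labels.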
As for the SoftTripletLoss question: when margin=0.0, loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean() can be thought of as loss = (- triple_dist[:,1]).mean(). The value of triple_dist[:,1] is exactly the same as Eq. (7) in the paper. Please check.
loss = (- triple_dist[:,1]).mean() means the Euclidean distance between anchor and negative should be larger than the Euclidean distance between anchor and positive.
-qlogp-(1-q)log(1-p) is only a simplified formulation of the BCE loss. If you want to align it with our Eq. (6), you should use q = 1-self.margin and p = triple_dist[:,1], where 1-p = triple_dist[:,0].
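To make the margin=0.0 case concrete, here is a small numeric sketch (made-up distances, not code from our repo):
import torch
import torch.nn.functional as F

dist_ap = torch.tensor([0.8, 1.2])   # anchor-positive Euclidean distances
dist_an = torch.tensor([1.5, 0.9])   # anchor-negative Euclidean distances

triple_dist = torch.stack((dist_ap, dist_an), dim=1)
log_probs = F.log_softmax(triple_dist, dim=1)

margin = 0.0
loss_full = (- margin * log_probs[:, 0] - (1 - margin) * log_probs[:, 1]).mean()
loss_simplified = (- log_probs[:, 1]).mean()
assert torch.allclose(loss_full, loss_simplified)

# In BCE terms: p = softmax(triple_dist)[:, 1] (negative column), 1-p is the
# positive column, and q = 1 - margin = 1, so the loss is -log p, which is
# minimized when dist_an is much larger than dist_ap.
print(loss_full)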
I think that loss = (- triple_dist[:,1]).mean() refers to the similarity between anchor and negative, and that this similarity should be smaller than the similarity between anchor and positive.
I think that, to match your Eq. (6), it should be loss = (- triple_dist[:,0]).mean(): we hope the similarity between anchor and positive becomes larger. This is because hard_p corresponds to the largest similarity between anchor and positive according to the code. I have some confusion about this. Please refer to this code:
sorted_mat_distance, positive_indices = torch.sort(mat_distance + (-9999999.) * (1 - mat_similarity), dim=1, descending=True)
hard_p = sorted_mat_distance[:, 0]
hard_p_indice = positive_indices[:, 0]
sorted_mat_distance, negative_indices = torch.sort(mat_distance + (9999999.) * (mat_similarity), dim=1, descending=False)
hard_n = sorted_mat_distance[:, 0]
hard_n_indice = negative_indices[:, 0]
Please note that we use Euclidean distance instead of cosine similarity in our code to measure the feature similarity (https://github.com/yxgeee/MMT/blob/master/mmt/loss/triplet.py#L78). The Euclidean distance between anchor and negative should be larger than the Euclidean distance between anchor and positive.
Also, in Equation (7) of our paper, we use the root form of the Euclidean distance, which is also called the L2-norm distance.
A larger Euclidean distance indicates a smaller similarity, and vice versa.
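For intuition, here is a quick check of the standard identity behind this (a sketch assuming L2-normalized features, not code from our repo):
import torch
import torch.nn.functional as F

# For unit-norm features, ||a - b||^2 = 2 - 2 * (a . b), so a larger Euclidean
# distance corresponds exactly to a smaller cosine similarity.
a = F.normalize(torch.randn(8, 128), dim=1)
b = F.normalize(torch.randn(8, 128), dim=1)

sq_dist = (a - b).pow(2).sum(dim=1)
cos_sim = (a * b).sum(dim=1)
assert torch.allclose(sq_dist, 2 - 2 * cos_sim, atol=1e-5)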
Thank you. I have some confusion about Equation (7) and Equation (2). I think they have the same function, which is why I thought so.
Sorry, I meant Equation (6) and Equation (2).
Yes, Eq. (2) and Eq. (6) have the same function. Eq. (6) is just a hard-version softmax-triplet loss, which is also supervised by hard 0/1 labels. The CORE idea of our paper is Eq. (8), which is a soft-version softmax-triplet loss that supports mean-teaching. We introduce Eq. (6) because the conventional hard-version triplet loss Eq. (2) does not have a soft-version variant to support mean-teaching.
There are several lines of code in the function SoftTripletLoss:
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
mat_dist_ref = euclidean_dist(emb2, emb2)
dist_ap_ref = torch.gather(mat_dist_ref, 1, ap_idx.view(N,1).expand(N,N))[:,0]
dist_an_ref = torch.gather(mat_dist_ref, 1, an_idx.view(N,1).expand(N,N))[:,0]
triple_dist_ref = torch.stack((dist_ap_ref, dist_an_ref), dim=1)
triple_dist_ref = F.softmax(triple_dist_ref, dim=1).detach()
loss = (- triple_dist_ref * triple_dist).mean(0).sum()
return loss
I think it should be:
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
mat_dist_ref = euclidean_dist(emb2, emb2)
dist_ap_ref = torch.gather(mat_dist_ref, 1, ap_idx.view(N,1).expand(N,N))[:,0]
dist_an_ref = torch.gather(mat_dist_ref, 1, an_idx.view(N,1).expand(N,N))[:,0]
triple_dist_ref = torch.stack((dist_ap_ref, dist_an_ref), dim=1)
triple_dist_ref = F.softmax(triple_dist_ref, dim=1).detach()
loss = (- triple_dist_ref * triple_dist).mean(0).sum()
Your code is: -log{exp(||F(x_i)-F(x_{i,p})||) / [exp(||F(x_i)-F(x_{i,p})||) + exp(||F(x_i)-F(x_{i,n})||)]} - log{exp(||F(x_i)-F(x_{i,n})||) / [exp(||F(x_i)-F(x_{i,p})||) + exp(||F(x_i)-F(x_{i,n})||)]}, which is not consistent with the loss in your paper. My modified code is: -log{exp(||F(x_i)-F(x_{i,p})||) / [exp(||F(x_i)-F(x_{i,p})||) + exp(||F(x_i)-F(x_{i,n})||)]}, which is consistent with your paper. However, the performance of my modified code is worse than your original code. I can't understand this. I'm looking forward to your reply!
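To make my comparison concrete, here is a small numeric sketch of the two losses (made-up numbers; the positive-only variant is my own guess at the code for the modification I describe):
import torch
import torch.nn.functional as F

dist_ap = torch.tensor([0.8, 1.2])       # student anchor-positive distances
dist_an = torch.tensor([1.5, 0.9])       # student anchor-negative distances
dist_ap_ref = torch.tensor([0.7, 1.1])   # mean-teacher (reference) distances
dist_an_ref = torch.tensor([1.6, 1.0])

triple_dist = F.log_softmax(torch.stack((dist_ap, dist_an), dim=1), dim=1)
triple_dist_ref = F.softmax(torch.stack((dist_ap_ref, dist_an_ref), dim=1), dim=1).detach()

# Original: soft cross-entropy over both the positive and negative columns.
loss_full = (- triple_dist_ref * triple_dist).mean(0).sum()

# Modification described above: keep only the first (anchor-positive) column.
loss_pos_only = (- triple_dist_ref[:, 0] * triple_dist[:, 0]).mean()

print(loss_full, loss_pos_only)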