yxgeee / MMT

[ICLR-2020] Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification.
https://yxgeee.github.io/projects/mmt
MIT License

Mismatch between the loss function and the corresponding code #18

Closed yjh576 closed 4 years ago

yjh576 commented 4 years ago

There are several lines of code in the `SoftTripletLoss` function:

```python
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
mat_dist_ref = euclidean_dist(emb2, emb2)
dist_ap_ref = torch.gather(mat_dist_ref, 1, ap_idx.view(N,1).expand(N,N))[:,0]
dist_an_ref = torch.gather(mat_dist_ref, 1, an_idx.view(N,1).expand(N,N))[:,0]
triple_dist_ref = torch.stack((dist_ap_ref, dist_an_ref), dim=1)
triple_dist_ref = F.softmax(triple_dist_ref, dim=1).detach()
loss = (- triple_dist_ref * triple_dist).mean(0).sum()
return loss
```

I think it should be:

```python
triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
mat_dist_ref = euclidean_dist(emb2, emb2)
dist_ap_ref = torch.gather(mat_dist_ref, 1, ap_idx.view(N,1).expand(N,N))[:,0]
dist_an_ref = torch.gather(mat_dist_ref, 1, an_idx.view(N,1).expand(N,N))[:,0]
triple_dist_ref = torch.stack((dist_ap_ref, dist_an_ref), dim=1)
triple_dist_ref = F.softmax(triple_dist_ref, dim=1).detach()

# instead of: loss = (- triple_dist_ref * triple_dist).mean(0).sum()
loss = (- triple_dist_ref[:,0] * triple_dist[:,0]).mean()
return loss
```

Your code computes

    -log{ exp(F(x_i)·F(x_{i,p})) / [ exp(F(x_i)·F(x_{i,p})) + exp(F(x_i)·F(x_{i,n})) ] } - log{ exp(F(x_i)·F(x_{i,n})) / [ exp(F(x_i)·F(x_{i,p})) + exp(F(x_i)·F(x_{i,n})) ] },

which is not consistent with the loss in your paper. My modified code computes

    -log{ exp(F(x_i)·F(x_{i,p})) / [ exp(F(x_i)·F(x_{i,p})) + exp(F(x_i)·F(x_{i,n})) ] },

which is consistent with your paper. However, the performance of my modified code is worse than your original code. I can't understand this. I'm looking forward to your reply!

yxgeee commented 4 years ago

Hi,

The code you mentioned is exactly consistent with the loss function Eq. (8) in our paper (https://openreview.net/pdf?id=rJlnOhVYPS). I guess you have mistaken Eq. (7) for our loss function, but in fact Eq. (7) serves Eq. (8). Please check it again.

Our loss function is a binary cross-entropy loss with soft labels. For example, the conventional binary cross-entropy loss with hard labels is -q log(p) - (1-q) log(1-p), where q is either 0 or 1 and p is within [0,1]. In our loss function, q and p are both within [0,1].
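As a quick numerical check (a minimal sketch with made-up distances, not code from the repo), the soft loss in the quoted snippet, `(- triple_dist_ref * triple_dist).mean(0).sum()`, is exactly such a binary cross-entropy with a soft label q in [0,1]:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N = 6
dist = torch.rand(N, 2) * 3       # student: columns are [d_ap, d_an] (made up)
dist_ref = torch.rand(N, 2) * 3   # mean-teacher (reference) distances (made up)

log_p = F.log_softmax(dist, dim=1)        # triple_dist in the code
q = F.softmax(dist_ref, dim=1).detach()   # triple_dist_ref in the code

loss_code = (- q * log_p).mean(0).sum()   # the loss as written in the repo

# the same value written as BCE with a soft label q[:, 1] and prediction p
p = log_p.exp()[:, 1]
loss_bce = (- q[:, 1] * torch.log(p) - (1 - q[:, 1]) * torch.log(1 - p)).mean()

print(torch.allclose(loss_code, loss_bce))  # True
```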

yxgeee commented 4 years ago

Your modification could be seen as -q log(p), losing half of the regularization.

yjh576 commented 4 years ago

Thank you for your quick reply. Sorry, I made a mistake. You are right that the conventional binary cross-entropy loss with hard labels is -q log(p) - (1-q) log(1-p). I understand it now. Thank you.

Your work is good. I also have a question. When I run your code, I find the following lines are very important for performance:

```python
# overwrite the first num_clusters rows of each classifier's weight matrix
# with the L2-normalized cluster centers after clustering
model_1.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_2.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_1_ema.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
model_2_ema.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())
```

Why? What does the above code achieve? Removing it leads to clearly worse performance.

yjh576 commented 4 years ago

The loss function Eq. (7) is -q log(p) - (1-q) log(1-p), where q is either 0 or 1 and p is within [0,1]. In fact, it reduces to -log(p) because q = 1. But I find the code is:

```python
self.criterion_tri = SoftTripletLoss(margin=0.0).cuda()

triple_dist = torch.stack((dist_ap, dist_an), dim=1)
triple_dist = F.log_softmax(triple_dist, dim=1)
loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean()
```

I think the above code computes -log(1-p), which is not -log(p). I can't understand it. I'm looking forward to your reply!

yxgeee commented 4 years ago

> Your work is good. I also have a question. When I run your code, I find the following lines are very important for performance: `model_1.module.classifier.weight.data[:args.num_clusters].copy_(F.normalize(cluster_centers, dim=1).float().cuda())` (and likewise for `model_2`, `model_1_ema`, and `model_2_ema`). Why? What does the above code achieve? Removing it leads to clearly worse performance.

Please refer to https://github.com/yxgeee/MMT/issues/16

yxgeee commented 4 years ago

> The loss function Eq. (7) is -q log(p) - (1-q) log(1-p), where q is either 0 or 1. In fact, it reduces to -log(p) because q = 1. But the code uses `SoftTripletLoss(margin=0.0)` with `loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean()`. I think the above code computes -log(1-p), which is not -log(p).

When `margin=0.0`, `loss = (- self.margin * triple_dist[:,0] - (1 - self.margin) * triple_dist[:,1]).mean()` can be thought of as `loss = (- triple_dist[:,1]).mean()`. The value of `triple_dist[:,1]` is exactly the same as Eq. (7) in the paper. Please check.
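To make the `margin=0.0` case concrete, a small sketch (made-up distances, my own check): the hard-margin branch is just -log of the Eq. (7) softmax-triplet term, i.e. binary cross-entropy against the hard label 1:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dist_ap = torch.rand(6) * 3   # made-up anchor-positive distances
dist_an = torch.rand(6) * 3   # made-up anchor-negative distances
margin = 0.0                  # as in SoftTripletLoss(margin=0.0)

triple_dist = F.log_softmax(torch.stack((dist_ap, dist_an), dim=1), dim=1)
loss = (- margin * triple_dist[:, 0] - (1 - margin) * triple_dist[:, 1]).mean()

# Eq. (7)-style term: exp(d_an) / (exp(d_ap) + exp(d_an)); hard label 1 -> -log(T)
T = torch.exp(dist_an) / (torch.exp(dist_ap) + torch.exp(dist_an))
print(torch.allclose(loss, (- torch.log(T)).mean()))  # True
```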

yxgeee commented 4 years ago

`loss = (- triple_dist[:,1]).mean()` means the Euclidean distance between anchor and negative should be larger than the Euclidean distance between anchor and positive.

yxgeee commented 4 years ago

-q log(p) - (1-q) log(1-p) is only a simplified formulation of the BCE loss. If you want to align this function with our Eq. (6), you should use q = 1 - self.margin and p = triple_dist[:,1], where 1 - p = triple_dist[:,0].
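A short check of that substitution (my own sketch, made-up distances): with q = 1 - self.margin and p = softmax([d_ap, d_an])[:, 1] (so 1 - p is column 0), the BCE form -q log(p) - (1-q) log(1-p) reproduces the code's loss line for any margin:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(1)
dist_ap, dist_an = torch.rand(6) * 3, torch.rand(6) * 3  # made-up distances
margin = 0.3                                             # any value in [0, 1]

triple_dist = F.log_softmax(torch.stack((dist_ap, dist_an), dim=1), dim=1)
loss_code = (- margin * triple_dist[:, 0] - (1 - margin) * triple_dist[:, 1]).mean()

q = 1 - margin
p = triple_dist.exp()[:, 1]   # softmax probability; 1 - p corresponds to column 0
loss_bce = (- q * torch.log(p) - (1 - q) * torch.log(1 - p)).mean()

print(torch.allclose(loss_code, loss_bce))  # True
```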

yjh576 commented 4 years ago

I think that `loss = (- triple_dist[:,1]).mean()` refers to the similarity between anchor and negative, and that this similarity should be smaller than the similarity between anchor and positive.

I think that aligning this function with your Eq. (6) should give `loss = (- triple_dist[:,0]).mean()`: we want the similarity between anchor and positive to become larger, because according to the code `hard_p` is the largest value over the anchor's positives, which I read as the anchor-positive similarity. I have some confusion about this. Please refer to this code:

```python
sorted_mat_distance, positive_indices = torch.sort(mat_distance + (-9999999.) * (1 - mat_similarity), dim=1, descending=True)
hard_p = sorted_mat_distance[:, 0]
hard_p_indice = positive_indices[:, 0]
sorted_mat_distance, negative_indices = torch.sort(mat_distance + (9999999.) * (mat_similarity), dim=1, descending=False)
hard_n = sorted_mat_distance[:, 0]
hard_n_indice = negative_indices[:, 0]
```

yxgeee commented 4 years ago

Please note that we use Euclidean distance instead of cosine similarity in our code to measure the feature similarity (https://github.com/yxgeee/MMT/blob/master/mmt/loss/triplet.py#L78). The Euclidean distance between anchor and negative should be larger than the Euclidean distance between anchor and positive.

yxgeee commented 4 years ago

Also, in Equation (7) of our paper, we use the root of the Euclidean distance, which is also called the L2-norm distance.

yxgeee commented 4 years ago

Larger Euclidean distance indicates smaller similarity, and vice versa.
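For unit-normalized features (an assumption in this small sketch of mine, not something the thread states about the repo), the relation is exact: ||a - b||^2 = 2 - 2 cos(a, b), so a larger Euclidean distance always corresponds to a smaller cosine similarity:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = F.normalize(torch.randn(4, 128), dim=1)  # hypothetical L2-normalized features
b = F.normalize(torch.randn(4, 128), dim=1)

sq_dist = ((a - b) ** 2).sum(dim=1)   # squared Euclidean distance
cos_sim = (a * b).sum(dim=1)          # cosine similarity

print(torch.allclose(sq_dist, 2 - 2 * cos_sim, atol=1e-6))  # True
```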

yjh576 commented 4 years ago

Thank you. I still have some confusion about Equation (7) and Equation (2): I think they serve the same function, which is why I made the suggestion above.

yjh576 commented 4 years ago

Sorry, I mean Equation (6) and Equation (2).

yxgeee commented 4 years ago

Yes, Eq. (2) and Eq. (6) serve the same function. Eq. (6) is just a hard-version softmax-triplet loss, which is also supervised by a hard label 0/1. The CORE idea of our paper is Eq. (8), which is a soft-version softmax-triplet loss that supports mean-teaching. We introduce Eq. (6) because the conventional hard-version triplet loss Eq. (2) does not have a soft-version variant to support mean-teaching.
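Condensing the thread into one sketch (my own, with made-up distance pairs): Eq. (6) supervises the student's softmax-triplet term with the hard label 1, while Eq. (8) supervises the same term with the mean-teacher's detached soft prediction:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# made-up [d_ap, d_an] distance pairs for the student and the mean-teacher
dist_student = torch.rand(6, 2) * 3
dist_teacher = torch.rand(6, 2) * 3

log_p = F.log_softmax(dist_student, dim=1)   # student's log softmax-triplet

# Eq. (6): hard-version softmax-triplet loss, target fixed to 1
loss_hard = (- log_p[:, 1]).mean()

# Eq. (8): soft-version, target is the mean-teacher's softmax-triplet (detached)
q = F.softmax(dist_teacher, dim=1).detach()
loss_soft = (- q * log_p).mean(0).sum()

print(loss_hard.item(), loss_soft.item())
```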