zyainfal / One-Shot-Face-Swapping-on-Megapixels

One Shot Face Swapping on Megapixels.

About landmark loss #8

Closed NNNNAI closed 3 years ago

NNNNAI commented 3 years ago

Thanks for sharing your work! Could you explain in detail how the landmark loss is computed? Is the landmark represented as a heatmap of shape [h, w, 1], and is the L2 loss then calculated between the input and the generated image? Many thanks!

zyainfal commented 3 years ago

You are almost right, except that we use a heatmap of shape [h, w, c], where c is the number of eye & mouth points. Then the L2 loss is calculated between the heatmaps of the input and the generated image.
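For clarity, here is a minimal sketch of such a heatmap-based landmark loss, assuming a pretrained landmark network (landmark_net below is a hypothetical stand-in for the modified HRNet) that returns heatmaps of shape [N, c, h, w] for the c eye & mouth channels:

 import torch
 import torch.nn.functional as F

 def landmark_heatmap_loss(landmark_net, real_img, fake_img):
     # Target heatmaps come from the real image; no gradient is needed here.
     with torch.no_grad():
         hm_real = landmark_net(real_img)
     # Heatmaps of the generated image; gradients flow back into the generator.
     hm_fake = landmark_net(fake_img)
     # L2 (MSE) loss between the two sets of heatmaps.
     return F.mse_loss(hm_fake, hm_real)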

NNNNAI commented 3 years ago

Thanks for your quick reply. Could you tell me the specific value of c here? I saw that High-Resolution Networks (HRNets) for facial landmark detection support several landmark configurations.

zyainfal commented 3 years ago

We used the COFW version, which has 98 landmarks. (Because I modified their source code, I'm not entirely sure which version I used, but the number of landmarks is 98.) And we only use the last 38 landmarks (c = 38).

NNNNAI commented 3 years ago

I see~. BTW, have you tried using all 98 points for the landmark loss? Would the effect be better or worse? Thanks again~.

zyainfal commented 3 years ago

We have. Supervision on all 98 landmarks changes the facial shape, such as the eyebrow and cheek shape, which makes the swapped face look like the target face, so we only use the eye/mouth landmarks for training.

NNNNAI commented 3 years ago

Thanks for your help~. I also noticed that the HieRFE encoder takes 256×256 images as input. Does that mean the 1024×1024 input images need to be resized to 256×256 during training?

zyainfal commented 3 years ago

Yes, and this requires much less GPU memory, which allows a larger batch size.
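For reference, a minimal sketch of that resizing step (an illustration, not necessarily the exact preprocessing used by the authors):

 import torch.nn.functional as F

 # images_1024: [N, 3, 1024, 1024] batch of aligned faces
 images_256 = F.interpolate(images_1024, size=(256, 256),
                            mode='bilinear', align_corners=False)
 lats, structs = HieRFE(images_256)   # the encoder consumes the downsampled batch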

NNNNAI commented 3 years ago

Sorry to disturb you again. I am trying to train the first step of megafaceswap (reconstruction). I followed all the losses listed in the paper except the landmark loss, because I found that the HRNet landmark detector can only detect landmarks on images with bbox labels, and the reconstruction output does not have these. I have trained for 28 epochs on 4 2080Ti GPUs with a total batch size of 8. The result is shown below; it looks terrible, hhhh. Does the lack of the landmark loss cause such an output? Would you be so kind as to tell me how you get the bbox labels of the reconstruction output for computing the landmark loss? Many thanks~, have a nice day.

[image: epoch_28_step_8001_fake]

zyainfal commented 3 years ago

The lack of the landmark loss won't cause terrible results; you may need more training epochs or more training data (e.g., images generated by StyleGAN version 1). If you still have trouble training HieRFE, the pSp encoder repo is a good baseline for you. Please note that we use the constant input c1 of StyleGAN2 while they don't.

For the usage of HRNet, you can modify their code to drop the face-crop steps so that you don't need the bbox labels, as the faces in FFHQ and CelebA-HQ have already been aligned.
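A minimal sketch of that idea, assuming the face-crop steps in HRNet have been removed so the landmark network (hrnet below, hypothetical) takes the full aligned face directly:

 import torch.nn.functional as F

 # generated_1024: [N, 3, 1024, 1024] output of the generator.
 # The faces are already aligned, so no bbox detection or cropping is needed;
 # just resize to the landmark network's expected input resolution.
 ldm_input = F.interpolate(generated_1024, size=(256, 256),
                           mode='bilinear', align_corners=False)
 heatmaps = hrnet(ldm_input)   # e.g. [N, 98, 64, 64] landmark heatmaps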

NNNNAI commented 3 years ago

Thanks for your help!! One more question about the training process. In the paper you say that HieRFE and the FTM are sequentially trained for seventeen epochs in total on FFHQ and the auxiliary data. Is this correct: when training the FTM, the HieRFE part is fixed and only the FTM participates in training? Thanks for your help, have a nice day~.

zyainfal commented 3 years ago

Yes, you are correct. You can also train them together with a learning rate for HieRFE that is 10x lower. Depending on your GPU memory, both training processes should work well.
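If you do train them together, a minimal sketch of the 10x lower learning rate for HieRFE using optimizer parameter groups (the base learning rate below is a placeholder, not a value from the paper):

 import torch

 base_lr = 1e-4   # placeholder value, not from the paper
 optimizer = torch.optim.Adam([
     {'params': swapper.parameters(), 'lr': base_lr},         # FTM / swapper
     {'params': hierfe.parameters(),  'lr': base_lr * 0.1},   # HieRFE, 10x lower
 ])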

NNNNAI commented 3 years ago

Ok, I see. Let me try it. Many thanks~.

NNNNAI commented 3 years ago

I have been trying to train the FTM. I used MSE to calculate the L_norm loss, and the value of L_norm itself is about 50. I set the norm weight to 100,000 according to your paper, so norm_loss * norm_weight is about 5,000,000. Is this normal? The values of the other losses, such as the id loss and rec_loss, are too small in comparison and seem to be ineffective. When I calculate the norm value, do I need to apply an L2 normalization to both Lshigh and ls2t before calculating the norm loss? Many thanks~.

zyainfal commented 3 years ago

OMG, my apologies. The loss weights should be 8 for reconstruction, 32 for L_norm, 32 for LPIPS, 24 for identity, and 100,000 for landmarks.

zyainfal commented 3 years ago

And no L2 normalization is needed.
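Putting the corrected weights together, a minimal sketch of the total loss (the individual loss terms are assumed to be computed already; lats_swap and lats_high stand for the latents discussed above, and the norm term is plain MSE with no L2 normalization):

 import torch.nn.functional as F

 def total_loss(loss_rec, loss_lpips, loss_id, loss_ldm, lats_swap, lats_high):
     # Corrected loss weights from the reply above.
     w_rec, w_norm, w_lpips, w_id, w_ldm = 8, 32, 32, 24, 100_000
     # Norm loss: plain MSE between the latent codes, no l2-normalization applied.
     loss_norm = F.mse_loss(lats_swap, lats_high)
     return (w_rec   * loss_rec
           + w_norm  * loss_norm
           + w_lpips * loss_lpips
           + w_id    * loss_id
           + w_ldm   * loss_ldm)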

HardboiledHu commented 3 years ago

> We used the COFW version, which has 98 landmarks. (Because I modified their source code, I'm not entirely sure which version I used, but the number of landmarks is 98.) And we only use the last 38 landmarks (c = 38).

I checked the code of HRNet for landmark detection and found that the HRNet version is probably WFLW, because COFW has 29 landmarks while WFLW has 98. So I want to confirm again: which version is it?

zyainfal commented 3 years ago

Well, then use the WFLW version as long as it includes eye & mouth landmark predictions.

seokg commented 3 years ago

Hi, I was also wondering about the implementation details on the landmark prediction loss.

When predicting the landmark coordinates of the given input (using the decode_preds function in the HRNet repo), the gradient cannot be propagated back to the model. Is there any other way to pass the gradient to the network?

zyainfal commented 3 years ago

You don't need the decode_preds function. The landmark loss is calculated on the feature maps (heatmaps), not the predicted coordinates.

seokg commented 3 years ago

Thanks for the clarification! @zyainfal

RRdmlearning commented 2 years ago

I use HRNet and get landmark heatmaps of shape [98, 64, 64]. How do I extract the eye and mouth parts from them?

zyainfal commented 2 years ago

According to the WFLW dataset, you can find the indices of the channels among the 98 that predict the eye and mouth landmarks. Then you can extract them with the torch.index_select() function provided by PyTorch.
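A minimal sketch of that extraction, assuming the standard WFLW layout where the last 38 channels (indices 60-97: eyes, pupils, and mouth) are the ones of interest; please verify the indices against the WFLW definition:

 import torch

 # heatmaps: [98, 64, 64] heatmaps from HRNet (add a batch dimension if needed)
 eye_mouth_idx = torch.arange(60, 98)                                  # 38 channel indices
 eye_mouth_heatmaps = torch.index_select(heatmaps, 0, eye_mouth_idx)   # [38, 64, 64]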

RRdmlearning commented 2 years ago

Thanks for your prompt reply. I extracted the 38 landmark channels from the WFLW model and used the ArcFace loss, reconstruction loss, and LPIPS loss to fine-tune the first stage.

I have observed that the loss keeps decreasing, but the reconstruction result is not as good as before fine-tuning. Could you give me any suggestions?

Before fine-tuning (just using your weights): [image]

After fine-tuning: [image]

zyainfal commented 2 years ago

It is hard to say. Could you please show me your training code? And what is your training data?

If you followed the hyper-parameter settings and training data introduced in our paper: do you apply learning rate decay during training? If yes, it may lead to over-fitting, so do not use it.

If you changed the hyper-parameter settings, you may need to lower the weight of the landmark loss, keep the balance between the reconstruction and LPIPS losses, and raise the ArcFace (id) loss.

If you used less training data, then use more.

BTW, the open-sourced pSp encoder is a good baseline for you.

RRdmlearning commented 2 years ago

I found the answer. It was caused by insufficient training. In fact, it showed a good reconstruction effect on some other pictures.

RRdmlearning commented 2 years ago

I have doubts about the second stage of training. The reconstruction loss is used in the second stage, but the gradient generated by the reconstruction loss seems to relate only to HieRFE and not to the swapper module. In other words, if there is no need to fine-tune HieRFE in the second stage, should the reconstruction loss be removed?

zyainfal commented 2 years ago

When source = target, it works.

RRdmlearning commented 2 years ago

Thank you for your quick reply. Could you explain in more detail?

 # input is source_256, target_256
 lats_source, struct_source = HieRFE(source_256)
 lats_target, struct_target = HieRFE(target_256)

 source_gan = Generator(lats_source, struct_source)
 target_gan = Generator(lats_target, struct_target)
 rec_loss = rec_loss(source_256, target_256, source_gan, target_gan)

In the second stage of training, the Generator and HieRFE are frozen, so this reconstruction loss seems useless. How will it work when source = target?

zyainfal commented 2 years ago

In your case, rec_loss can be safely dropped.

What I mean is:

# Input is source_256, target_256
# [*, *] means concatenated tensor
# Encode
 [lats_source, lats_target],  [struct_source, struct_target] = HieRFE([source_256, target_256])

# swapper(s, t), s --> source, t --> target
 lats_swap = swapper([lats_source, lats_target], [lats_target, lats_target])

# Generate
 [s2t, t2t] = Generator(lats_swap, [struct_target, struct_target])

# Losses
 rec_loss = rec_loss(t2t, target_256)
 ...

For a more general case, source and target can be replaced by each other, say:

 source_target_source_target = torch.cat([lats_source, lats_target, lats_source, lats_target])
 target_source_source_target = torch.cat([lats_target, lats_source, lats_source, lats_target])
 lats_swap = swapper(source_target_source_target,
                     target_source_source_target)

# here you get s2t, t2s, s2s, t2t

RRdmlearning commented 2 years ago

You mean the true reconstruction loss (second stage) is like this?

#input is source_256, target_256
 lats_source, struct_source = HieRFE(source_256)
 lats_target, struct_target = HieRFE(target_256)

 swapped_lats_source = self.swapper(lats_source, lats_source)
 swapped_lats_target = self.swapper(lats_target, lats_target)
 swapped_lats = self.swapper(lats_source, lats_target)

 source_gan = Generator(swapped_lats_source, struct_source)
 target_gan = Generator(swapped_lats_target, struct_target)
 rec_loss = rec_loss(source_256, target_256, source_gan, target_gan)

zyainfal commented 2 years ago

Yes. FYI, in our paper, we used

 #input is source_256, target_256
 lats_source, struct_source = HieRFE(source_256)
 lats_target, struct_target = HieRFE(target_256)

 source_gan = Generator(lats_source, struct_source)
 target_gan = Generator(lats_target, struct_target)
 rec_loss = rec_loss(source_256, target_256, source_gan, target_gan)

This gives better generalization of HieRFE (maybe over a longer training period).

For now, I believe

#input is source_256, target_256
 lats_source, struct_source = HieRFE(source_256)
 lats_target, struct_target = HieRFE(target_256)

 swapped_lats_source = self.swapper(lats_source, lats_source)
 swapped_lats_target = self.swapper(lats_target, lats_target)
 swapped_lats = self.swapper(lats_source, lats_target)

 source_gan = Generator(swapped_lats_source, struct_source)
 target_gan = Generator(swapped_lats_target, struct_target)
 rec_loss = rec_loss(source_256, target_256, source_gan, target_gan)

could give better results.

Both training strategies should work well.

RRdmlearning commented 2 years ago

Thank you very much for your quick reply. This solved my doubts.

zyainfal commented 2 years ago

Oh wait, I just noticed that in the second stage, HieRFE is trained by both the swapping losses and the reconstruction loss: the swapping losses make it predict better latent codes for face swapping, while the reconstruction loss makes it keep its fundamental function. This multi-target training gives HieRFE better generalization.

RRdmlearning commented 2 years ago

 # input is source_256, target_256
 lats_source, struct_source = HieRFE(source_256)
 lats_target, struct_target = HieRFE(target_256)

 source_gan = Generator(lats_source, struct_source)
 target_gan = Generator(lats_target, struct_target)
 rec_loss = rec_loss(source_256, target_256, source_gan, target_gan)

Yes, but when HieRFE is frozen, the reconstruction loss is useless in the second stage. I think that if the reconstruction loss does not go through the swapper, then the swapper part will not be affected by it.

zyainfal commented 2 years ago

Yes, you are correct : )