zyainfal / One-Shot-Face-Swapping-on-Megapixels

One Shot Face Swapping on Megapixels.

Losses, learning rate and training protocol #14

Closed alessiabertugli closed 2 years ago

alessiabertugli commented 3 years ago

Hi,

First of all, congratulations on the work. I have some questions about the training details of your method. I am trying to reproduce your results with a similar network/training protocol, but I have some doubts. I obtained good reconstruction results from stage 1 (HieRFE training), but during stage 2 (HieRFE frozen, FTM training) the swap does not seem to work at all (I still obtain a copy of the original image).

Loss weights: In the paper, you report the weights of the losses used for stage 1 (HieRFE training) and stage 2 (FTM training), but I am not sure they are completely correct. For example, for stage 2 I think that the value 100,000 is associated with the landmark loss, not with L_norm. Also, the weight of 1 for the L2 loss seems low to me. Can you report all the weights for both stage 1 and stage 2, please?

Learning rate: A learning rate of 0.01 is reported for both stage 1 and stage 2. However, I find that with such a high learning rate the model diverges after a few epochs. The major difference from your implementation is that I am also training the generator at the same time. Do you think this could be the reason? Can you provide a detailed explanation of how you pre-trained the StyleGAN2 generator?

Training protocol: You said that you trained the HieRFE and FTM modules in sequential stages. What about jointly training HieRFE + FTM + the StyleGAN2 generator in stage 1, then freezing HieRFE and fine-tuning FTM + StyleGAN2 in stage 2, and finally freezing HieRFE + FTM and fine-tuning StyleGAN2 along with the discriminator in stage 3? Did you try this?

Thank you very much for your help.

zyainfal commented 3 years ago

Hi, I hope this could be helpful:

Loss weights: In stage 1, the loss weights should be correct. In stage 2, the loss weights should be 8 for reconstruction, 32 for L_norm, 32 for LPIPS, 24 for identity, and 100,000 for landmarks. This was my mistake in the paper.
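To make the bookkeeping concrete, here is a minimal sketch of how these stage-2 coefficients could be combined into a total loss. The helper function and the loss values passed in are placeholders; only the weights come from the list above.

```python
# Stage-2 (FTM training) loss coefficients quoted above; the loss
# values passed in stand in for the real loss terms.
W_REC, W_NORM, W_LPIPS, W_ID, W_LM = 8.0, 32.0, 32.0, 24.0, 100_000.0

def stage2_total_loss(l_rec, l_norm, l_lpips, l_id, l_lm):
    """Weighted sum of the stage-2 losses (hypothetical helper)."""
    return (W_REC * l_rec + W_NORM * l_norm + W_LPIPS * l_lpips
            + W_ID * l_id + W_LM * l_lm)

# The 100,000 weight means even a tiny landmark error matters:
# a landmark term of 1e-4 alone still contributes 10 to the total.
total = stage2_total_loss(0.5, 0.1, 0.2, 0.3, 1e-4)
```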

Learning rate: We tried fine-tuning the generator in stage 2, but it made no difference compared with a frozen generator. In your case, the incorrect loss weights may be the key problem.

Training protocol: We haven't tried the training protocol you mentioned, so I'm not sure whether it gives better results. However, you can train HieRFE and FTM together in stage 2 with HieRFE's learning rate set 10x lower.
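The 10x-lower learning rate for HieRFE can be expressed with per-module parameter groups, in the dict shape that torch.optim optimizers accept. This is a sketch; the function and variable names are made up for illustration.

```python
def make_param_groups(hierfe_params, ftm_params, base_lr=0.01):
    """Build optimizer parameter groups with HieRFE's learning rate
    10x lower than FTM's, as suggested above. The returned list can
    be passed directly to a torch.optim optimizer, e.g.
    torch.optim.Adam(make_param_groups(hierfe.parameters(),
                                       ftm.parameters())).
    """
    return [
        {"params": list(hierfe_params), "lr": base_lr / 10},
        {"params": list(ftm_params), "lr": base_lr},
    ]
```

With the paper's learning rate of 0.01 as the base, HieRFE's group ends up training at 0.001.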

alessiabertugli commented 3 years ago

Hi,

Thank you very much for your prompt response. I tried to train the model as you did: first I trained the StyleGAN2 generator, and then the other two modules with the losses you indicated. However, I still get the same result: the reconstructions are good, but no swapping occurs; the target and the swapped images look alike. From your experience, what could cause this problem? Thanks again for your help.

Alessia

zyainfal commented 3 years ago

I suppose you didn't use the loss functions in the right way. In stage 2, the way the losses are used differs from reconstruction training, as shown in eq.9 to eq.12. In short, X_s, X_t, and y_s2t are used to calculate the losses, not X and X_hat. You need to keep track of each source/target pair, and the two images can serve as the target and source of each other. For example:

    # Training loop
    faces = img.cuda()
    latents = HieRFE(faces)
    reconstructed_faces = StyleGAN2(latents)

    # Even-indexed samples act as sources, odd-indexed ones as targets,
    # and each pair is also swapped the other way round.
    source_latents = latents[0::2]
    target_latents = latents[1::2]
    source_target = torch.cat([source_latents, target_latents])
    target_source = torch.cat([target_latents, source_latents])
    swapped_latents = FTM(source_target, target_source)

    swapped_faces = StyleGAN2(swapped_latents)

    # Loss calculation according to eq.9 to eq.13, gathered by eq.14
    # (be clear about which image is the source and which is the target
    #  for each element in swapped_faces)
    # BP and update

If this is not the reason, you may try lowering the coefficient of the reconstruction loss and raising the coefficient of the ID loss.
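As a sanity check on the pairing, here is an index-only sketch of which original image ends up in the first (source_target) and second (target_source) FTM input for each swapped output, assuming the even/odd batch layout of the snippet above. This is an illustrative helper, not the authors' code.

```python
def swap_pairing(batch_size):
    """Return (first_arg_idx, second_arg_idx) per swapped output,
    mirroring latents[0::2] / latents[1::2] and the two torch.cat
    calls in the snippet above."""
    sources = list(range(0, batch_size, 2))   # latents[0::2]
    targets = list(range(1, batch_size, 2))   # latents[1::2]
    first_arg = sources + targets             # source_target
    second_arg = targets + sources            # target_source
    return list(zip(first_arg, second_arg))

# For a batch of 4 images, swapped output i pairs these original
# indices: [(0, 1), (2, 3), (1, 0), (3, 2)] -- each image appears
# once in each role, so the loss for each swapped face must look up
# the right partner.
```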

alessiabertugli commented 3 years ago

Thank you for your response. I think the problem is in how I trained the generator, so I want to use your StyleGAN2 checkpoint to verify whether my suspicion is correct. However, I spent 2 days trying to set up the environment without success. I also followed issue #6, but nothing works for me. I tried literally everything, but it always gives problems related to CUDA. The closest-to-working solution I found, starting from the procedure proposed by https://github.com/rosinality/stylegan2-pytorch, is the following:

  1. Create a conda env with Python 3.6 (since StyleGAN2 requires TensorFlow 1.15, which is not compatible with higher Python versions)
  2. conda install pytorch==1.5.1 torchvision==0.6.1 cudatoolkit=10.1 -c pytorch -c hcc
  3. apt install nvidia-cuda-toolkit gcc-7
  4. pip install tensorflow-gpu==1.15.0; pip install scipy==1.3.3; pip install Pillow==6.2.1; pip install requests==2.22.0
  5. python convert_weight.py --repo ../stylegan2 ../stylegan2-ffhq-config-f.pkl

I also tried with cudatoolkit=10.0 and tensorflow-gpu==1.14.0. With this procedure, the error shown is "RuntimeError: No GPU devices found". Could you please write in detail the steps needed to create the environment that makes StyleGAN2 work, starting from an empty environment? Thank you very much for your time.

A.

zyainfal commented 3 years ago

Sorry, I cannot remember the exact details, but I still have a few things for you:

  1. The Linux system I used is CentOS 7 (since you used apt install, I suppose yours is Ubuntu).
  2. Do not use apt install nvidia-cuda-toolkit gcc-7; install the GPU driver and CUDA from the NVIDIA website after checking version compatibility.
  3. Install TensorFlow first and then PyTorch. Sometimes you will find one of them missing even though you installed it; in that case, reinstall the missing one and it should work.
  4. The last thing is the GCC version; if the installed one is not sufficient (it will be reported as an error), you can find the version you need.
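Condensed from the points above, a hypothetical setup recipe (the env name and version pins are assumptions taken from the thread; check NVIDIA's driver/CUDA compatibility tables before installing anything):

```shell
# Install the GPU driver and CUDA from NVIDIA's website (not via apt),
# matching your driver version. Then, in a fresh conda env:
conda create -n faceswap python=3.6 -y
conda activate faceswap

# TensorFlow first...
pip install tensorflow-gpu==1.15.0

# ...then PyTorch; if one install clobbers the other, reinstall the missing one.
pip install torch==1.5.1 torchvision==0.6.1

# Verify both frameworks see the GPU before running convert_weight.py.
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
python -c "import torch; print(torch.cuda.is_available())"
```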

usmancheema89 commented 1 year ago

@alessiabertugli is there any chance you will be willing to share the training code?