yuval-alaluf / restyle-encoder

Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" (ICCV 2021) https://arxiv.org/abs/2104.02699
https://yuval-alaluf.github.io/restyle-encoder/
MIT License

Question about Appendix D "Analyzing the Toonify Latent Space" #41

Closed jscarlson closed 3 years ago

jscarlson commented 3 years ago

Would you be able to provide a few additional details about how the toonify StyleGAN2 generator was fine-tuned to get the output in Figure 17, e.g., did the latent spaces of both generators seem "aligned" for all fine-tuning checkpoints you checked? Would you be able to share the code used for creating Figure 17? Thanks!

yuval-alaluf commented 3 years ago

For the toonification experiments, I used the pre-trained generator from Justin Pinkney and Doron Adler. They wrote a short paper describing how they performed the fine-tuning: https://arxiv.org/abs/2010.05334 Because they use a particular fine-tuning trick called layer swapping, the two latent spaces remain aligned: sampling a latent in the FFHQ latent space yields a similar-looking image in the toonify latent space. I don't have the code for generating Figure 17, but it more or less boils down to this idea:

# sample a single latent code shared by both generators
latent = generate_random_latent()
# the same latent produces corresponding images in each domain
ffhq_image = ffhq_generator(latent)
toon_image = toon_generator(latent)

Hope this helps
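For readers unfamiliar with the layer-swapping trick from Pinkney & Adler's paper, here is a minimal sketch of the idea: after fine-tuning FFHQ on cartoons, the blended "toonify" generator takes its low-resolution (coarse structure) layers from the fine-tuned model and its high-resolution (fine texture) layers from the base FFHQ model. The key naming and `swap_layers` helper below are illustrative placeholders, not the actual StyleGAN2 state-dict layout:

```python
def swap_layers(ffhq_weights, toon_weights, swap_resolution=32):
    """Blend two aligned generator state dicts by resolution.

    Layers at resolutions <= swap_resolution come from the fine-tuned
    toon model (coarse structure); the rest come from the base FFHQ
    model (fine texture). Assumes both dicts share identical keys.
    """
    blended = {}
    for name, ffhq_w in ffhq_weights.items():
        # assume keys encode their resolution, e.g. "synthesis.b16.conv"
        res = int(name.split(".")[1].lstrip("b"))
        blended[name] = toon_weights[name] if res <= swap_resolution else ffhq_w
    return blended
```

Because every layer in the blended model comes from one of two generators that share a common ancestor, a latent code remains meaningful in both, which is exactly why the sampling snippet above produces corresponding image pairs.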

jscarlson commented 3 years ago

Okay, thank you! Didn't realize that the fine-tuning had happened via "layer swapping". Does this mean in general one shouldn't expect encoder bootstrapping to work unless the other generator has been fine-tuned via layer swapping?

yuval-alaluf commented 3 years ago

Good question. Encoder bootstrapping works best when the two domains are similar, which is the case when you do layer swapping. In general, the technique will only work if the two latent spaces are aligned. This could also be achieved with other fine-tuning methods, for example StyleGAN-NADA (although I haven't tried this myself). The bootstrapping wouldn't work for something like faces to animals, because a latent in one generator doesn't correspond to a similar image in the other generator.
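To make the bootstrapping idea above concrete, here is a hedged sketch (function names are placeholders, not the repo's actual API): the FFHQ encoder supplies the initial inversion of the real image, and the toon encoder then performs ReStyle-style iterative residual refinement starting from that well-aligned latent instead of the average latent.

```python
def bootstrap_invert(image, ffhq_encoder, toon_encoder, toon_generator, n_iters=5):
    """Invert `image` into the toon latent space via encoder bootstrapping.

    Assumes both encoders take (target image, current reconstruction) and
    predict a latent (FFHQ step) or a residual latent (toon steps), and that
    the two generators' latent spaces are aligned (e.g. via layer swapping).
    """
    # step 0: invert with the FFHQ encoder to get a good starting latent
    latent = ffhq_encoder(image, None)
    for _ in range(n_iters):
        # generate the current toon reconstruction from the running latent
        y_hat = toon_generator(latent)
        # predict a residual correction and accumulate it into the latent
        latent = latent + toon_encoder(image, y_hat)
    return latent
```

If the two latent spaces were not aligned, the FFHQ initialization would point the toon encoder at an unrelated region of its latent space, and the refinement steps would have to undo the initialization rather than benefit from it.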