yuval-alaluf / restyle-encoder

Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement" (ICCV 2021) https://arxiv.org/abs/2104.02699
https://yuval-alaluf.github.io/restyle-encoder/
MIT License

The visual change/delta at the last iteration is larger than the preceding steps #4

Closed yaseryacoob closed 3 years ago

yaseryacoob commented 3 years ago

I observed that in many images the change between the last image and the one before it is significant (sometimes for the better, sometimes for the worse, as far as the human eye can tell). This is with the pretrained FFHQ ReStyle-pSp model. The two examples below are from CelebA-HQ (10.jpg and 1000.jpg). Why would this be the case? (I am using 20 iterations below.)

[attached images: 10.jpg and 1000.jpg]

thanks

yuval-alaluf commented 3 years ago

Please note that the rightmost image is the original input image. That is probably why you're seeing a big jump between the last and second-to-last images. Also note that fewer than 5 steps is typically enough for convergence; 20 is generally overkill.

yaseryacoob commented 3 years ago

I am actually looking at the two leftmost images (the final one and the one before it). If you compare the gaps between the three leftmost images, the delta between the last two is quite a bit larger than the delta between the two before them. As the iterations increase, I would have expected the last two images to be very close (because by then there is less residual to compensate for). These two examples are the easiest to spot, but I saw the same thing for other images (to a lesser degree, and even when I used 5 iterations).

Of course your work is great and I like the approach as you described in the paper.

yuval-alaluf commented 3 years ago

I think the order of the outputs may be a bit confusing. The outputs are shown in order from left to right (the leftmost image is the first output, followed by the remaining outputs). The input image is then shown on the far right.

That is why you're seeing a large delta between the leftmost image and the second image: the first output serves only as a rough estimate. After these two images, the deltas should be small.

Did I understand you correctly?
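
For reference, the per-step outputs come from the iterative refinement loop described in the paper. Here is a rough sketch of that loop, not the exact code in this repo; `encoder`, `decoder`, `avg_latent`, and `avg_image` are placeholders for the real modules:

```python
import torch

def restyle_inference_sketch(encoder, decoder, x, avg_latent, avg_image, n_iters=5):
    """Rough sketch of ReStyle's iterative refinement at inference time.

    At each step the encoder sees the input concatenated with the current
    reconstruction and predicts a residual (delta) that is added to the
    current latent code. The saved grids show these reconstructions from
    left (first, coarse estimate) to right, with the input appended last.
    Assumed shapes: x (B, 3, H, W), avg_image (1, 3, H, W), avg_latent (1, n_layers, 512).
    """
    y_hat = avg_image.expand(x.shape[0], -1, -1, -1)   # start from the generator's average image
    latent = avg_latent.expand(x.shape[0], -1, -1)     # start from the average W+ code
    outputs = []
    for _ in range(n_iters):
        enc_input = torch.cat([x, y_hat], dim=1)       # 6-channel input: image + current recon
        delta = encoder(enc_input)                     # predicted residual in W+
        latent = latent + delta
        y_hat = decoder(latent)                        # refined reconstruction for this step
        outputs.append(y_hat)
    return outputs, latent
```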

yaseryacoob commented 3 years ago

OK, I finally get it! It is my fault; it all makes sense now. But here is the next question while we are at it. Given that too many iterations are overkill (they literally destroy the reconstruction): (1) why not use a loss function to prevent further iterations, or (2) alternatively, why does the encoder generate "significant" residuals when it should generate a delta of zero?

Of course, one could do this as a post-processing step, which won't be as clean.

I am asking because I have already seen situations where things deteriorate at different rates.
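
To be concrete, the post-processing I had in mind is just a check on the per-step change between consecutive reconstructions, something along these lines (names and thresholds are made up):

```python
def pick_converged_step(outputs, tol=0.01):
    """Pick a 'good' step from the list of per-step reconstructions:
    stop once the mean absolute change between consecutive reconstructions
    drops below `tol`, or back off one step if the change starts growing again."""
    prev_delta = float("inf")
    for t in range(1, len(outputs)):
        delta = (outputs[t] - outputs[t - 1]).abs().mean().item()
        if delta < tol:           # converged: later steps add little
            return t
        if delta > prev_delta:    # change started growing again: keep the previous step
            return t - 1
        prev_delta = delta
    return len(outputs) - 1       # never converged or grew: keep the last step
```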

yuval-alaluf commented 3 years ago

I think the examples you posted do a good job of showing that the reconstructions can deteriorate with too many iterations. What you proposed in (1) seems interesting. Maybe you could incorporate a loss on the magnitude of the deltas? For example, if we add a constraint that after X steps (e.g., 5) we try to push the deltas to 0, maybe that could help minimize this deterioration. Regarding (2), it seems like most of the changes occur in the fine styles, but I wouldn't necessarily say the deltas should converge to 0, since the encoder wasn't trained to achieve deltas of 0. In any case, this is definitely an interesting phenomenon.
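
A sketch of what such a constraint could look like as an extra training term (the step threshold and weight are just placeholders, and this is not something we experimented with):

```python
import torch

def delta_magnitude_loss(deltas, start_step=5, weight=1.0):
    """Extra training loss: push the per-step latent deltas toward zero
    from `start_step` onward. `deltas` is the list of W+ offsets the
    encoder predicted at each refinement step for the current batch."""
    loss = torch.zeros((), device=deltas[0].device)
    for t, delta in enumerate(deltas):
        if t >= start_step:
            loss = loss + delta.pow(2).mean()
    return weight * loss
```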

yaseryacoob commented 3 years ago

I want to dig deeper into these options in the next few days to see what the new architecture may offer. BTW, there is both high- and low-frequency information distortion that develops over time. I have seen the high-frequency distortion develop in stylegan-ada optimization as well, so I wonder why it should happen in a well-structured W+ like yours (the stylegan-ada noise is a different story). One would have thought that fine control over W+, as you have, would avoid it.

I am brainstorming out loud, as I have been banging my head trying to improve the overall inversion quality in any number of ways, and I keep hitting a wall at the last 10% of quality and wonder...

Thanks again for sharing your code and thoughts. Keep it up.

yuval-alaluf commented 3 years ago

In my experience, the distortion we see develop over time (both here and in optimization) occurs because the inversion starts approaching poorer regions of StyleGAN's latent space. If we constrain the encoder to remain in better regions (e.g., regions closer to W), this may help mitigate the distortion. This idea of remaining close to W is exactly what we tried doing with our previous encoder, e4e (here's the repo). I plan on uploading the ReStyle+e4e models by tomorrow, and maybe with these models we'll see less of this distortion over time, though I expect it will still occur to some extent. In any case, this is definitely interesting to explore further.
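
To give a rough idea of what "staying close to W" means as a penalty, here is a loose sketch in the spirit of e4e's delta regularization (not its exact loss, and the names are placeholders):

```python
def wplus_spread_loss(latents, weight=1.0):
    """Penalize how far the individual W+ style codes drift from the first
    code, i.e. encourage the inversion to stay close to a single W code.
    `latents` is assumed to have shape (batch, n_layers, 512)."""
    base = latents[:, :1, :]                         # treat the first style code as the base W
    return weight * (latents[:, 1:, :] - base).pow(2).mean()
```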