yuhaoliu7456 / Diff-Plugin

[CVPR 2024] Official code release of our paper "Diff-Plugin: Revitalizing Details for Diffusion-based Low-level tasks"

Paper in details #6

Closed JustinKai0527 closed 1 month ago

JustinKai0527 commented 6 months ago

Very interesting work!!!

Q1. In Section 3.3 (Task-Plugin), you want to obtain a task-specific visual guidance prior and a spatial prior to learn what to remove and what to preserve in the image. But in the training loss, the initial latent is encoded from the ground-truth image, which is a "clean" image. How can the TPB & SCB learn to clean up an unclean image?

Q2. I want to check my understanding of the overall pipeline. I give an image and a text prompt saying I want to do desnowing; the image is fed to the Task-Plugin to get the two priors, and the Plugin-Selector computes the similarity with the text. The plugin whose similarity exceeds the threshold is chosen, the task-specific visual guidance prior is fed to cross-attention, and the spatial prior to the final stage of the decoder? Thx!!!

yuhaoliu7456 commented 6 months ago

Thanks for your interest.

For Q1, during diffusion training we regard the GT as $x_{0}$ and add noise to different extents. Since the input of our Task-Plugin is a degraded image, the whole diffusion training is actually a conditional removal process: the model has to learn the relationship between the input degraded image and the noisy GT.
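For concreteness, a minimal sketch of one such training step, assuming a diffusers-style VAE and scheduler; `task_plugin` and the conditioning keywords on `unet` are illustrative placeholders, not our released API:

```python
import torch
import torch.nn.functional as F

def training_step(vae, unet, task_plugin, scheduler, degraded, gt):
    # Encode the clean GT into the latent space: this is x_0.
    z0 = vae.encode(gt).latent_dist.sample() * 0.18215

    # Sample a timestep and noise the clean latent to a chosen extent.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (z0.shape[0],), device=z0.device)
    noise = torch.randn_like(z0)
    zt = scheduler.add_noise(z0, noise, t)

    # The Task-Plugin sees only the *degraded* image and produces the
    # visual guidance prior and the spatial prior.
    visual_prior, spatial_prior = task_plugin(degraded)

    # To predict the noise on the clean latent from degraded-image
    # conditioning, the model must learn the degraded->clean mapping.
    pred = unet(zt, t, visual_prior=visual_prior,
                spatial_prior=spatial_prior)
    return F.mse_loss(pred, noise)
```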

For Q2, yes, your understanding is exactly right.
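Schematically, that inference flow might look like the sketch below; the function names, `text_emb` attribute, and the threshold value are illustrative placeholders, not the released code:

```python
import torch
import torch.nn.functional as F

def select_plugin(prompt_emb, plugin_text_embs, threshold=0.5):
    # Plugin-Selector: cosine similarity between the user's prompt
    # embedding (D,) and each plugin's text embedding (N, D).
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0),
                               plugin_text_embs, dim=-1)
    best = int(sims.argmax())
    return best if sims[best] > threshold else None

def run(image, prompt, plugins, encode_text, diffusion_sample):
    idx = select_plugin(encode_text(prompt),
                        torch.stack([p.text_emb for p in plugins]))
    if idx is None:
        raise ValueError("no plugin matches the prompt")
    # The chosen Task-Plugin yields both priors from the input image.
    visual_prior, spatial_prior = plugins[idx](image)
    # Visual guidance prior -> cross-attention layers;
    # spatial prior -> final stage of the decoder.
    return diffusion_sample(cross_attn_cond=visual_prior,
                            decoder_cond=spatial_prior)
```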

JustinKai0527 commented 6 months ago

@yuhaoliu7456 Thanks for the quick response! In Q1, what confuses me is that the GT image is, I think, a clean image, i.e., not a snowy or rainy image. Yet you can train the Task-Plugin to learn how to clean up the unwanted details. Why? From my understanding, the loss term is basically the denoising term of a diffusion model, which learns how to reconstruct the image.

yuhaoliu7456 commented 6 months ago

Because the input of the Task-Plugin is a degraded image (e.g., a snow image). During the diffusion/reconstruction process, the snow image is actually a conditional input, and this forces the model to get rid of the snow pattern via the Task-Plugin so that the reconstruction can be done.
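In loss terms (my notation, approximating the paper's objective), with $I_{deg}$ the degraded input and $z_t$ the noised latent of the clean GT:

$$\mathcal{L} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left[\lVert \epsilon - \epsilon_\theta\big(z_t, t, \mathrm{TPB}(I_{deg}), \mathrm{SCB}(I_{deg})\big) \rVert_2^2\right]$$

The target latent comes from the clean image while all conditioning comes from the degraded one, so minimizing the loss forces the plugin branches to strip the snow pattern rather than reproduce it.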


JustinKai0527 commented 6 months ago

@yuhaoliu7456 Thanks for replying. Okay, so you have paired images: a snow image and a snow-free image.

[image]

In the picture, you feed the snow image to get the prior, and the $\hat{I}$ in the loss term below is the snow-free image?

[image]

So in training, the prior is obtained from the snow image, and the model tries to reconstruct $z_t$, which is encoded from $\hat{I}$ (the snow-free image)?

lyf1212 commented 5 months ago

> So in training, the prior is obtained from the snow image, and the model tries to reconstruct $z_t$, which is encoded from $\hat{I}$ (the snow-free image)?

I think that is the right understanding. In other words, you can also treat this method as a unified ControlNet: it takes the snowy/rainy/noisy/... images as the conditional input of a ControlNet and guides the diffusion process through mid-feature insertion and ControlNet's ingenious zero-conv design. The main contribution of this work is that they use the CLIP image encoder to extract visual embeddings of the degraded images, align them with the "degradation descriptions", and design a task selector via a basic contrastive loss and a newly proposed dataset. Also looking forward to the official reply from the author~
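A rough sketch of the contrastive selector training I have in mind (a standard symmetric InfoNCE over CLIP embeddings; the pairing convention and temperature are my assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def selector_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs: CLIP image embeddings of degraded images, shape (N, D)
    # text_embs:  CLIP text embeddings of their degradation
    #             descriptions, row-aligned with image_embs, shape (N, D)
    image_embs = F.normalize(image_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = image_embs @ text_embs.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    # Symmetric loss: match each image to its description and back.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```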

yuhaoliu7456 commented 4 months ago

The main contribution of our work is the whole framework, not just the control-based encoder. Please take a look at the main paper for a more detailed explanation. If there are no other questions, I will close this issue later. Thanks.

How-Wang commented 4 months ago

Initially, $z_0$ does not have degradation (though it contains noise), and the final clean image $z_t$ also does not have degradation. So why do they start with the GT image? I think the key point during inference is that this paper does not employ DDIM inversion!

Therefore, the TPB is not actually about "removing" degradation but about guiding the model to avoid generating degradation. The information from the SCB, which actually contains spatial info (with the degradation hopefully removed by the SCB itself), is fed in at the last stage.

By using $F_p$, which initially contains both image content and degradation information, we train the model to isolate only the degradation information. This allows the denoising process to understand what the degradation looks like and avoid generating it.
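For what it's worth, here is how I picture the two branches; the module shapes and names are my guesses for illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class TaskPlugin(nn.Module):
    def __init__(self, clip_image_encoder, dim=768):
        super().__init__()
        self.clip = clip_image_encoder          # frozen CLIP image encoder
        self.tpb = nn.Sequential(               # TPB: guidance-prior branch
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.scb = nn.Sequential(               # SCB: spatial-prior branch
            nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 4, 3, padding=1))

    def forward(self, degraded_img, degraded_latent):
        # F_p: distilled from CLIP features of the degraded image and
        # injected via cross-attention, so the denoiser learns what the
        # degradation looks like and avoids *generating* it.
        f_p = self.tpb(self.clip(degraded_img))
        # F_s: spatial detail prior, added at the last decoder stage.
        f_s = self.scb(degraded_latent)
        return f_p, f_s
```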