xuebinqin / DIS

This is the repo for our new project Highly Accurate Dichotomous Image Segmentation
Apache License 2.0

Question about the logic behind GT mask encoder #7

Closed faruknane closed 1 year ago

faruknane commented 2 years ago

Hi @xuebinqin ,

I'm having issues understanding the logic behind the GT mask encoder. Generally, self-supervised encoders do not have skip connections between the encoding and decoding layers. Typically, the aim is to have a bottleneck in the middle so that the model learns abstract, meaningful concepts and the most meaningful information can be extracted from it. In the IS-Net model, however, you used an encoder consisting of RSU blocks, which looks like this:

[image: drawing] https://user-images.githubusercontent.com/37745467/179394877-df9ee752-a507-432c-be7f-6ea13bc97001.png
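
To make the concern concrete, a toy sketch of such a block (purely illustrative, not the actual RSU implementation from this repo) might look like the following, where the internal skip connection lets shallow features bypass the bottleneck:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRSUBlock(nn.Module):
    """Heavily simplified, RSU-style block for illustration only:
    an inner encoder-decoder whose skip connection lets shallow
    features bypass the bottleneck."""

    def __init__(self, in_ch=1, mid_ch=16, out_ch=64):
        super().__init__()
        self.conv_in = nn.Conv2d(in_ch, out_ch, 3, padding=1)    # shallow features
        self.enc = nn.Conv2d(out_ch, mid_ch, 3, padding=1)       # encoder stage
        self.bottleneck = nn.Conv2d(mid_ch, mid_ch, 3, padding=1)
        self.dec = nn.Conv2d(mid_ch * 2, out_ch, 3, padding=1)   # decoder stage (takes the skip concat)

    def forward(self, x):
        hx = F.relu(self.conv_in(x))
        e = F.relu(self.enc(hx))
        b = F.relu(self.bottleneck(F.max_pool2d(e, 2)))
        b = F.interpolate(b, size=e.shape[2:], mode="bilinear", align_corners=False)
        d = F.relu(self.dec(torch.cat([b, e], dim=1)))  # skip connection: e bypasses the bottleneck
        return hx + d                                   # residual connection around the inner U

mask = torch.rand(1, 1, 64, 64)        # single-channel GT mask
feats = TinyRSUBlock()(mask)           # 64-channel features at the same resolution
print(feats.shape)                     # torch.Size([1, 64, 64, 64])
```
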

I believe there is a chance that the GT encoder can pass all the information from the input to the output without compressing or processing any high-level info since there are skip connections. I see that this was not the case in your training.

  • May I ask why this was not the case?
  • How did you develop this idea even though I think there is a chance that it wouldn't do any good at all?
  • Finally, why do you think this increased the performance of the network? Is it all about forcing the model to learn more stuff instead of learning one map?

I just wonder about your thoughts on this, which is very important to me. Thank you!

xuebinqin commented 2 years ago

Thanks for your interest. There are multiple different skip connections, and even the shallowest ones come after at least one convolution layer, which already maps the details into a high-dimensional space. So your comment "the GT encoder can pass all the information from the input to the output without compressing or processing any high-level info since there are skip connections" is not very accurate: the GT encoder is also deeply supervised, and we cannot ignore the layers at the bottom.

More importantly, the features produced by the GT encoder and used for supervising the training of DIS have more channels than the single-channel ground truth. So the GT encoder can be understood as a ground-truth decomposer or descriptor rather than a compressor: it converts the ground truth into another high-dimensional space, which provides supervision from different perspectives. The motivation is to provide denser supervision for the training process and thereby reduce overfitting. Of course, there may well be better ways to encode the ground truth; we believe there are more possibilities to be explored, and in this paper we just provide one possible way to do it. You could also try removing the skip connections to see whether that works better, which would be a very interesting topic.
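
To make this concrete, here is a minimal sketch of this kind of feature-level supervision (the function, tensor names, and shapes are illustrative assumptions, not the actual IS-Net training code):

```python
import torch.nn.functional as F

def feature_supervision_loss(seg_feats, gt_feats):
    """MSE between the segmentation network's intermediate features and the
    GT encoder's features, summed over the matching stages."""
    return sum(F.mse_loss(sf, gf) for sf, gf in zip(seg_feats, gt_feats))

# Hypothetical usage: the single-channel GT mask is lifted by the GT encoder
# into multi-channel feature maps, and those maps supervise the matching
# intermediate features of the segmentation network, in addition to the usual
# mask loss on the side outputs.
#   gt_feats = gt_encoder(gt_mask)         # e.g. a list of (B, C_i, H_i, W_i) maps
#   seg_feats, side_masks = isnet(image)   # matching features + predicted masks
#   loss = mask_loss(side_masks, gt_mask) + feature_supervision_loss(seg_feats, gt_feats)
```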


faruknane commented 2 years ago

@xuebinqin Thank you for your reply.

Thanks for your interest. There are multiple different skip connections, and even the shallowest ones come after at least one convolution layer, which already maps the details into a high-dimensional space. So your comment "the GT encoder can pass all the information from the input to the output without compressing or processing any high-level info since there are skip connections" is not very accurate: the GT encoder is also deeply supervised, and we cannot ignore the layers at the bottom.

I think there is still a possibility (not really) that it could pass the input directly to the output without processing it, even though there are convolution layers and kernels. One kernel might look like the one below and pass all the information through:

[image: example of such a pass-through kernel]
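
For concreteness, here is a toy sketch of such a pass-through kernel (purely illustrative, not code from this repo):

```python
import torch
import torch.nn.functional as F

# A 3x3 kernel with 1 at the centre and 0 elsewhere copies its input exactly,
# so in principle a convolution layer *could* forward the mask unchanged.
identity_kernel = torch.zeros(1, 1, 3, 3)
identity_kernel[0, 0, 1, 1] = 1.0

mask = torch.rand(1, 1, 8, 8)
out = F.conv2d(mask, identity_kernel, padding=1)
print(torch.allclose(out, mask))  # True: the input passes through untouched

# With random initialization, however, the weights are essentially never this
# identity kernel, so the very first conv layer already mixes the mask values.
random_kernel = torch.randn(1, 1, 3, 3)
print(torch.allclose(F.conv2d(mask, random_kernel, padding=1), mask))  # False
```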

However, now I see your point. We can't ignore the other layers, nor the random initialization of the kernel values, which means such a case will never happen in practice, right? So, with the convolution layer at the beginning, the input mask will be distorted a bit no matter what we do or how we initialize the model. The other layers will then also play a role, some distorting the input further and some undoing the distortion. Along the way, the distorted input gets translated back toward the original input, with the last layer taking different pieces of information from the different kernel outputs of the previous conv layer. We take the feature maps before that last conv, which gives us richer information about the input map. I am just trying to enrich my understanding; your comments on this are really important to me.

More importantly, the features produced by the GT encoder and used for supervising the training of DIS have more channels than the single-channel ground truth. So the GT encoder can be understood as a ground-truth decomposer or descriptor rather than a compressor: it converts the ground truth into another high-dimensional space, which provides supervision from different perspectives. The motivation is to provide denser supervision for the training process and thereby reduce overfitting. Of course, there may well be better ways to encode the ground truth; we believe there are more possibilities to be explored, and in this paper we just provide one possible way to do it. You could also try removing the skip connections to see whether that works better, which would be a very interesting topic.

Now I look at the GT encoder from a different perspective: it acts like a decomposer, providing more meaningful features about the mask. Thank you again @xuebinqin! I have been following your research for a while and I really enjoy it. Hope you will develop more amazing ideas!

faruknane commented 1 year ago

Closing the issue.