xuebinqin / DIS

This is the repo for our new project Highly Accurate Dichotomous Image Segmentation
Apache License 2.0

Why use a 1024x1024 output instead of 512x512, when upsampling doesn't increase the real resolution? #61

Closed faruknane closed 1 year ago

faruknane commented 1 year ago

Hi @xuebinqin ,

I looked at the ISNet model code and saw that the model always takes a 1024x1024 input and produces a 1x512x512 output (d1 before upsampling), which is later upsampled to 1x1024x1024 in model.forward. The same holds for the GT encoder. So, in my opinion, upsampling 1x512x512 to 1x1024x1024 makes no sense, because the real output resolution is always 512x512.
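A minimal sketch of the behaviour I mean (the tensor name d1 follows the repo; everything else here is illustrative, not the actual model code):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 1024, 1024)   # the fixed 1024x1024 network input
d1 = torch.randn(1, 1, 512, 512)    # stand-in for the side output before upsampling

# the final step of forward: plain bilinear upsampling to the input resolution
out = F.interpolate(d1, size=x.shape[2:], mode="bilinear", align_corners=False)
print(out.shape)  # torch.Size([1, 1, 1024, 1024])
```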

[image: https://user-images.githubusercontent.com/37745467/205459314-77bdc3ce-f7c4-41e6-92bf-b018e98c7a3a.png]

Could you explain the reasoning behind this?

xuebinqin commented 1 year ago

Not exactly. Setting the input to 1024x1024 gains more detail, and at the same time the receptive fields are different from those of a 512x512 input. More importantly, the upsampling from 512x512 to 1024x1024 at the end only looks like a plain upsampling operation. Remember that the supervision is conducted at 1024x1024, so the model is forced to produce 512x512 maps that suit the upsampling operation and thereby recover detail. Later on, if you don't care about the memory cost, you can also switch the d0 line and retrain the model to run one more convolution on the high-resolution maps for even more detail.
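A sketch of that supervision point, assuming a simple BCE setup (the names and loss here are illustrative; the actual training code differs):

```python
import torch
import torch.nn.functional as F

d1 = torch.randn(1, 1, 512, 512, requires_grad=True)  # pre-upsample side output
gt = torch.rand(1, 1, 1024, 1024)                      # full-resolution ground truth

# the loss is taken on the *upsampled* map, so gradients flow back through
# the bilinear step and shape the 512x512 features for it
pred = F.interpolate(d1, size=(1024, 1024), mode="bilinear", align_corners=False)
loss = F.binary_cross_entropy_with_logits(pred, gt)
loss.backward()
```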


faruknane commented 1 year ago

Yes, I think what you are saying is true, but I'm not talking about the input size. With a 1024x1024 input the model sees more detail and can therefore produce better-detailed 512x512 maps (more accurate and perhaps sharper). However, I think that if I made the model output 512x512 instead of 1024x1024 (with the input size unchanged), it would produce exactly the same quality and accuracy. What are your thoughts on that? Just thinking out loud...
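Sketched the same way as above, the alternative I'm describing would supervise directly at 512x512 against a downsampled ground truth (again, names and loss are illustrative only):

```python
import torch
import torch.nn.functional as F

d1 = torch.randn(1, 1, 512, 512, requires_grad=True)  # final output, no upsampling
gt = torch.rand(1, 1, 1024, 1024)                      # full-resolution ground truth

# supervise at 512x512 instead; whether this really matches the 1024x1024
# scheme in quality is exactly the question
gt_small = F.interpolate(gt, size=(512, 512), mode="bilinear", align_corners=False)
loss = F.binary_cross_entropy_with_logits(d1, gt_small)
loss.backward()
```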

Also, I'd like to ask: did you have a chance to experiment with the first decoder output being 1024x1024 (before upsampling)?

Thank you so much!

xuebinqin commented 1 year ago

The inference scheme needs to be consistent with the training. Bilinear upsampling will add a bit more detail, especially when the model was trained with that scheme, so why not use it? We did test a 1024 output: outputting 1024 directly won't improve the performance that much, but it will greatly increase the GPU memory cost.
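A sketch of what "consistent with the training" means at inference time, assuming a loaded net (the [0][0] indexing into the returned side outputs is an assumption about the return structure, so check it against the repo's demo code):

```python
import torch
import torch.nn.functional as F

def predict_mask(net, image):
    """image: 1x3xHxW float tensor in [0, 1]; returns a 1x1xHxW mask."""
    h, w = image.shape[2:]
    # same scale as training: resize the input to 1024x1024
    x = F.interpolate(image, size=(1024, 1024), mode="bilinear", align_corners=False)
    with torch.no_grad():
        d1 = net(x)[0][0]  # assumed: highest-resolution side output comes first
    mask = torch.sigmoid(d1)
    # map the 1024x1024 prediction back to the original image size
    return F.interpolate(mask, size=(h, w), mode="bilinear", align_corners=False)
```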


faruknane commented 1 year ago

@xuebinqin Thank you for your answers. What about using up-convolution layers at the end (upsampling with transposed convolution), did you ever try that? (I'm not suggesting making the decoder itself work at 1024x1024.)

This wouldn't add that much memory cost, would it?

[image: https://user-images.githubusercontent.com/37745467/205466170-c61095a0-d277-4363-92cc-9d672bc17753.png]

But would it be helpful for better accuracy?

xuebinqin commented 1 year ago

That is a good point. I personally don't think transposed convolution makes that much sense; think about the difference between (1) a transposed convolution and (2) bilinear upsampling + convolution. We did try (2), and it didn't give perceivable differences. The reason that adding this kind of module may not help much is that all deep models are "over-fitting" in some way on certain training sets. Put simply, the model is already "overfitting" on the training set, and newly added modules are highly likely to just overfit it further. In my view, thinking about and studying the description and comparison of data distributions matters more than adding "modules". Of course, more efficient and effective modules are also non-trivial and should be encouraged. Anyway, you can give it a try; without experiments no one can tell whether it works, especially in deep learning. All the best.
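For concreteness, the two options being compared, as a sketch (the channel counts are illustrative, not the repo's actual ones):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 64, 512, 512)  # some high-level 512x512 feature map

# (1) learned upsampling via a transposed convolution
up1 = nn.ConvTranspose2d(64, 1, kernel_size=2, stride=2)
out1 = up1(feat)

# (2) bilinear upsampling followed by a convolution
conv = nn.Conv2d(64, 1, kernel_size=3, padding=1)
out2 = conv(F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False))

print(out1.shape, out2.shape)  # both torch.Size([1, 1, 1024, 1024])
```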


faruknane commented 1 year ago

Thank you for your answers again. I hope I understand what you mean correctly. If I get the chance to experiment with it on my own, I will be happy to let you know the results. Modules matter in that they create an ability to learn by providing mathematical pathways, but an ability to learn doesn't guarantee learning well. As you mentioned, other things such as overfitting must also be taken care of.

I have been following your research since the U2-Net paper was published, and I will probably read the papers you publish in the future. From time to time you might see me in your GitHub issues 😅

If there are no more replies, either of us can close the issue. Thank you again!

xuebinqin commented 1 year ago

Yes, exactly. That's why in our paper we suggest that more researchers pay a bit more attention to (data) distribution description and analysis. Great modules help provide practically usable models for society, and that direction has developed greatly in the past few years. Distribution analysis could be a great starting point for unveiling the essence of the "black box" (deep models).
