xuebinqin / DIS

This is the repo for our new project Highly Accurate Dichotomous Image Segmentation
Apache License 2.0

Questions regarding GT encoder #2

Open Youngwoo-git opened 2 years ago

Youngwoo-git commented 2 years ago

Hi,

Thanks for the interesting work.

Just reviewing the paper here, I've got several questions regarding GT encoder.

  1. You describe the GT encoder as self-supervised -- do you mean it is implemented as an auto-encoder?

  2. In Figure 5(b), the depicted Ground Truth Encoder has only the encoder part -- does this mean I only need to train the encoder (not a decoder), targeting the GT?

  3. Again in Figure 5(b), do I need extra upsampling so that the encoder output has the same size as the input? For clarity: if I feed a 3X1024X1024 input, after the green conv layer it becomes 16X512X512; after going through EN_1 it becomes 64X512X512 and is upsampled to 3X512X512 (assuming the upsampling used in u2net is applied in the same manner). The question, then, is how we can compare the upsampled EN_1 output (3X512X512) with the original input (3X1024X1024) in the BCE loss calculation.

-- One thought I had was to temporarily add extra upsampling layers to the encoder while training the GT encoder, then remove them once the GT-encoder weights are frozen. Would this be a viable option, or did you mean something else?

Thanks in advance :)

xuebinqin commented 2 years ago

Thanks for your interest.


You describe the GT encoder as self-supervised -- do you mean it is implemented as an auto-encoder? RES: It is very similar to an auto-encoder. Here, we are trying to overfit the ground-truth masks.

In Figure 5(b), the depicted Ground Truth Encoder has only the encoder part -- does this mean I only need to train the encoder (not a decoder), targeting the GT? RES: Yes, you only need to train the encoder part to overfit the GT. We also tried using the full u2net as the GT encoder; there isn't much difference, though the difference may also depend on your dataset. For overfitting the GT, around 2000 iterations is enough.
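To make the "overfit the GT" idea concrete, here is a minimal sketch in PyTorch. This is a toy stand-in, not the actual IS-Net code: `TinyGTEncoder`, its layer sizes, and the training settings are all made up for illustration. The point is that input and target are the same GT mask, so the encoder only has to learn to reproduce it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the GT encoder (illustrative only): stride-1 convs,
# so the side output already matches the GT mask resolution.
class TinyGTEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.side = nn.Conv2d(16, 1, 3, padding=1)  # 1-channel logit map

    def forward(self, x):
        return self.side(self.body(x))

# "Self-supervised" here just means input == target: we feed the GT mask
# in and supervise the output against the same GT mask.
model = TinyGTEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
gt = (torch.rand(4, 1, 32, 32) > 0.5).float()  # toy batch of GT masks

for step in range(200):  # the authors report ~2000 iterations in practice
    loss = loss_fn(model(gt), gt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the target is the input itself, the loss drops quickly, which is why so few iterations suffice compared with full segmentation training.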

Again in Figure 5(b), do I need extra upsampling so that the encoder output has the same size as the input? How can we compare the upsampled EN_1 output (3X512X512) with the original input (3X1024X1024) in the BCE loss calculation? RES: You can either downsample the GT or upsample the final output. We suggest the latter if you have enough GPU memory.
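The two options can be sketched as follows. This is a toy illustration with random tensors at the sizes from the question (single-channel logits at half resolution vs. a full-resolution GT); the variable names are made up and nothing here is the actual IS-Net code.

```python
import torch
import torch.nn.functional as F

gt = torch.rand(1, 1, 1024, 1024).round()   # GT mask at input resolution
side_logits = torch.randn(1, 1, 512, 512)   # side output at half resolution

# Option 1 (suggested, if GPU memory allows): upsample the prediction.
up = F.interpolate(side_logits, size=gt.shape[-2:],
                   mode='bilinear', align_corners=False)
loss = F.binary_cross_entropy_with_logits(up, gt)

# Option 2 (cheaper on memory): downsample the GT instead.
gt_down = F.interpolate(gt, size=side_logits.shape[-2:],
                        mode='bilinear', align_corners=False)
loss2 = F.binary_cross_entropy_with_logits(side_logits, gt_down)
```

Note that bilinear downsampling turns the binary GT into soft values in [0, 1], which `binary_cross_entropy_with_logits` accepts as targets; upsampling the prediction keeps the GT crisp, which is one reason to prefer it when memory allows.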

-- One thought I had was to temporarily add extra upsampling layers to the encoder while training the GT encoder, then remove them once the GT-encoder weights are frozen. Would this be a viable option, or did you mean something else? RES: I think that should be fine for the GT encoder.
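One way this "temporary upsampling head" idea could look in code (a hypothetical wrapper for illustration; `GTEncoderWithTrainHead` and the one-layer toy encoder are invented here, not part of the repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GTEncoderWithTrainHead(nn.Module):
    """Hypothetical wrapper: upsample to input size only while training
    the GT encoder; return raw features once it is frozen."""
    def __init__(self, encoder, train_mode=True):
        super().__init__()
        self.encoder = encoder
        self.train_mode = train_mode

    def forward(self, x):
        feats = self.encoder(x)  # e.g. half-resolution logits
        if self.train_mode:
            # Temporary head: match the input size for the BCE loss.
            return F.interpolate(feats, size=x.shape[-2:],
                                 mode='bilinear', align_corners=False)
        return feats  # after freezing: features for intermediate supervision

encoder = nn.Conv2d(1, 1, 3, stride=2, padding=1)  # toy stand-in encoder
net = GTEncoderWithTrainHead(encoder, train_mode=True)
out = net(torch.rand(1, 1, 64, 64))    # training mode: output matches input

net.train_mode = False                 # drop the temporary upsampling
for p in net.parameters():
    p.requires_grad_(False)            # freeze after overfitting the GT
feats = net(torch.rand(1, 1, 64, 64))  # frozen mode: half-resolution features
```

Since `F.interpolate` has no learnable parameters, removing it after training changes nothing about the frozen encoder weights, which is what makes this scheme safe.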