ygjwd12345 / TransDepth

Code for Transformers Solve Limited Receptive Field for Monocular Depth Prediction
MIT License

Inconsistency between the text and code #10

Closed Mathilda88 closed 3 years ago

Mathilda88 commented 3 years ago

Hi,

Thanks for the great work. In Fig. 2 of the paper it is written that "" stands for convolution. For example, I_{r-->r}^{i}f_{r} in Eq. (8) means that these two maps are convolved together. However, in the code you just use an element-wise multiplication between these two feature maps.

My second question is about unfolding. It seems that after unfolding the input variable (https://github.com/ygjwd12345/TransDepth/blob/0a7422c6d816429b9f3fc4cca19d93de8cd1ab8a/pytorch/AttentionGraphCondKernel.py#L101), we get an output with the same spatial size but with 9 extra channels on top of the channels we already had. I was wondering whether the spatial content is preserved by this type of unfolding: if we sample the top-right corner of the spatial maps, do all the channels come from the same spatial location in the original map?

Thanks,

ygjwd12345 commented 3 years ago

Thanks for your attention. For Q1, we do perform the convolution, step by step: unfold --> element-wise multiplication --> sum. For Q2, I think the main problem has been answered by A1. The "9 additional channels" are the kernel size^2, which is not related to spatial content.
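To make the three steps concrete, here is a minimal sketch that checks unfold --> element-wise multiplication --> sum against `F.conv2d`. It is not taken from the repository: the tensor sizes, the 3x3 kernel, `padding=1`, and the names `feat`, `kernel`, and `out_manual` are all assumptions chosen for illustration, and the kernel here is shared across locations, as in an ordinary convolution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not values from the repository).
N, C, H, W = 1, 2, 5, 5
k = 3  # kernel size, so each location has k * k = 9 taps per channel

feat = torch.randn(N, C, H, W)
kernel = torch.randn(1, C, k, k)  # one output channel, shared by all locations

# Step 1: unfold. Result has shape (N, C*k*k, H*W): for every spatial
# location, one column holding the values of its 3x3 neighbourhood.
patches = F.unfold(feat, kernel_size=k, padding=1)

# Step 2: element-wise multiplication with the flattened kernel weights.
weighted = patches * kernel.reshape(1, C * k * k, 1)

# Step 3: sum over the C*k*k axis, then restore the spatial layout.
out_manual = weighted.sum(dim=1).reshape(N, 1, H, W)

# Reference result from the built-in convolution.
out_conv = F.conv2d(feat, kernel, padding=1)

same = torch.allclose(out_manual, out_conv, atol=1e-5)
print(same)
```

If the weights differ from location to location, the same three steps still apply; the multiplication in step 2 would then use a tensor with one set of weights per column instead of a broadcast kernel.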

Mathilda88 commented 3 years ago

Thank you so much for your response. Actually, I didn't understand what the unfolding does for us here. Do you mean it is the same as copying a feature map 9 times and storing the copies along a new dimension?

Mathilda88 commented 3 years ago

I know that it extracts rolling blocks from the spatial dimensions, but I can't picture what that looks like in practice here. Could you please elaborate on it a bit?

ygjwd12345 commented 3 years ago

    def unfold(input, kernel_size, dilation=1, padding=0, stride=1):
        # type: (Tensor, BroadcastingList2[int], BroadcastingList2[int], BroadcastingList2[int], BroadcastingList2[int]) -> Tensor  # noqa
        r"""Extracts sliding local blocks from a batched input tensor.

        .. warning::
            Currently, only 4-D input tensors (batched image-like tensors) are
            supported.

        .. warning::

            More than one element of the unfolded tensor may refer to a single
            memory location. As a result, in-place operations (especially ones that
            are vectorized) may result in incorrect behavior. If you need to write
            to the tensor, please clone it first.

        See :class:`torch.nn.Unfold` for details
        """

Hope it is useful for you
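For readers wondering what the 9 extra channels contain in practice, the following plain-Python sketch mimics a 3x3 unfold with zero padding 1 on a single 2-D map. The helper `unfold_3x3` is hypothetical and written only for illustration. The 3x3 kernel, padding of 1, and dilation of 1 are assumptions, and the channel order (row-major over the kernel taps) reflects my understanding of how `F.unfold` lays out its output.

```python
def unfold_3x3(image):
    """Plain-Python sketch of a 3x3 unfold with zero padding 1 on one 2-D map.

    Returns 9 maps of the same size as the input. Map k = 3 * ki + kj holds,
    at position (y, x), the input value at (y + ki - 1, x + kj - 1), or 0
    when that position falls outside the input.
    """
    h, w = len(image), len(image[0])
    maps = []
    for ki in range(3):
        for kj in range(3):
            shifted = [[0] * w for _ in range(h)]
            for y in range(h):
                for x in range(w):
                    sy, sx = y + ki - 1, x + kj - 1
                    if 0 <= sy < h and 0 <= sx < w:
                        shifted[y][x] = image[sy][sx]
            maps.append(shifted)
    return maps


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
maps = unfold_3x3(image)

# Map 4 is the centre tap (ki = kj = 1): an unshifted copy of the input.
print(maps[4] == image)  # True

# The 9 values stored at the top-right corner (y = 0, x = 2) are that
# pixel's zero-padded 3x3 neighbourhood, not 9 copies of the same value.
corner = [m[0][2] for m in maps]
print(corner)  # [0, 0, 0, 2, 3, 0, 5, 6, 0]
```

So the 9 maps are shifted copies of the feature map rather than identical ones: at any position they all refer to the neighbourhood centred on that same position in the original map.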

Mathilda88 commented 3 years ago

Thanks Stanly, but I was asking what you are using this function for!

Mathilda88 commented 3 years ago

In other words, is this unfolding what you use whenever you want to implement a convolution between two feature maps?

ygjwd12345 commented 3 years ago

It is just for the AGD module.