yitu-opensource / T2T-ViT

ICCV2021, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Questions about feature visualization of vit_large_patch16_384 #33

Closed Hongyu-He closed 3 years ago

Hongyu-He commented 3 years ago

For Figure 2 in the paper, I tried to plot the feature visualization of T2T-ViT-24 trained on ImageNet, using the code provided in visualization_vit.ipynb and the same input image "dog.png". The input image was resized to (1024, 1024), and the resulting feature maps have size (64, 64). However, the plotted feature maps look very different from those in your paper. The following figure shows my feature maps from T2T-ViT-24 block 1:

[figure: feature maps from T2T-ViT-24 block 1 ("layer_0")]

There is a lot of noise in my feature maps, and low-level structural features such as edges and lines are not clear. I'm not sure what caused the discrepancy. Also, the resolution of the feature maps in the paper looks higher than 64×64. Could you please provide more instructions on feature visualization for this model? That would help me understand your work better. Thank you in advance!

yuanli2333 commented 3 years ago

Hi, you should resize the input image to (2048, 2048) or even larger to obtain higher-resolution feature maps. Feature maps of size (64, 64) cannot show clear low-level structure.
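To make the resolution relationship concrete: with a patch stride of 16, a (1024, 1024) input yields a 64×64 token grid, while (2048, 2048) yields 128×128. The helper below is a minimal sketch (not code from the repo) of reshaping a block's token sequence back into 2D feature maps for plotting; it assumes the sequence contains no class token and that the token count is a perfect square.

```python
import torch

def tokens_to_maps(tokens: torch.Tensor) -> torch.Tensor:
    """Reshape a (B, N, C) token sequence into (B, C, H, W) feature maps.

    Assumes no class token and N = H * W with H == W (square grid).
    """
    b, n, c = tokens.shape
    side = int(n ** 0.5)
    assert side * side == n, "token count must be a perfect square"
    # (B, N, C) -> (B, H, W, C) -> (B, C, H, W) for image-style plotting
    return tokens.reshape(b, side, side, c).permute(0, 3, 1, 2)
```

Each channel of the result can then be shown with e.g. `plt.imshow(maps[0, i])`, which is how the per-channel visualizations in the notebook are typically rendered.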

HubHop commented 3 years ago

@yuanli2333 Hi, how do you handle the positional embedding with larger images when you do visualization?

yuanli2333 commented 3 years ago

> @yuanli2333 Hi, how do you handle the positional embedding with larger images when you do visualization?

Hi,

You can interpolate the position embedding for different image sizes with the function here.

Or directly use T2T-ViT as shown in the usage; we already include the interpolation in the 'load_for_transfer_learning' function.
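For reference, position-embedding interpolation along these lines is commonly implemented by reshaping the learned patch embeddings into a 2D grid and resampling it bicubically. The sketch below is an illustrative version, not the repo's exact function; it assumes a (1, 1 + H*W, C) embedding with a single leading class token that is kept unchanged.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, new_grid: tuple) -> torch.Tensor:
    """Interpolate a ViT position embedding to a new token-grid size.

    pos_embed: (1, 1 + H*W, C), first token assumed to be the class token.
    new_grid:  (H_new, W_new) target grid, e.g. (128, 128) for a 2048x2048 input
               with stride-16 patches.
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    n, c = patch_pos.shape[1], patch_pos.shape[2]
    old = int(n ** 0.5)  # assume the original grid is square
    # (1, N, C) -> (1, C, H, W) so F.interpolate can resample spatially
    patch_pos = patch_pos.reshape(1, old, old, c).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=new_grid,
                              mode='bicubic', align_corners=False)
    # back to (1, H_new * W_new, C) and re-attach the class token
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], c)
    return torch.cat([cls_tok, patch_pos], dim=1)
```

With this, loading ImageNet-pretrained weights and running the visualization notebook on a (2048, 2048) input only requires resizing the position embedding to the 128×128 grid before the forward pass.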