wx-zhang / spu


About visual encoder update #2

Closed YangJae96 closed 2 weeks ago

YangJae96 commented 2 weeks ago

Hello. Thank you for your great work!

Is only the projection layer of CLIP's visual encoder updated?

[screenshot of the training code showing which parameters are set trainable]

In the figure above, I can see that only the `finetune_proj` layer is set to `trainable_params = True`.

So only the first layer of the MLP blocks in the text encoder is updated?

Also, do the datasets you used in the continual learning sequence have text labels? I thought only the image classification loss was used, but the compute-loss section of your code also calculates a text-to-image cross-entropy loss.

Thanks in advance.

wx-zhang commented 2 weeks ago

Hi @YangJae96,

In the above code, training of the `visual.proj` layer is controlled by the `finetune_proj` parameter, while all other layers are handled by the `else` branch, including the vision and text towers. In both the vision and text encoders, SPU proposes to selectively update the first MLP layer.
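The gating described above can be sketched as a name-based filter. This is a hypothetical helper, not the repository's actual code; it assumes the OpenAI/open_clip parameter naming, where the first MLP layer in each transformer block is called `mlp.c_fc`:

```python
def is_trainable(name: str, finetune_proj: bool = True) -> bool:
    """Decide whether a CLIP parameter should receive gradients.

    Sketch of the rule described above: the visual projection is gated
    by `finetune_proj`, and in both towers only the first MLP layer
    (`mlp.c_fc` in OpenAI/open_clip naming) is updated; everything
    else stays frozen.
    """
    if name == "visual.proj":
        return finetune_proj
    # First MLP layer inside a transformer block (vision or text tower).
    return ".mlp.c_fc." in name

# Matches the rule:
print(is_trainable("visual.proj"))                                     # True
print(is_trainable("visual.transformer.resblocks.0.mlp.c_fc.weight"))  # True
print(is_trainable("transformer.resblocks.3.mlp.c_fc.bias"))           # True
# Everything else stays frozen:
print(is_trainable("transformer.resblocks.3.mlp.c_proj.weight"))       # False
```

In a training loop you would set `param.requires_grad = is_trainable(name)` while iterating over `model.named_parameters()`.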

For the datasets, we assume we have access to the class names, since we are exploring updating the CLIP model.
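This is why the text-to-image term appears even for pure image classification: the class names are encoded as text, and each text embedding's positive is its matching image in the batch. A minimal dependency-free sketch of the text-to-image cross-entropy (an illustration of the symmetric CLIP-style loss, not the repo's exact code):

```python
import math

def text_to_image_ce(logits):
    """Cross-entropy over each text row of a text-image similarity
    matrix; row i's correct image is column i.  A sketch of the
    text-to-image half of a symmetric CLIP-style loss.
    """
    losses = []
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[i])  # -log softmax at the true column
    return sum(losses) / len(losses)

# Toy 2x2 similarity matrix: texts (from class names) on rows,
# images on columns, with the matching pairs on the diagonal.
sims = [[5.0, 0.0],
        [0.0, 5.0]]
print(round(text_to_image_ce(sims), 4))  # → 0.0067
```

The full loss would average this with the image-to-text direction (the transposed matrix).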

Hope this helps.

YangJae96 commented 1 week ago

@wx-zhang Hi.

Just a simple question.

Why are the images resized to 224x224 for CIFAR-100?

If I remember correctly, the CIFAR-100 image size is 32x32.

Is it because training on 32x32 inputs degrades the control-set performance more, since ImageNet images are 224x224?

wx-zhang commented 1 week ago

Hi @YangJae96 ,

This is because the pre-trained patch embedding expects an input size of 224x224.