Closed YangJae96 closed 2 weeks ago
Hi @YangJae96,
In the above code, the training of the visual.proj layer
is controlled by the finetune_proj
parameter, while all other layers, including the vision and text towers, are handled by the else
branch. In both the vision and text encoder, SPU proposes to selectively update the first MLP layer.
For the datasets, we assume we have access to the class names, since we are exploring how to update the CLIP model.
Hope this helps.
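In case it helps others, here is a minimal sketch of that freezing logic. This is not the repo's exact code: the parameter names (`visual.proj`, `mlp.c_fc`) follow OpenAI-CLIP naming conventions and the toy model below is only for illustration.

```python
# Hypothetical sketch of SPU-style selective unfreezing (not the repo's exact
# code): freeze everything, then re-enable visual.proj (when finetune_proj is
# set) and the first MLP linear (c_fc) of every transformer block in both towers.
import torch
import torch.nn as nn

def select_trainable(model: nn.Module, finetune_proj: bool = True) -> None:
    for name, p in model.named_parameters():
        if finetune_proj and name == "visual.proj":
            p.requires_grad = True          # image -> joint-space projection
        elif ".mlp.c_fc." in name:          # first MLP layer of each block
            p.requires_grad = True
        else:
            p.requires_grad = False         # everything else stays frozen

# Tiny stand-in model with CLIP-like parameter names, for illustration only.
class Block(nn.Module):
    def __init__(self, d=8):
        super().__init__()
        self.attn = nn.Linear(d, d)
        self.mlp = nn.Sequential()
        self.mlp.add_module("c_fc", nn.Linear(d, 4 * d))
        self.mlp.add_module("c_proj", nn.Linear(4 * d, d))

class Tower(nn.Module):
    def __init__(self):
        super().__init__()
        self.resblocks = nn.ModuleList([Block(), Block()])

class ToyCLIP(nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = Tower()
        self.visual.proj = nn.Parameter(torch.zeros(8, 8))
        self.transformer = Tower()          # text tower

model = ToyCLIP()
select_trainable(model, finetune_proj=True)
trainable = {n for n, p in model.named_parameters() if p.requires_grad}
```

With `finetune_proj=True`, only `visual.proj` and the `c_fc` weights/biases of both towers end up with `requires_grad=True`.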
@wx-zhang Hi.
Just a simple question.
Why are the images resized to 224x224 for CIFAR-100?
If I remember correctly, CIFAR-100 images are 32x32.
Is it because training on 32x32 inputs degrades the control-set performance more, given that ImageNet images are 224x224?
Hi @YangJae96 ,
This is because the pre-trained patch embedding expects an input size of 224x224.
Hello. Thank you for your great work!
Is only the projection layer of CLIP's visual encoder updated?
In the above figure, I can see that only the finetune_proj layer is set to trainable_params = True.
So is only the first layer of the MLP blocks in the text encoder updated?!
Also, do the datasets you used in the continual learning sequence have text labels? I thought only the image classification loss was used, but the compute-loss section in your code also calculates a text-to-image cross-entropy loss.
Thanks in advance.
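For anyone reading along: since the maintainer notes above that class names are available, the classification loss can be computed via image-text similarity. The sketch below shows the general pattern, not the repo's exact loss; the function name, temperature value, and the idea of encoding one prompt per class are my assumptions.

```python
# Sketch (assumptions, not the repo's exact code): classification as
# image-text matching. Class names are encoded into one text embedding per
# class, and cross-entropy is taken over image-to-text similarity logits;
# a symmetric text-to-image term can be built the same way from logits.t().
import torch
import torch.nn.functional as F

def clip_classification_loss(image_feats, text_feats, labels, temperature=0.01):
    # image_feats: (B, D) image embeddings; text_feats: (C, D), one per class
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature   # (B, C)
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings (4 images, 10 classes, dim 16).
img = torch.randn(4, 16)
txt = torch.randn(10, 16)
labels = torch.tensor([0, 3, 7, 9])
loss = clip_classification_loss(img, txt, labels)
```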