wusize / CLIPSelf

[ICLR2024 Spotlight] Code Release of CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
https://arxiv.org/abs/2310.01403

performance of Zero-shot Classification and Zero-shot Cross-Modal Retrieval #11

Open hamigualisingl opened 7 months ago

hamigualisingl commented 7 months ago

How does the fine-tuned model perform on Zero-shot Classification and Zero-shot Cross-Modal Retrieval?

wusize commented 7 months ago

Hi! Thanks for your question! Self-distillation does hurt performance on image recognition tasks. Adding a loss that aligns the image representations of the student and teacher models can alleviate this degradation, but we did not include it in the paper since we focused on dense prediction tasks.
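For concreteness, here is a minimal sketch of such an alignment term, assuming OpenCLIP-style `student` and `teacher` models that expose `encode_image`; the cosine-distance form and the weight `alpha` are my assumptions, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(student, teacher, images, alpha=1.0):
    # Hypothetical extra term: pull the student's global image embedding
    # back toward the frozen teacher's, to counteract the recognition drop.
    # `student`/`teacher` are assumed OpenCLIP-style models; `alpha` and the
    # cosine form are illustrative choices, not the paper's recipe.
    with torch.no_grad():
        t = F.normalize(teacher.encode_image(images), dim=-1)
    s = F.normalize(student.encode_image(images), dim=-1)
    return alpha * (1.0 - (s * t).sum(dim=-1)).mean()

# total_loss = clipself_dense_loss + global_alignment_loss(student, teacher, images)
```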

hamigualisingl commented 7 months ago

Thank you for your reply. I am working on CLIP pretraining, and I also hope to give CLIP both capabilities, global and local. Following the scheme in your paper, I trained a ViT-B/32 model on YFCC15M and found that its image-text retrieval performance is worse than the baseline, so I wanted to ask about this.


hamigualisingl commented 7 months ago

May I ask what the fine-tuned model's exact numbers are on the two tasks I mentioned?


wusize commented 7 months ago

As I recall, there is a 3-4 point decrease in top-1 classification accuracy on ImageNet.
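For reference, a rough sketch of how that zero-shot top-1 number can be measured with OpenCLIP; the checkpoint path, prompt template, and class-name list below are placeholders:

```python
import torch
import open_clip

# Placeholder checkpoint path; open_clip also accepts a local file here.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='/path/to/finetuned_checkpoint.pt')
tokenizer = open_clip.get_tokenizer('ViT-B-32')
model.eval()

classnames = [...]  # placeholder: the 1000 ImageNet class names
text = tokenizer([f'a photo of a {c}' for c in classnames])
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

def predict_top1(images):  # images: preprocessed batch of shape (B, 3, 224, 224)
    with torch.no_grad():
        img_feat = model.encode_image(images)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ text_feat.T).argmax(dim=-1)  # predicted class indices
```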

hamigualisingl commented 7 months ago

I ran CLIP on YFCC15M; results on this dataset are all relatively poor. With ViT-B/32, the accuracy dropped by half after fine-tuning.


hamigualisingl commented 7 months ago

Thanks, it may be that my hyperparameters are set incorrectly.


wusize commented 7 months ago

Since you are doing pretraining, you could take a look at this paper. Like PACL, you could try replacing the CLS-token pooling with global pooling at the last layer, so that the feature extraction for regional and global representations is the same.
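To illustrate the idea, a minimal sketch assuming an OpenCLIP-style ViT whose last block outputs a token sequence with the CLS token at index 0; this is one reading of the suggestion, not the exact PACL formulation:

```python
import torch

def global_pool_embedding(tokens: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, 1 + num_patches, width), output of the last ViT block,
    # with the CLS token at index 0 (OpenCLIP convention); proj: (width, embed_dim).
    # Instead of tokens[:, 0] (CLS pooling), average the patch tokens so the
    # global embedding is computed the same way as the regional (dense) ones.
    pooled = tokens[:, 1:].mean(dim=1)
    return pooled @ proj
```

With this pooling, the global image embedding and the dense region embeddings come from the same patch-token features, which is the consistency the suggestion points at.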

hamigualisingl commented 7 months ago

Thanks!
