ziqipang / LM4VisualEncoding

[ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers"
https://arxiv.org/abs/2310.12973
MIT License

Influence of ViT #11

Open jiazhen-code opened 1 month ago

jiazhen-code commented 1 month ago

Thank you for your insightful discovery. I have a question regarding the influence of the ViT. If you use a pre-trained ViT and freeze it, then train only the added adapter layers while also keeping the LLaMA block frozen, will the performance still consistently improve?

Additionally, would using a multimodal-aligned LLM, such as the LLaMA used in LLaVA, achieve better performance than the original LLaMA? I find these directions fascinating to explore, as they could provide clearer guidance on how to use LLM blocks in vision components.
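For concreteness, here is a minimal PyTorch sketch of the setting asked about above: a frozen ViT, a frozen LLaMA-style block, and only the adapter layers (plus a task head) receiving gradients. All module names are hypothetical stand-ins, not the repo's actual code or the paper's exact recipe.

```python
import torch
import torch.nn as nn

# Stand-ins for the real components: a pre-trained ViT backbone, a frozen
# LLaMA-style transformer block, and linear adapters bridging the two widths.
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
llama_block = nn.TransformerEncoderLayer(d_model=4096, nhead=32, batch_first=True)
adapter_in = nn.Linear(768, 4096)    # ViT features -> LLaMA width
adapter_out = nn.Linear(4096, 768)   # LLaMA width -> head width
head = nn.Linear(768, 1000)          # e.g., a classification head

# Freeze both the ViT and the LLaMA block; only the adapters and head train.
for module in (vit, llama_block):
    module.eval()
    for p in module.parameters():
        p.requires_grad = False

trainable = [p for m in (adapter_in, adapter_out, head) for p in m.parameters()]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One forward pass on dummy patch tokens:
# ViT -> adapter -> frozen LLaMA block -> adapter -> head.
tokens = torch.randn(2, 197, 768)    # (batch, tokens, dim)
feats = vit(tokens)
feats = adapter_out(llama_block(adapter_in(feats)))
logits = head(feats.mean(dim=1))
```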

ziqipang commented 2 weeks ago

@jiazhen-code Thanks for the insightful questions! These are all very valuable points, especially one year after we did this work. I hadn't tried these settings when writing the paper, but I would love to see any results on the points you mentioned.