ziqipang / LM4VisualEncoding

[ICLR 2024 (Spotlight)] "Frozen Transformers in Language Models are Effective Visual Encoder Layers"
https://arxiv.org/abs/2310.12973
MIT License

Sharing experiments on lung sound abnormality detection, and a suggestion: add experiments with randomly initialized LLM layer weights #7

Closed QiaoranC closed 9 months ago

QiaoranC commented 9 months ago

Hi, thank you for sharing this interesting discovery. I would suggest adding an ablation that keeps the LLM layer structure but uses randomly initialized weights; it would make the study more convincing.

In my research, I've adopted your method and incorporated it into my Transformer backbone. My focus is a specific task: detecting abnormalities in lung sounds. The model is closely related to AST and Whisper, taking STFT spectrograms as input and stacking multiple Transformer blocks; a rough sketch of the integration follows.
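Concretely, the integration follows the paper's linear-projection recipe. Below is a minimal sketch, not exact code: the `backbone` module, `feat_dim`, the `huggyllama/llama-7b` checkpoint name, and the layer index are placeholders for my setup, and the decoder-layer call signature varies across transformers versions.

```python
import torch
import torch.nn as nn
from transformers import LlamaModel

class FrozenLlamaAudioClassifier(nn.Module):
    """Trainable audio backbone -> linear -> frozen LLaMA block -> linear -> head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 llama_name: str = "huggyllama/llama-7b", layer_idx: int = 31):
        super().__init__()
        self.backbone = backbone                      # trainable AST/Whisper-style encoder
        llama = LlamaModel.from_pretrained(llama_name)
        self.llm_block = llama.layers[layer_idx]      # one pretrained decoder layer
        for p in self.llm_block.parameters():
            p.requires_grad = False                   # the LLM block stays frozen
        hidden = llama.config.hidden_size             # 4096 for LLaMA 7B
        self.proj_in = nn.Linear(feat_dim, hidden)    # trainable in-projection
        self.proj_out = nn.Linear(hidden, feat_dim)   # trainable out-projection
        self.head = nn.Linear(feat_dim, num_classes)  # multi-label logits

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        tokens = self.backbone(spec)                  # (B, T, feat_dim) token sequence
        x = self.proj_in(tokens)
        pos = torch.arange(tokens.shape[1], device=tokens.device).unsqueeze(0)
        # NOTE: newer transformers versions expect rotary `position_embeddings`
        # (cos, sin) instead of `position_ids`; adjust to your installed version.
        x = self.llm_block(x, position_ids=pos)[0]    # the layer returns a tuple
        x = self.proj_out(x)
        return self.head(x.mean(dim=1))               # mean-pool tokens, classify
```

Trained with BCE-with-logits loss, this covers the 2-4 multi-label targets; only the backbone, the two projections, and the head receive gradient updates.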

In the experiments, my dataset consists of approximately 2,000-5,000 lung sound recordings with well-annotated, high-quality labels. The task involves 2-4 multi-label classifications and is genuinely challenging, so it is harder than the typical datasets where accuracy saturates around 99%. I started with the 31st layer of LLaMA 7B (and also tried the 8th layer), then explored other LLMs such as LLaMA-2 7B, LLaMA-2 13B, and Phi-2 (still experimenting). LLaMA 7B gave a modest but consistent improvement of +1-3% F1, although it was not universal across all classification targets and did not appear with LLaMA-2 or Phi-2 (which might raise another question). This aligns with your findings.

To assess the impact of the LLM weights, I ran additional experiments with a randomly initialized LLaMA 7B layer structure (no pretrained weights loaded, but still frozen). Surprisingly, there were still improvements. Before speculating about the reasons, I'd suggest that you and others replicate this experiment. Could the observed effect be an artifact of my narrow task and limited data? Could this phenomenon also appear on a larger and more complex image dataset?
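For anyone replicating it, the variant amounts to building the decoder layer from a config instead of a checkpoint; a minimal sketch (the `layer_idx` argument only exists in newer transformers versions):

```python
from transformers import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

cfg = LlamaConfig()                                # defaults match 7B: hidden_size=4096, 32 heads
rand_block = LlamaDecoderLayer(cfg, layer_idx=0)   # same structure, fresh random weights
# (older transformers versions take only LlamaDecoderLayer(cfg))
for p in rand_block.parameters():
    p.requires_grad = False                        # frozen, exactly like the pretrained variant
```

Dropping `rand_block` in place of `self.llm_block` in the sketch above reproduces my random-initialization run.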