Hi, thank you for sharing this interesting discovery. I'd recommend an additional ablation: keep the LLM layer structure but use random initialization (still frozen), which makes for a more convincing comparison.
In my research, I've adopted your method and incorporated it into my backbone Transformer. My focus is a specific task: detecting abnormalities in lung sounds. My model is closely related to AST/Whisper, taking an STFT spectrogram as input and stacking multiple Transformer blocks.
In my experiments, the dataset consists of approximately 2,000-5,000 lung sound recordings with well-annotated, high-quality labels. The task involves 2-4 multi-label classifications and is genuinely challenging, more so than the typical datasets where accuracy saturates near 99%. I started with a 31-layer LLaMA 7B (including the 8th layer), then explored other LLMs such as LLaMA 2 7B, LLaMA 2 13B, and Phi-2 (still experimenting). LLaMA 7B gave a modest but fairly consistent improvement of +1-3% in F1 score; however, it was not observed across all classification targets, and it did not appear with LLaMA 2 or Phi-2 (which raises a further question). This aligns with your findings.
To isolate the effect of the LLM weights, I ran an additional experiment with a randomly initialized LLaMA 7B structure (i.e., the pretrained weights were not imported, but the layers were still kept frozen). Surprisingly, the improvements remained. Before speculating about the reasons, I'd suggest that you and others try to replicate this experiment. Could the observed effect stem from the narrowness of my task and the limited number of recordings? Would the same phenomenon appear on a larger and more complex image dataset?
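For anyone wanting to replicate the setup, here is a minimal PyTorch sketch of the two conditions being compared: a frozen transformer block appended to a small trainable backbone, where the block's weights are either pretrained (loaded by the caller) or left at random initialization. The layer sizes and the use of `nn.TransformerEncoderLayer` as a stand-in for an actual LLaMA layer are illustrative assumptions, not my exact model.

```python
import torch
import torch.nn as nn

class ClassifierWithFrozenBlock(nn.Module):
    """Trainable backbone + one frozen transformer block + classification head.

    The frozen block stands in for a single LLM layer. Pass a pretrained
    layer via `llm_block` for the main condition, or pass nothing to get
    the random-initialization ablation; the block is frozen either way.
    """
    def __init__(self, d_model=256, num_labels=4, llm_block=None):
        super().__init__()
        # Small stand-in backbone (the real model is an AST/Whisper-style encoder).
        self.backbone = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Frozen block: pretrained if supplied, otherwise randomly initialized.
        self.llm_block = llm_block or nn.TransformerEncoderLayer(
            d_model, nhead=8, batch_first=True
        )
        for p in self.llm_block.parameters():
            p.requires_grad = False  # frozen in both conditions
        self.head = nn.Linear(d_model, num_labels)

    def forward(self, x):  # x: (batch, time, d_model) spectrogram features
        h = self.backbone(x)
        h = self.llm_block(h)            # frozen pass-through block
        return self.head(h.mean(dim=1))  # pooled multi-label logits

# Random-initialization ablation: no pretrained weights are loaded.
model = ClassifierWithFrozenBlock()
logits = model(torch.randn(2, 100, 256))
print(logits.shape)  # torch.Size([2, 4])
```

Swapping in the pretrained condition only changes the `llm_block` argument; the optimizer, data, and training loop stay identical, so any F1 gap between the two runs is attributable to the weights rather than the extra capacity.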