Bug in noisy mixture of experts

yfzhang114 / SliME

✨✨Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Apache License 2.0

Bug in noisy mixture of experts #3

Closed: neerajanand321 closed this issue 4 months ago

neerajanand321 commented 4 months ago

In the training branch (if self.training:), the line raw_noise_stddev = x @ self.w_gate should use self.w_noise instead of self.w_gate.
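For readers following along, here is a minimal sketch of the corrected branch. Only the `raw_noise_stddev` line comes from this issue. The class name `NoisyGate`, the zero initialization, the `noise_epsilon` value, and the softplus/softmax steps are assumptions based on the usual noisy gating formulation, not the SliME code itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyGate(nn.Module):
    """Hypothetical minimal noisy gate; a sketch, not the SliME module."""

    def __init__(self, input_size: int, num_experts: int, noise_epsilon: float = 1e-2):
        super().__init__()
        # Zero initialization is an assumption made for this sketch.
        self.w_gate = nn.Parameter(torch.zeros(input_size, num_experts))
        self.w_noise = nn.Parameter(torch.zeros(input_size, num_experts))
        self.noise_epsilon = noise_epsilon

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clean routing logits always come from w_gate.
        clean_logits = x @ self.w_gate
        if self.training:
            # Corrected line: the noise scale is computed from w_noise, not w_gate.
            raw_noise_stddev = x @ self.w_noise
            # Softplus keeps the standard deviation positive.
            noise_stddev = F.softplus(raw_noise_stddev) + self.noise_epsilon
            logits = clean_logits + torch.randn_like(clean_logits) * noise_stddev
        else:
            # No noise is injected at evaluation time.
            logits = clean_logits
        return F.softmax(logits, dim=-1)
```

With the original line, w_noise would not take part in this forward pass, so it would receive no gradient unless it is used elsewhere in the model.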

yfzhang114 commented 4 months ago

Thank you for pointing that out. You are correct; there is an error in using self.w_gate instead of self.w_noise. We will correct this mistake promptly. Your feedback is greatly appreciated.

neerajanand321 commented 4 months ago

You're welcome! Just to confirm: are your reported results based on this code, i.e., using w_gate instead of w_noise?

yfzhang114 commented 4 months ago

I will retrain a 7B model to verify the performance. In my opinion, the noise parameter may not significantly impact the results since we have only two experts, and the final results did not show over-reliance on either expert, indicating stable training.
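One way to check for over-reliance on a single expert is to look at how the total gate weight is split across experts. The sketch below is only an illustration: `expert_usage` is a hypothetical helper, the gate values are made up, and this is not necessarily how the balance was measured here.

```python
import torch


def expert_usage(gates: torch.Tensor) -> torch.Tensor:
    """Return the fraction of total gate weight assigned to each expert.

    gates: (num_tokens, num_experts) routing weights, one row per token.
    """
    importance = gates.sum(dim=0)         # total weight per expert
    return importance / importance.sum()  # normalize to fractions


# Made-up routing weights for four tokens and two experts.
gates = torch.tensor([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5], [0.6, 0.4]])
usage = expert_usage(gates)
```

A usage vector close to uniform (here, close to 0.5 for each of the two experts) would be consistent with neither expert dominating.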

neerajanand321 commented 4 months ago

Got it, thank you very much.