Closed: neerajanand321 closed this 4 months ago
Thank you for pointing that out. You are correct; there is an error in using self.w_gate instead of self.w_noise. We will correct this mistake promptly. Your feedback is greatly appreciated.
You're welcome! Just to confirm: are your reported results based on this code, i.e., using w_gate instead of w_noise?
I will retrain a 7B model to verify the performance. In my opinion, the noise parameter may not significantly impact the results since we have only two experts, and the final results did not show over-reliance on either expert, indicating stable training.
Got it, thank you very much.
    if self.training:
        raw_noise_stddev = x @ self.w_gate

Here, self.w_noise should be used instead of self.w_gate.
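To make the difference concrete, here is a minimal pure-Python sketch of the noisy gating step with the fix applied. It is not the repository's code: the helper names (matvec, softplus, noisy_logits), the noise_epsilon parameter, and the softplus-plus-epsilon form of the noise scale are assumptions based on the common noisy top-k gating formulation, since the thread shows only the single line above.

```python
import math
import random

def matvec(x, w):
    """Compute x @ w for a vector x (length d) and a matrix w (d rows, one column per expert)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def softplus(v):
    # log(1 + exp(v)), written so it stays stable for large |v|
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def noisy_logits(x, w_gate, w_noise, training=True, noise_epsilon=1e-2, rng=random):
    """Hypothetical gating helper: clean logits plus input-dependent Gaussian noise."""
    clean_logits = matvec(x, w_gate)
    if not training:
        return clean_logits
    # The fix discussed in this thread: project with w_noise, not w_gate.
    raw_noise_stddev = matvec(x, w_noise)
    # Assumed form of the noise scale: softplus keeps it positive, epsilon keeps it nonzero.
    noise_stddev = [softplus(r) + noise_epsilon for r in raw_noise_stddev]
    return [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean_logits, noise_stddev)]
```

With w_gate used in place of w_noise, raw_noise_stddev would equal clean_logits, so the noise scale would be tied to the gate logits and w_noise would never affect the output or receive a gradient through this path.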