zhuchen03 / FreeLB

Adversarial Training for Natural Language Understanding

Is it still working with update_freq > 1? #4

Closed hitvoice closed 4 years ago

hitvoice commented 4 years ago

In the fairseq implementation, the `update_freq` option (from the original fairseq code) controls how often the optimizer updates the model parameters. When `update_freq` > 1, gradients are accumulated over several mini-batches and gradient synchronization across workers is deferred until the last accumulation step. In adversarial training, is gradient synchronization needed while computing the gradients? If so, does setting `update_freq` > 1 make the computation incorrect?
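For reference, here is a minimal sketch of what I mean by accumulation with deferred synchronization (illustrative only, not fairseq's actual code; the function and argument names are made up), using PyTorch DDP's `no_sync()`:

```python
import contextlib

def accumulate_and_step(ddp_model, batches, loss_fn, optimizer, update_freq=4):
    # Illustrative sketch: gradients from the first update_freq - 1 mini-batches
    # are accumulated locally inside no_sync(); the cross-worker all-reduce
    # only happens on the backward pass of the last mini-batch.
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches[:update_freq]):
        sync_ctx = contextlib.nullcontext() if i == update_freq - 1 else ddp_model.no_sync()
        with sync_ctx:
            loss = loss_fn(ddp_model(inputs), targets) / update_freq
            loss.backward()
    optimizer.step()
```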

zhuchen03 commented 4 years ago

Sorry, I didn't notice your post earlier. It shouldn't. I modified the code so that the ascent steps are finished inside the forward function: for each mini-batch, all K gradient ascent steps are executed while the parameter gradients accumulate (the FreeLB algorithm), before we execute the ascent steps on the next mini-batch.

I used gradient accumulation most of the time.
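Roughly, the inner loop looks like the sketch below. This is illustrative only, not the exact code in this repo: `embed_fn`, the argument names, and the `inputs_embeds` keyword (HuggingFace-style) are assumptions for the example. The point is that the K ascent steps only add to the parameters' `.grad` buffers, so further accumulation across mini-batches with `update_freq` > 1 composes naturally with it.

```python
import torch

def freelb_accumulate_grads(model, embed_fn, input_ids, labels, loss_fn,
                            adv_steps=3, adv_lr=1e-1, adv_max_norm=1e-1):
    # Hypothetical helper, not this repo's actual API. Runs the K ascent steps
    # for ONE mini-batch and leaves the summed gradient in the parameters'
    # .grad buffers (no optimizer.step() here).
    delta = None
    for _ in range(adv_steps):
        # Re-embed the tokens every step so each backward pass has a fresh graph.
        embeds = embed_fn(input_ids)  # (batch, seq_len, hidden)
        if delta is None:
            delta = torch.zeros_like(embeds).uniform_(-adv_max_norm, adv_max_norm)
        delta.requires_grad_()

        loss = loss_fn(model(inputs_embeds=embeds + delta), labels)
        # Average over the ascent steps so the accumulated parameter gradient
        # behaves like one "virtual" batch of K adversarial examples.
        (loss / adv_steps).backward()

        # Gradient ascent on the perturbation, normalized per example,
        # then projected back into the L2 ball of radius adv_max_norm.
        grad = delta.grad.detach()
        grad_norm = grad.view(grad.size(0), -1).norm(dim=1).view(-1, 1, 1)
        delta = (delta + adv_lr * grad / (grad_norm + 1e-12)).detach()
        delta = delta.renorm(p=2, dim=0, maxnorm=adv_max_norm)
    # At this point the .grad buffers hold this mini-batch's FreeLB gradient;
    # with update_freq > 1 they simply keep accumulating across mini-batches
    # before the optimizer step.
```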

hitvoice commented 4 years ago

I see. Thanks for your reply!