Closed by hitvoice 4 years ago
Sorry, I didn't notice your post. It shouldn't. I modified the code to run the ascent steps inside the forward function, so the K gradient-ascent steps are executed for each minibatch while the parameter gradients accumulate (the FreeLB algorithm), before the ascent steps for the next minibatch begin.
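To make the structure concrete, here is a minimal sketch of the FreeLB update on a toy scalar model `y = w*x` with squared loss, using hand-derived gradients instead of autograd. All names (`freelb_step`, the learning rates, `K`, `eps`) are illustrative and not taken from the actual repo; the point is only the shape of the loop: K ascent steps on the input perturbation `delta`, parameter gradients averaged across those steps, then one descent step on `w`.

```python
import numpy as np

def freelb_step(w, x, t, K=3, lr=0.1, adv_lr=0.05, eps=0.3):
    """One FreeLB-style update on the toy model y = w * x with loss (y - t)**2.

    K gradient-ascent steps are taken on the input perturbation delta while
    the parameter gradient is accumulated (averaged over the K steps); only
    then is a single descent step applied to w.
    """
    delta = 0.0          # input-space perturbation, reset for each minibatch
    grad_w_acc = 0.0     # accumulated (averaged) parameter gradient
    for _ in range(K):
        pred = w * (x + delta)
        err = pred - t
        grad_w = 2 * err * (x + delta)        # dL/dw at the perturbed input
        grad_delta = 2 * err * w              # dL/ddelta
        grad_w_acc += grad_w / K              # accumulate over the K steps
        delta += adv_lr * np.sign(grad_delta)  # ascent step on delta
        delta = float(np.clip(delta, -eps, eps))  # project into the eps-ball
    return w - lr * grad_w_acc                # single descent step on w

# toy usage: the update drives w toward the target mapping t = 2 * x
w = 0.0
for _ in range(100):
    w = freelb_step(w, x=1.0, t=2.0)
```

This mirrors the structure described above: the inner loop never updates `w`, and the next minibatch starts from a fresh `delta`.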
I used gradient accumulation most of the time.
I see. Thanks for your reply!
In the fairseq implementation, the "update_freq" configuration (from the original fairseq code) specifies how often the optimizer updates the model parameters. When update_freq > 1, it accumulates gradients and halts gradient synchronization until the last step. In adversarial training, is gradient synchronization needed during gradient computation? If so, does setting update_freq > 1 make the computation incorrect?
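For plain gradient accumulation (no inner ascent coupling between workers), deferring synchronization to the last step is mathematically equivalent to syncing every micro-batch, because all-reduce is linear. A small simulation of two workers and update_freq = 4, with made-up random gradients, illustrates this; whether FreeLB's inner loop additionally needs synchronized gradients is a separate question for the authors, since the ascent acts on per-example input gradients rather than parameter gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated per-worker parameter gradients: (workers, micro-batches, params)
grads = rng.normal(size=(2, 4, 3))

# strategy 1: all-reduce (mean over workers) after every micro-batch, then sum
sync_each = sum(grads[:, i, :].mean(axis=0) for i in range(4))

# strategy 2 (update_freq-style): accumulate locally, all-reduce once at the end
sync_last = np.stack([grads[w].sum(axis=0) for w in range(2)]).mean(axis=0)

# identical because the all-reduce mean distributes over the local sums
assert np.allclose(sync_each, sync_last)
```

So for ordinary accumulation the halted synchronization changes nothing numerically; any difference would have to come from the adversarial inner loop itself.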