zhaoyang-0204 / gnp

gradient norm penalty
Apache License 2.0

Question about Figure 3 in the paper #1

Open JayC1208 opened 3 months ago

JayC1208 commented 3 months ago

Hi, I am walking through the experiments via the code, and I find it hard to understand the result of Figure 3 in "Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning".

In this figure, the gradient norm when $\alpha$ is 0.8 stays above zero and seems a bit large compared to the other cases, while the testing error rate remains low. Your paper suggests that a smaller gradient norm is desirable because it indicates flat minima, so this result is counter-intuitive.

Could you explain this in more detail?

zhaoyang-0204 commented 3 months ago

Hello!

Gradient norm is an intriguing and somewhat mysterious attribute in deep learning. Defined as the magnitude of the gradient, it can be interpreted as a metric indicating the overall curvature of the loss surface. As far as I know, explicitly elucidating the correlation between gradient norm and model generalization remains challenging, particularly in practical deep learning applications.
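To make the quantity concrete, here is a minimal sketch of the gradient norm and the penalized objective $L(\theta) + \lambda \lVert \nabla L(\theta) \rVert$. It is written in JAX purely for illustration and is not copied from this repo; the toy loss, the parameter shapes, and the name `lam` are assumptions for the example.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy quadratic loss, just for illustration.
    x, y = batch
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

def grad_norm(params, batch):
    # ||∇L(θ)||: the magnitude of the gradient of the loss w.r.t. the parameters.
    grads = jax.grad(loss_fn)(params, batch)
    return jnp.sqrt(jnp.sum(grads ** 2))

def penalized_loss(params, batch, lam=0.1):
    # L(θ) + λ·||∇L(θ)||: the gradient-norm-penalized objective.
    return loss_fn(params, batch) + lam * grad_norm(params, batch)

# Example usage with toy data:
# params = jnp.zeros(3)
# batch = (jnp.ones((4, 3)), jnp.zeros(4))
# print(penalized_loss(params, batch))
```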

In some sense, you may regard the gradient norm penalty as analogous to the weight penalty commonly used in deep learning. Typically, we expect to confine the search to a narrow weight region to facilitate more rapid convergence. However, it is important to avoid an excessive weight penalty, as it may lead to suboptimal minima that do not meet the requirements of the task.

By explicitly penalizing the gradient norm in our paper, we aim to converge towards minima with a flatter loss landscape. Conversely, over-regularization can steer the training towards bad minima, where the focus is placed disproportionately on the gradient norm of the loss surface and overshadows the task requirements. So we should set an appropriate regularization strength during training. Through empirical observations, we have found that setting $\alpha = 0.8$ yields the best performance for our training scenarios.
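For intuition, below is a rough sketch of how such a penalized gradient can be approximated with two gradient evaluations, in the same JAX style as above. The step size `r`, the helper name `gnp_grad`, and the exact blending are illustrative assumptions rather than the exact code in this repository; the point is that $\alpha$ interpolates between the plain task gradient ($\alpha = 0$) and a gradient taken at a point perturbed along the normalized gradient direction, which is why pushing $\alpha$ too high over-weights the penalty.

```python
def gnp_grad(params, batch, alpha=0.8, r=0.01):
    # Plain task gradient ∇L(θ).
    g = jax.grad(loss_fn)(params, batch)
    g_norm = jnp.sqrt(jnp.sum(g ** 2)) + 1e-12  # avoid division by zero
    # Gradient at a point nudged along the normalized gradient direction.
    g_perturbed = jax.grad(loss_fn)(params + r * g / g_norm, batch)
    # alpha balances the two terms: alpha = 0 recovers plain SGD,
    # while larger alpha puts more weight on flattening the loss surface.
    return (1.0 - alpha) * g + alpha * g_perturbed
```

Plugging this blended gradient into a standard optimizer step is what produces the under- vs. over-regularization trade-off described above.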

Hope this can give you some intuition.