yixuan / LBFGSpp

A header-only C++ library for L-BFGS and L-BFGS-B algorithms
https://lbfgspp.statr.me/
MIT License

Why is the stopping criterion Gradient Norm < epsilon_relative * x.norm()? #34

Closed CJCHEN1230 closed 1 year ago

CJCHEN1230 commented 1 year ago

First, thanks to the author for sharing such a great piece of source code. I have recently been learning L-BFGS and traced the code of this library, but I really don't understand why one of the stopping conditions is m_gnorm <= m_param.epsilon_rel * x.norm(). Why is the norm of the gradient compared against the norm of the parameter? Could anyone provide a keyword or an explanation? Thank you.

WhateverLiu commented 1 year ago

I guess the rationale can be traced back to plain gradient descent $x_n \gets x_{n-1} - r f^{\prime}(x_{n-1})$. If $f^{\prime}(x_{n-1})$ is negligibly small relative to $x_{n-1}$, the update will not be meaningful and we declare convergence.

jiyess commented 1 year ago

This stopping condition checks for relative convergence of the objective function. The convergence of an optimization algorithm can be determined in a number of ways, and it's common to use the norm of the gradient (gnorm) as an indicator. When this norm is sufficiently small, we say that the algorithm has converged, i.e., it has found a minimum.

However, we sometimes can't use an absolute threshold for this "small enough" norm, because the solution might be large in magnitude, and hence even at the solution the gradient might not be numerically small. To counter this, we use a relative measure instead of an absolute one.

In the expression m_gnorm <= m_param.epsilon_rel * x.norm(), m_gnorm is the norm of the gradient, m_param.epsilon_rel is a small positive number indicating the relative tolerance, and x.norm() is the norm of the current point in the parameter space.

By scaling the tolerance with the norm of x, we're saying that we consider the algorithm to be converged when the gradient is small relative to the current point x. This allows the stopping condition to adjust dynamically based on the magnitude of the parameters.

The concept of this type of convergence is sometimes referred to as relative convergence or scaled convergence in the literature on optimization.

CJCHEN1230 commented 1 year ago

@jiyess Thanks for your detailed reply. As I understand it, relative convergence is usually something like $\Vert f(x^{(k)})\Vert \le \delta \Vert f(x^{(k-1)})\Vert + \epsilon$ or $\Vert x^{(k)} - x^{(k-1)}\Vert \le \delta \Vert x^{(k)}\Vert$

But setting the gradient norm relative to the position norm as the stopping criterion seems to be case-dependent.

For example: [image: two cases, Case I and Case II, differing only by a shift]

The difference between these two cases is only a shift, but Case I would be much easier to trigger the stopping condition. Does that make sense?

WhateverLiu commented 1 year ago

@CJCHEN1230 I think you are right about Cases I and II. This is why standardization/normalization of the data is encouraged, if possible, before optimization.

jiyess commented 1 year ago

@CJCHEN1230 Indeed, your interpretation is correct. In numerical optimization, the context and specific use case heavily influence what is considered an "acceptable" error. Significant digits serve as a tool to adjust this threshold based on the magnitude of the variables involved.

In Case I, where the parameter values are on the order of thousands, an error of 1.0 might be completely acceptable given the overall context. On the other hand, in Case II, where the parameter values are much smaller (around 5), an error of 1.0 would represent a much larger proportion of the parameter value, and hence would be considered significant and potentially problematic.

Hence, by comparing the gradient norm relative to the parameter norm, we are essentially normalizing the stopping criterion to account for the scale of the problem. This provides a more flexible and context-dependent criterion for determining when the algorithm has sufficiently converged.

However, the condition m_gnorm <= m_param.epsilon_rel * x.norm() is just one of the many possible ways to determine convergence. It suits many scenarios well because it scales with the magnitude of the problem. However, other criteria might be more suitable for different contexts or specific applications.

So, feel free to adapt the stopping criteria to best suit your needs in your particular application. And remember, the ultimate goal is to find a solution that is "good enough" for your purposes, given the resources (like time and computing power) that you have available. Happy optimizing! ; )

CJCHEN1230 commented 1 year ago

@jiyess At the beginning, I thought the norm of the gradient was unrelated to the norm of the parameter, so I couldn't understand why this condition could serve as a general stopping condition. But now, as I understand it, the norm of the gradient will become much smaller than it was at the beginning, while the norm of the parameter has no certain direction of change and will not change as much as the norm of the gradient. That is why we can set a proper epsilon_rel to decide when to stop the optimization. Am I correct?

jiyess commented 1 year ago

@CJCHEN1230 Exactly! ; ) In the current implementation, the stopping criterion is $\Vert g \Vert \le \max\{\varepsilon_{abs}, \varepsilon_{rel} \Vert x \Vert\}$, i.e., m_gnorm <= max(epsilon_abs, m_param.epsilon_rel * x.norm()). This is implemented to handle scenarios where x.norm() might be too small. However, a potentially superior approach would be m_gnorm <= max(epsilon_abs, m_param.epsilon_rel * x0.norm()), where x0 represents the initial guess. This modification would base the stopping criterion on the magnitude of the initial guess.

Essentially, this relative criterion reduces the effect of the parameter norm or scale. Hope it helps!

CJCHEN1230 commented 1 year ago

@jiyess I totally understand it. Really appreciate your patience in explaining.

yixuan commented 1 year ago

Hi all, thanks for the insightful discussions and I apologize that I was too busy to join the conversation.

I remember this parameter was adapted from the libLBFGS library that inspired the development of LBFGS++. As @jiyess mentioned, users have the freedom to set tolerance parameters, and the overall stopping criterion is $$\Vert g \Vert \le \max\{\varepsilon_{abs}, \varepsilon_{rel} \Vert x \Vert\},$$ so by setting $\varepsilon_{rel}=0$ you still get the absolute tolerance.

@WhateverLiu 's intuition is also useful. $\Vert g_k \Vert$ in some sense measures the difference $\Vert x_{k+1} - x_k \Vert$, so $\Vert g_k \Vert \le \varepsilon_{rel} \Vert x_k \Vert$ can be viewed as an approximation to the condition $\Vert x_{k+1} - x_k \Vert \le \varepsilon_{rel} \Vert x_k \Vert$, which corresponds to @CJCHEN1230 's relative convergence criterion.