yixuan / LBFGSpp

A header-only C++ library for L-BFGS and L-BFGS-B algorithms
https://lbfgspp.statr.me/
MIT License
532 stars 101 forks

Trouble converging #7

Closed · Svalorzen closed this issue 4 years ago

Svalorzen commented 4 years ago

Hi, I'm trying to estimate the parameters of a GP with a standard RBF kernel. I'm having some trouble converging to the same parameters I get when implementing the code in Python. I am using the current master branch (not sure if I should be using a particular commit).

With LBFGSSolver the search never seems to end, going around in loops. With LBFGSBSolver, the search ends with a thrown exception after about 20 iterations. I've tried modifying the parameters to make it search for longer, but it never really converges beyond what it has already achieved.

LBFGSBSolver gets somewhat close to the optimal solution found via Python (1.3455902517905933, 0.7625631641671134, with a negative log-likelihood of 5.8842014), but since it then throws, I don't want to simply wrap it in a try-catch when it isn't stopping cleanly (maybe it could actually reach the same solution as in Python).

I've double-checked my gradients, and I hope I haven't made any silly mistakes. Could you help me figure out whether there's something I should be doing differently? Here is my code, in case it helps:

#include <iostream>
#include <Eigen/Eigenvalues>

#include <LBFGSB.h>
#include <LBFGS.h>

constexpr double PI = 3.141592653589793238462643383279502884197169399375105820974944;

using Matrix2D = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor | Eigen::AutoAlign>;
using Vector = Eigen::Matrix<double, Eigen::Dynamic, 1>;

Matrix2D makeDist(const Matrix2D & x1, const Matrix2D & x2) {
    return ((-2.0 * (x1 * x2.transpose())).colwise() + x1.rowwise().squaredNorm()).rowwise() + x2.rowwise().squaredNorm().transpose();
}

Matrix2D kernel(const Matrix2D & x1, const Matrix2D & x2, double l = 1.0, double sigma = 1.0) {
    auto dist = makeDist(x1, x2);
    return std::pow(sigma, 2) * ((-0.5 / std::pow(l, 2)) * dist).array().exp();
}

struct Likelihood {
    Likelihood(const Matrix2D & xData, const Vector & muData, double noise) :
            xData_(xData), muData_(muData), noise_(noise) {}

    const Matrix2D & xData_;
    const Vector & muData_;
    double noise_;

    double operator()(const Eigen::VectorXd& x, Eigen::VectorXd& grad) {
        const auto l = x[0];
        const auto sigma = x[1];

        auto k = kernel(xData_, xData_, l, sigma);
        k.diagonal().array() += std::pow(noise_, 2);

        auto kinv = k.inverse();

        // Compute derivatives as per
        // https://math.stackexchange.com/questions/1030534/gradients-of-marginal-likelihood-of-gaussian-process-with-squared-exponential-co
        Matrix2D dkl = k.array() * (makeDist(xData_, xData_) / std::pow(l, 3)).array();
        auto dks = 2 / sigma * k;

        grad[0] = 0.5 * muData_.transpose() * kinv * dkl * kinv * muData_ - 0.5 * (kinv * dkl).trace();
        grad[1] = 0.5 * muData_.transpose() * kinv * dks * kinv * muData_ - 0.5 * (kinv * dks).trace();
        // Minimize negative log-likelihood
        grad = -grad;

        // Compute log likelihood (correct)
        auto L = k.llt();
        Vector alpha = L.solve(muData_);

        const double logLikelihood = -0.5 * muData_.transpose() * alpha - L.matrixL().selfadjointView().diagonal().array().log().sum() - 0.5 * muData_.size() * std::log(2.0 * PI);

        // Print current parameters
        std::cout << l << ", " << sigma << " ==> " << -logLikelihood << "  -  " << grad.transpose() << std::endl;

        // Return negative log-likelihood
        return -logLikelihood;
    }
};

int main() {
    const double noise = 0.4;

    // Some test points to train on.
    Matrix2D X_train(6, 1);
    X_train << -5, -4, -3, -2, -1, 1;

    Vector Y_train = X_train.array().sin().array();

    // Optimizer params
    LBFGSpp::LBFGSBParam<double> param;
    param.epsilon = 1e-5;
    param.max_iterations = 100;

    param.m = 10;
    param.ftol = 0.4;

    LBFGSpp::LBFGSParam<double> paramx;
    param.epsilon = 1e-5;
    param.max_iterations = 100;

    // Bounds
    Vector lb = Vector::Constant(2, 1e-8);
    Vector ub = Vector::Constant(2, std::numeric_limits<double>::infinity());
    // Initial guess
    Vector x = Vector::Constant(2, 1.0);

    LBFGSpp::LBFGSBSolver<double> solver(param);
    LBFGSpp::LBFGSSolver<double> solver2(paramx);

    // Log-likelihood function with set data/noise
    Likelihood like(X_train, Y_train, noise);

    double log;
    // solver.minimize(like, x, log, lb, ub);
    solver2.minimize(like, x, log);

    return 0;
}
yixuan commented 4 years ago

Thanks for reporting. I'll take a look today.

yixuan commented 4 years ago

The reason is that the gradient was incorrectly calculated.

Since you add diagonal noise to K, the derivatives with respect to l and sigma need to be computed before the noise is added to K.
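
Concretely, the noise term sigma_n^2 * I does not depend on l or sigma, so the derivatives are those of the noiseless kernel. A minimal sketch of just that ordering, reusing the names from your code (not a full fix, see the complete version below):

Matrix2D k = kernel(xData_, xData_, l, sigma);                             // noiseless K
Matrix2D dkl = k.cwiseProduct(makeDist(xData_, xData_)) / std::pow(l, 3);  // dK/dl
Matrix2D dks = (2.0 / sigma) * k;                                          // dK/dsigma
k.diagonal().array() += std::pow(noise_, 2);                               // add the noise only afterwards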

Other tweaks:

  1. You should cache makeDist(xData_, xData_) outside kernel() and operator(), since it stays the same.
  2. Also cache the Cholesky decomposition of K to compute other related quantities, for example K^(-1), K^(-1) * y, and log|K|.
  3. Avoid direct matrix-matrix multiplication: instead of y.transpose() * kinv * dkl * kinv * y, compute alpha = kinv * y once and then use alpha.transpose() * dkl * alpha, so that only matrix-vector products are needed.
  4. tr(A * B) = sum(A .* B^T), where .* is elementwise multiplication (the transpose can be dropped here because both matrices are symmetric). Forming the matrix product first is O(n^3), whereas the elementwise sum is O(n^2). A quick check of this identity is sketched right after this list.
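
For reference, a standalone numerical check of that identity (just an illustration with random matrices, not part of the fix):

#include <iostream>
#include <Eigen/Core>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(5, 5);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(5, 5);
    B = ((B + B.transpose()) / 2).eval();   // symmetrize B, like K^(-1) and dK/dl above

    const double viaTrace  = (A * B).trace();                      // O(n^3): forms A * B first
    const double viaSum    = A.cwiseProduct(B.transpose()).sum();  // O(n^2): sum(A .* B^T)
    const double viaSumSym = A.cwiseProduct(B).sum();              // same, since B = B^T

    std::cout << viaTrace << " " << viaSum << " " << viaSumSym << std::endl;  // all three agree
    return 0;
}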

Below is my version:

#include <iostream>
#include <Eigen/Core>
#include <Eigen/Cholesky>

#include <LBFGSB.h>
#include <LBFGS.h>

constexpr double PI = 3.141592653589793238462643383279502884197169399375105820974944;

using Matrix2D = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor | Eigen::AutoAlign>;
using Vector = Eigen::Matrix<double, Eigen::Dynamic, 1>;

Matrix2D makeDist(const Matrix2D & x1, const Matrix2D & x2) {
    return ((-2.0 * (x1 * x2.transpose())).colwise() + x1.rowwise().squaredNorm()).rowwise() +
            x2.rowwise().squaredNorm().transpose();
}

Matrix2D kernel(const Matrix2D & dist, double l = 1.0, double sigma = 1.0) {
    return std::pow(sigma, 2) * ((-0.5 / std::pow(l, 2)) * dist).array().exp();
}

struct Likelihood {
    Likelihood(const Matrix2D & xData, const Vector & muData, double noise) :
            dist_(makeDist(xData, xData)), muData_(muData), noise_(noise) {}

    const Matrix2D dist_;
    const Vector & muData_;
    double noise_;

    double operator()(const Eigen::VectorXd& x, Eigen::VectorXd& grad) {
        const int N = muData_.size();
        const double l = x[0];
        const double sigma = x[1];

        Matrix2D k = kernel(dist_, l, sigma);

        // Compute derivatives as per
        // https://math.stackexchange.com/questions/1030534/gradients-of-marginal-likelihood-of-gaussian-process-with-squared-exponential-co
        Matrix2D dkl = k.cwiseProduct(dist_) / std::pow(l, 3);
        Matrix2D dks = (2.0 / sigma) * k;

        k.diagonal().array() += std::pow(noise_, 2);
        auto llt = Eigen::LLT<Matrix2D>(k);
        Matrix2D kinv = llt.solve(Matrix2D::Identity(N, N));
        Vector alpha = llt.solve(muData_);

        grad[0] = 0.5 * alpha.transpose() * dkl * alpha - 0.5 * kinv.cwiseProduct(dkl).sum();
        grad[1] = 0.5 * alpha.transpose() * dks * alpha - 0.5 * kinv.cwiseProduct(dks).sum();
        // Minimize negative log-likelihood
        grad = -grad;

        // Compute log likelihood (correct)
        const double logLikelihood = -0.5 * muData_.dot(alpha) - llt.matrixL().selfadjointView().diagonal().array().log().sum() - 0.5 * N * std::log(2.0 * PI);

        // Print current parameters
        std::cout << l << ", " << sigma << " ==> " << -logLikelihood << "  -  " << grad.transpose() << std::endl;

        // Return negative log-likelihood
        return -logLikelihood;
    }
};

int main() {
    const double noise = 0.4;

    // Some test points to train on.
    Matrix2D X_train(6, 1);
    X_train << -5, -4, -3, -2, -1, 1;
    Vector Y_train = X_train.array().sin();

    // Log-likelihood function with set data/noise
    Likelihood like(X_train, Y_train, noise);

    // Bounds
    Vector lb = Vector::Constant(2, 1e-8);
    Vector ub = Vector::Constant(2, std::numeric_limits<double>::infinity());
    // Initial guess
    Vector x = Vector::Constant(2, 1.0);

    // L-BFGS-B
    LBFGSpp::LBFGSBParam<double> param;
    param.epsilon = 1e-5;
    param.max_iterations = 100;
    LBFGSpp::LBFGSBSolver<double> solver(param);
    double log;
    solver.minimize(like, x, log, lb, ub);

    std::cout << "\n===============================\n\n";

    LBFGSpp::LBFGSParam<double> paramx;
    paramx.epsilon = 1e-5;
    paramx.max_iterations = 10;
    LBFGSpp::LBFGSSolver<double> solver2(paramx);
    x = Vector::Constant(2, 1.0);
    solver2.minimize(like, x, log);

    return 0;
}

Output:

1, 1 ==> 6.49954  -  -1.64172  2.33992
1.85404, 0.479791 ==> 7.35173  -   2.13605 -6.82365
1.3421, 0.791619 ==> 5.88876  -  -0.101547  0.295286
1.36368, 0.760786 ==> 5.88481  -   0.0601469 -0.0763338
1.35529, 0.76696 ==> 5.88431  -  0.0152388 0.0146282
1.35236, 0.76593 ==> 5.88426  -  0.00972153  0.0133352
1.34553, 0.762505 ==> 5.8842  -  -5.29077e-06 -0.000405903
1.34559, 0.762563 ==> 5.8842  -   1.4304e-06 1.52874e-06

===============================

1, 1 ==> 6.49954  -  -1.64172  2.33992
1.57435, 0.181389 ==> 10.1205  -  1.09806 -14.218
1.28717, 0.590695 ==> 6.05536  -   0.38649 -2.43346
1.21589, 0.792321 ==> 5.92615  -  -0.488156  0.681163
1.25462, 0.747928 ==> 5.89378  -  -0.233815   0.12984
1.28602, 0.735536 ==> 5.88844  -  -0.0968389  -0.107242
1.3083, 0.741189 ==> 5.88623  -  -0.045728 -0.113439
1.34315, 0.760708 ==> 5.88421  -  -0.00147428  -0.0116414
1.34548, 0.76243 ==> 5.8842  -  0.000102842 -0.00104711
1.3456, 0.762561 ==> 5.8842  -     2.875e-05 -4.39889e-05
1.34559, 0.762563 ==> 5.8842  -   3.86423e-06 -3.34703e-06
Svalorzen commented 4 years ago

Sorry for my mistake; I hope it didn't cost you too much time. Thanks a ton for the help!

yixuan commented 4 years ago

Not a big deal. It was actually a good time to refresh my memory on GP fitting. :joy: