LogisticGroupLasso is not running stably and seems inferior to scikit-learn's LogisticRegression

normanius commented 4 years ago

Hi Yngve

I have been experimenting a bit with LogisticGroupLasso (for binary classification) and found it not to work super reliably, yet. LogisticGroupLasso, from what I have understood, is a recently added extension of the core functionality: sparse group lasso for regression. Is it possible that the logistic regression variant suffers from some early illnesses, still?

To demonstrate what I mean, I compared LogisticGroupLasso with sklearn's LogisticRegression estimator, on both synthetic and real-world data. To make the two comparable, I zeroed both regularizers l1_reg and group_reg. This converts the sparse group-lasso problem into a standard logistic regression problem.

To reproduce the observations, run the attached script comparison_group_lasso_scikit.py. Those are the parameters to adjust

# Settings.
d = 2           # Number of features
n = 100         # Number of samples per class
p = 100         # Number of difficulty levels
n_iter = 100    # Number of max iterations

In the test, I generate a series of p=100 synthetic classification problems with increasing difficulty. The data consists of d=2 predictors with n=100 samples per class, the difficulty is increased by increasing the noise in the data. Here is how the point clouds look like for three different difficulties:

First of all, I observed that FISTA convergence almost always is a problem, unless the problem is very easy, that is the data is separable with a large-enough margin. Is there anything one can do to improve convergence?

Secondly, it appears that LogisticGroupLasso often is inferior to sklearn's LogisticRegression. Note that the average difference in prediction accuracy does not improve with a larger setting for n_iter.

Thirdly, LogisticGroupLasso sometimes yields good results, even FISTA has not converged. But in the majority of cases (about 60%) it "fails". What are the reasons for the "sometimes-character" of the problem? It looks as if the "failure rate" is not strongly dependent on the problem difficulty.

Fourthly, the failure rate is reduced for a larger number of features d, from 60% to below 10%. In other words, the results by FISTA are closer to those from lbfgs (the solver in LogisticRegression) if more features are present. For this test, I increased the number of samples n from 100 to 1000. Adding more evidence (with every dimension adding the same amount of information) makes the problem certainly easier. Note that this is not reflected by my "difficulty" indicator.

Any recommendations on how to improve the performance of LogisticGroupLasso? It's perfectly possible that I'm not using LogisticGroupLasso the right way. Here's how I perform the fitting:

def compute_logistic_group_lasso(X, y, rs, n_iter=100):
    LogisticGroupLasso.LOG_LOSSES = True
    gl = LogisticGroupLasso(
        groups=np.arange(X.shape[1]),
        group_reg=0.0,
        l1_reg=0.0,
        supress_warning=True,
        random_state=rs,
        n_iter=n_iter,
        tol=1e-4,
    )
    gl.fit(X, y)
    y_pred = gl.predict(X).ravel()
    accuracy = (y_pred == y).mean()
    score = gl.score(X, y)
    assert(accuracy==score)
    return score

Other feedback regarding usability:

sklearn returns 1d-arrays for estimator.predict() with shape (n,), LogisticGroupLasso.predict() returns 2d arrays with shape (n,1). I think one just has to call y.ravel() when appropriate.

normanius commented 4 years ago

Here the entire code to make sure it is not lost...


"""
Comparison of standard logistic regression
"""

import numpy as np
import matplotlib.pyplot as plt
from group_lasso import LogisticGroupLasso

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

################################################################################
def compute_logistic_regression(X, y, rs, n_iter=100):
    """
    Note: sklearn's LogisticRegression by default applies regularization.
    In the experiments below, scores are lower with regularization (less
    overfitting), however the effect should not be big because d=2 << n=200.

    https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    """
    lr = LogisticRegression(
        random_state=rs,
        solver="lbfgs",
        max_iter=n_iter,
        tol=1e-4,
    )
    lr.fit(X, y)
    y_pred = lr.predict(X)
    accuracy = (y_pred == y).mean()
    score = lr.score(X, y)
    assert(accuracy==score)
    return score

################################################################################
def compute_logistic_group_lasso(X, y, rs, n_iter=100):
    LogisticGroupLasso.LOG_LOSSES = True
    gl = LogisticGroupLasso(
        groups=np.arange(X.shape[1]),
        group_reg=0.0,
        l1_reg=0.0,
        supress_warning=True,
        random_state=rs,
        n_iter=n_iter,
        tol=1e-4,
    )
    gl.fit(X, y)
    y_pred = gl.predict(X).ravel()
    accuracy = (y_pred == y).mean()
    score = gl.score(X, y)
    assert(accuracy==score)
    return score

################################################################################
def generate_classification(n, d, noise, rs):
    """
    n:      number of samples per class
    d:      number of features
    noise:  amount of uncertainty added to
    rs:     random state

    Balanced classes, noise governed by a normal distribution,
    """
    # Create separable centers.
    dist = 10
    mu1 = rs.uniform(high=10, size=d)
    mu2 = rs.uniform(low=-1, high=1, size=d) # use as direction
    mu2 = mu2/np.linalg.norm(mu2)
    mu2 = mu1+dist*mu2

    # Noise
    n1 = n
    n2 = n
    X1 = rs.normal(mu1, noise, size=(n1,d))
    X2 = rs.normal(mu2, noise, size=(n2,d))
    y1 = np.ones(n1)
    y2 = np.zeros(n2)

    X = np.concatenate([X1,X2], axis=0)
    y = np.concatenate([y1,y2])

    # Shuffle
    i = rs.permutation(len(y))
    X = X[i,:]
    y = y[i]
    return X, y

################################################################################
# MAIN
################################################################################

# Compare LogisticGroupLasso with sklearn's LogisticRegression...
#   1) ... on synthetic data
#   2) ... on real-world data (the breast-cancer dataset)

rs = np.random.RandomState(42)

###########################################################
# Synthetic data
###########################################################
print("1) Synthethic problem...")

###########################################################
# Settings.
d = 2           # Number of features
n = 100         # Number of samples per class
p = 100         # Number of difficulty levels
n_iter = 100    # Number of max iterations
###########################################################

noises = np.linspace(1,10,p)
difficulty = (noises-min(noises))/(max(noises)-min(noises))
scores_lr = np.zeros(p)
scores_gl = np.zeros(p)
for i, noise in enumerate(noises):
    print("%d/%d" % (i+1, len(noises)))

    X, y = generate_classification(n=n,
                                   d=d,
                                   noise=noise,
                                   rs=rs)

    if False:
        plt.clf()
        plt.scatter(X[:,0], X[:,1], c=y)
        plt.grid("on", alpha=0.1)
        plt.axis("square")
        plt.xlim([-30,30])
        plt.ylim([-30,30])
        txt = "Difficulty: %.1f" % difficulty[i]
        plt.annotate(txt, xy=(0.0, 1), va="top", ha="left",
                     xycoords="axes fraction", textcoords="offset points", xytext=(12, -12))
        plt.savefig("output/data-%02d.pdf" % i,
                    transparent=False,
                    bbox_inches="tight")

    scores_lr[i] = compute_logistic_regression(X, y, rs)
    scores_gl[i] = compute_logistic_group_lasso(X, y, rs, n_iter=n_iter)

plt.style.use("seaborn-pastel")
plt.plot(difficulty, scores_lr, label="sklearn (n_iter=100)")
plt.plot(difficulty, scores_gl, label="group-lasso (n_iter=%s)" % n_iter)
plt.legend()
plt.ylim([-0.05,1.05])
plt.xlabel("noise level - problem difficulty")
plt.ylabel("prediction accuracy")
plt.title("Comparison of standard logistic regression")
txt = "Average diff: %.2f" % np.mean(scores_lr - scores_gl)
plt.annotate(txt, xy=(0.9, 0.1), va="top", ha="right",
             xycoords="axes fraction")
plt.plot([min(difficulty),max(difficulty)],[0.5,0.5], ":k", alpha=0.6)
plt.savefig(f"comparison-niter={n_iter}.pdf",
            transparent=False,
            bbox_inches="tight")

###########################################################
# 2) Real-world problem
###########################################################
# Real-world problem.
print()
print("2) Real world problem...")
X, y = load_breast_cancer(return_X_y=True)
score_lr = compute_logistic_regression(X, y, rs)
score_gl = compute_logistic_group_lasso(X, y, rs)

print("sklearn:    ", score_lr)  # 0.9472759226713533, lbfgs not converged
print("group-lasso:", score_gl)  # 0.8910369068541301, FISTA not converged
print()

normanius commented 4 years ago

A possible explanation for the observed behavior is that the (logistic) group lasso regression based on FISTA is not the best choice for low-dimensional problems. Is this interpretation correct?

yngvem commented 4 years ago

Thank you for this in depth analysis! I have looked at the code, at it seems like my Lipschitz bound was incorrectly computed. As a quick fix implemented a backtracking line search, and that fixed some of the problems, but not all. I will likely push those changes during this week.

I am looking into those failure cases on the evenings, but I don't have the time to look into it during working hours, as that time is mainly spent working on teaching duties.

As for the comparison, if you compare with L-BFGS, then we can expect somewhat worse performance, as quasi-Newton methods converge much faster than first order methods. For fair comparison, we should use the SAGA solver in sklearn. Changing to this solver does degrade the performance on the breast cancer dataset (0.92 instead of 0.94). Likewise, implementing a backtracing line search for FISTA improves the performance to 0.92.

However, on the easy simulated experiments, it still fails, and I am currently spending time figuring out why.

normanius commented 4 years ago

Glad to hear that the above makes sense and that you can identify a problem. Looking forward to updates. I can't be of help on the maths side, this is quite out of reach for me at the moment.

Just out of curiosity: Why are you not using a quasi-Newton method? Is it not suited to the type of problem to be solved here?

Regarding convergence: There was almost no benefit in prediction accuracy when going from 100 to 1000 iterations (given that the regularizers were in the "stable zone" #17) . The resulting prediction accuracy often stabilizes after 10-20 iterations already. (This is not a thoroughly tested statement, I was working with small- to moderately sized problems with 1-70 predictors and <1k samples). Observing that the result doesn't improve much (while the convergence criterion is not met), isn't it possible to devise an early-exit heuristic?

yngvem commented 4 years ago

I now have a culprit for when it fails in the simple experiments, it seems like the weights sum to zero when the convergence fails, so there is possibly some degeneracy here.

The reason I am not using quasi-newton methods is that they aren't suited for non-differentiable loss functions. This is also why you must use the SAGA solver for elastic net regression for logistic regression.

As for the Lipschitz constant, it is used as the inverse step length for gradient descent, so if it is too small, then the gradient updates are too long and we get oscillating behaviour.

Regarding convergence, there is really no easy way of specifying a convergence criteria for FISTA since it doesn't converge monotonically, meaning that the loss may increase between iterations. Currently, it is based on the relative change of the coefficients, but there may very well be a better stopping criterion.

yngvem commented 3 years ago

Sorry for being silent about this for months now, between changing jobs and COVID, I haven't had much time to look into these problems. I have now finally implemented the backtracking line search for FISTA in a non-hacky way and will hopefully push those changes to the master branch in a couple of days. Also, it seems like the problem lies with the conditioning of the linear systems. Simply centering the data before fitting yields similar performance to scikit-learn for both the simulated and real dataset.

I am now looking into if I somehow can include centering directly into the fitting procedure. My gut feeling is that it should be possible, but I need to look into the maths to ensure that I don't somehow change the effect of the regulariser by centering first.

yngvem commented 3 years ago

By inspecting this further, I also noticed a bug in the predict method of the logistic group lasso class, in which it didn't use the intercept for predict_proba or predict. I have also fixed that now.

normanius commented 3 years ago

Wow, great to hear about the progress! Ahh. You're right - centering (and possibly scaling) is often helpful for convergence. For some reason, I haven't played with this in the experiments. Thanks for pointing this out.

Not sure how other sklearn estimators deal with this. I got the impression that it's usually on the user-side to properly preprocess input data. Though it would be a nice feature to automatically do it. Alternatively, you could make a note in the docs that FISTA is very sensitive for uncentered data.

yngvem commented 3 years ago

I fixed it now, since we don't regularise the intercept, we can center the dataset before fitting the model and then we modify the intercept afterwards. This makes centering the data before fitting unecessary. I have implemented this fix in the latest version of the library.

Thanks for your thorough work finding these issues :) I'll hopefully be able to fix the scikit-learn compatibility issues over the summer, but likely not before august.

normanius commented 3 years ago

Take it easy :)

Will check out the fixes if I have a bit more time. Thank you for all the effort.

normanius commented 3 years ago

Looks great now. Thanks!

Here the comparison of sklearn and group-lasso for d=2 and n_iter=100. It looks similar for the other configurations as well. The solver behaves like expected now.

yngvem / group-lasso

LogisticGroupLasso is not running stably and seems inferior to scikit-learn's LogisticRegression #15