Closed normanius closed 3 years ago
Here the entire code to make sure it is not lost...
"""
Comparison of standard logistic regression
"""
import numpy as np
import matplotlib.pyplot as plt
from group_lasso import LogisticGroupLasso
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
################################################################################
def compute_logistic_regression(X, y, rs, n_iter=100):
"""
Note: sklearn's LogisticRegression by default applies regularization.
In the experiments below, scores are lower with regularization (less
overfitting), however the effect should not be big because d=2 << n=200.
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
"""
lr = LogisticRegression(
random_state=rs,
solver="lbfgs",
max_iter=n_iter,
tol=1e-4,
)
lr.fit(X, y)
y_pred = lr.predict(X)
accuracy = (y_pred == y).mean()
score = lr.score(X, y)
assert(accuracy==score)
return score
################################################################################
def compute_logistic_group_lasso(X, y, rs, n_iter=100):
LogisticGroupLasso.LOG_LOSSES = True
gl = LogisticGroupLasso(
groups=np.arange(X.shape[1]),
group_reg=0.0,
l1_reg=0.0,
supress_warning=True,
random_state=rs,
n_iter=n_iter,
tol=1e-4,
)
gl.fit(X, y)
y_pred = gl.predict(X).ravel()
accuracy = (y_pred == y).mean()
score = gl.score(X, y)
assert(accuracy==score)
return score
################################################################################
def generate_classification(n, d, noise, rs):
"""
n: number of samples per class
d: number of features
noise: amount of uncertainty added to
rs: random state
Balanced classes, noise governed by a normal distribution,
"""
# Create separable centers.
dist = 10
mu1 = rs.uniform(high=10, size=d)
mu2 = rs.uniform(low=-1, high=1, size=d) # use as direction
mu2 = mu2/np.linalg.norm(mu2)
mu2 = mu1+dist*mu2
# Noise
n1 = n
n2 = n
X1 = rs.normal(mu1, noise, size=(n1,d))
X2 = rs.normal(mu2, noise, size=(n2,d))
y1 = np.ones(n1)
y2 = np.zeros(n2)
X = np.concatenate([X1,X2], axis=0)
y = np.concatenate([y1,y2])
# Shuffle
i = rs.permutation(len(y))
X = X[i,:]
y = y[i]
return X, y
################################################################################
# MAIN
################################################################################
# Compare LogisticGroupLasso with sklearn's LogisticRegression...
# 1) ... on synthetic data
# 2) ... on real-world data (the breast-cancer dataset)
rs = np.random.RandomState(42)
###########################################################
# Synthetic data
###########################################################
print("1) Synthethic problem...")
###########################################################
# Settings.
d = 2 # Number of features
n = 100 # Number of samples per class
p = 100 # Number of difficulty levels
n_iter = 100 # Number of max iterations
###########################################################
noises = np.linspace(1,10,p)
difficulty = (noises-min(noises))/(max(noises)-min(noises))
scores_lr = np.zeros(p)
scores_gl = np.zeros(p)
for i, noise in enumerate(noises):
print("%d/%d" % (i+1, len(noises)))
X, y = generate_classification(n=n,
d=d,
noise=noise,
rs=rs)
if False:
plt.clf()
plt.scatter(X[:,0], X[:,1], c=y)
plt.grid("on", alpha=0.1)
plt.axis("square")
plt.xlim([-30,30])
plt.ylim([-30,30])
txt = "Difficulty: %.1f" % difficulty[i]
plt.annotate(txt, xy=(0.0, 1), va="top", ha="left",
xycoords="axes fraction", textcoords="offset points", xytext=(12, -12))
plt.savefig("output/data-%02d.pdf" % i,
transparent=False,
bbox_inches="tight")
scores_lr[i] = compute_logistic_regression(X, y, rs)
scores_gl[i] = compute_logistic_group_lasso(X, y, rs, n_iter=n_iter)
plt.style.use("seaborn-pastel")
plt.plot(difficulty, scores_lr, label="sklearn (n_iter=100)")
plt.plot(difficulty, scores_gl, label="group-lasso (n_iter=%s)" % n_iter)
plt.legend()
plt.ylim([-0.05,1.05])
plt.xlabel("noise level - problem difficulty")
plt.ylabel("prediction accuracy")
plt.title("Comparison of standard logistic regression")
txt = "Average diff: %.2f" % np.mean(scores_lr - scores_gl)
plt.annotate(txt, xy=(0.9, 0.1), va="top", ha="right",
xycoords="axes fraction")
plt.plot([min(difficulty),max(difficulty)],[0.5,0.5], ":k", alpha=0.6)
plt.savefig(f"comparison-niter={n_iter}.pdf",
transparent=False,
bbox_inches="tight")
###########################################################
# 2) Real-world problem
###########################################################
# Real-world problem.
print()
print("2) Real world problem...")
X, y = load_breast_cancer(return_X_y=True)
score_lr = compute_logistic_regression(X, y, rs)
score_gl = compute_logistic_group_lasso(X, y, rs)
print("sklearn: ", score_lr) # 0.9472759226713533, lbfgs not converged
print("group-lasso:", score_gl) # 0.8910369068541301, FISTA not converged
print()
A possible explanation for the observed behavior is that the (logistic) group lasso regression based on FISTA is not the best choice for low-dimensional problems. Is this interpretation correct?
Thank you for this in depth analysis! I have looked at the code, at it seems like my Lipschitz bound was incorrectly computed. As a quick fix implemented a backtracking line search, and that fixed some of the problems, but not all. I will likely push those changes during this week.
I am looking into those failure cases on the evenings, but I don't have the time to look into it during working hours, as that time is mainly spent working on teaching duties.
As for the comparison, if you compare with L-BFGS, then we can expect somewhat worse performance, as quasi-Newton methods converge much faster than first order methods. For fair comparison, we should use the SAGA solver in sklearn. Changing to this solver does degrade the performance on the breast cancer dataset (0.92 instead of 0.94). Likewise, implementing a backtracing line search for FISTA improves the performance to 0.92.
However, on the easy simulated experiments, it still fails, and I am currently spending time figuring out why.
Glad to hear that the above makes sense and that you can identify a problem. Looking forward to updates. I can't be of help on the maths side, this is quite out of reach for me at the moment.
Just out of curiosity: Why are you not using a quasi-Newton method? Is it not suited to the type of problem to be solved here?
Regarding convergence: There was almost no benefit in prediction accuracy when going from 100 to 1000 iterations (given that the regularizers were in the "stable zone" #17) . The resulting prediction accuracy often stabilizes after 10-20 iterations already. (This is not a thoroughly tested statement, I was working with small- to moderately sized problems with 1-70 predictors and <1k samples). Observing that the result doesn't improve much (while the convergence criterion is not met), isn't it possible to devise an early-exit heuristic?
I now have a culprit for when it fails in the simple experiments, it seems like the weights sum to zero when the convergence fails, so there is possibly some degeneracy here.
The reason I am not using quasi-newton methods is that they aren't suited for non-differentiable loss functions. This is also why you must use the SAGA solver for elastic net regression for logistic regression.
As for the Lipschitz constant, it is used as the inverse step length for gradient descent, so if it is too small, then the gradient updates are too long and we get oscillating behaviour.
Regarding convergence, there is really no easy way of specifying a convergence criteria for FISTA since it doesn't converge monotonically, meaning that the loss may increase between iterations. Currently, it is based on the relative change of the coefficients, but there may very well be a better stopping criterion.
Sorry for being silent about this for months now, between changing jobs and COVID, I haven't had much time to look into these problems. I have now finally implemented the backtracking line search for FISTA in a non-hacky way and will hopefully push those changes to the master branch in a couple of days. Also, it seems like the problem lies with the conditioning of the linear systems. Simply centering the data before fitting yields similar performance to scikit-learn for both the simulated and real dataset.
I am now looking into if I somehow can include centering directly into the fitting procedure. My gut feeling is that it should be possible, but I need to look into the maths to ensure that I don't somehow change the effect of the regulariser by centering first.
By inspecting this further, I also noticed a bug in the predict method of the logistic group lasso class, in which it didn't use the intercept for predict_proba or predict. I have also fixed that now.
Wow, great to hear about the progress! Ahh. You're right - centering (and possibly scaling) is often helpful for convergence. For some reason, I haven't played with this in the experiments. Thanks for pointing this out.
Not sure how other sklearn estimators deal with this. I got the impression that it's usually on the user-side to properly preprocess input data. Though it would be a nice feature to automatically do it. Alternatively, you could make a note in the docs that FISTA is very sensitive for uncentered data.
I fixed it now, since we don't regularise the intercept, we can center the dataset before fitting the model and then we modify the intercept afterwards. This makes centering the data before fitting unecessary. I have implemented this fix in the latest version of the library.
Thanks for your thorough work finding these issues :) I'll hopefully be able to fix the scikit-learn compatibility issues over the summer, but likely not before august.
Take it easy :)
Will check out the fixes if I have a bit more time. Thank you for all the effort.
Looks great now. Thanks!
Here the comparison of sklearn
and group-lasso
for d=2
and n_iter=100
. It looks similar for the other configurations as well. The solver behaves like expected now.
Hi Yngve
I have been experimenting a bit with
LogisticGroupLasso
(for binary classification) and found it not to work super reliably, yet.LogisticGroupLasso
, from what I have understood, is a recently added extension of the core functionality: sparse group lasso for regression. Is it possible that the logistic regression variant suffers from some early illnesses, still?To demonstrate what I mean, I compared
LogisticGroupLasso
with sklearn'sLogisticRegression
estimator, on both synthetic and real-world data. To make the two comparable, I zeroed both regularizersl1_reg
andgroup_reg
. This converts the sparse group-lasso problem into a standard logistic regression problem.To reproduce the observations, run the attached script comparison_group_lasso_scikit.py. Those are the parameters to adjust
In the test, I generate a series of p=100 synthetic classification problems with increasing difficulty. The data consists of d=2 predictors with n=100 samples per class, the difficulty is increased by increasing the noise in the data. Here is how the point clouds look like for three different difficulties:
First of all, I observed that FISTA convergence almost always is a problem, unless the problem is very easy, that is the data is separable with a large-enough margin. Is there anything one can do to improve convergence?
Secondly, it appears that
LogisticGroupLasso
often is inferior to sklearn'sLogisticRegression
. Note that the average difference in prediction accuracy does not improve with a larger setting forn_iter
.Thirdly,
LogisticGroupLasso
sometimes yields good results, even FISTA has not converged. But in the majority of cases (about 60%) it "fails". What are the reasons for the "sometimes-character" of the problem? It looks as if the "failure rate" is not strongly dependent on the problem difficulty.Fourthly, the failure rate is reduced for a larger number of features
d
, from 60% to below 10%. In other words, the results by FISTA are closer to those fromlbfgs
(the solver inLogisticRegression
) if more features are present. For this test, I increased the number of samplesn
from 100 to 1000. Adding more evidence (with every dimension adding the same amount of information) makes the problem certainly easier. Note that this is not reflected by my "difficulty" indicator.Any recommendations on how to improve the performance of
LogisticGroupLasso
? It's perfectly possible that I'm not usingLogisticGroupLasso
the right way. Here's how I perform the fitting:Other feedback regarding usability:
estimator.predict()
with shape(n,)
,LogisticGroupLasso.predict()
returns 2d arrays with shape (n,1). I think one just has to cally.ravel()
when appropriate.