yngvem / group-lasso

Group Lasso implementation following the scikit-learn API
MIT License

Improve sklearn compatibility #2

Closed: normanius closed this 4 years ago

normanius commented 5 years ago

Because I need better sklearn support, I started looking at your implementation and improved a few things here and there.

You may want to have a look at my fork here: https://github.com/normanius/group-lasso/tree/normanius/sklearn-compliance I don't ask for a pull request yet because it is work-in-progress. Happy to contribute a bit more if I find the time.

I made use of check_estimator and the mixins (BaseEstimator, ClassifierMixin, LinearClassifierMixin) to check how compliant the estimator is with sklearn. (See also the sklearn contribution guidelines.) Currently, GroupLasso is fully compliant with neither ClassifierMixin nor LinearClassifierMixin, because it lacks support for multi-class classification and sparse data. The mixins also come with their own tests.
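As a sketch of what those compliance conventions look like in practice, here is a minimal toy estimator written against the API that check_estimator exercises (MajorityClassifier is a hypothetical stand-in for illustration, not part of group-lasso):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator illustrating the sklearn conventions that
    check_estimator exercises: fit() validates input and returns self,
    fitted state lives in trailing-underscore attributes, and
    classifiers expose classes_."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        check_is_fitted(self, "majority_")
        X = check_array(X)
        return np.full(X.shape[0], self.majority_)

# Compliance would then be checked with:
# from sklearn.utils.estimator_checks import check_estimator
# check_estimator(MajorityClassifier())
```

The same pattern (input validation helpers plus trailing-underscore fitted attributes) is what the mixin test suites look for.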

I assembled a list of improvements that I've partially realized already. Feel free to use the contribution. Because I am not very familiar with the algorithmic aspects of group lasso, someone definitely should have a closer look at the changes.

There is certainly more to do. But this is what I assembled so far.

yngvem commented 5 years ago

Thanks a lot for your work! I'll definitely look further into it once the summer starts and I'm done grading exams.

[Fixed] predict_proba and decision_function are missing, and predict returns scores rather than class predictions. (My fix is temporary: predict and decision_function are copied from LinearClassifierMixin; they should be removed once GroupLasso inherits from LinearClassifierMixin.)

I am not entirely sure that we get interpretable probabilities when we minimise the MSE loss instead of the cross-entropy (CE) loss. I implemented group lasso as a regression algorithm, not a classification algorithm, so I think the right thing to do here is to move everything but the gradient computation into a base class and have one subclass for GroupLassoRegression (MSE loss) and one for LogisticGroupLasso (CE loss). Alternatively, we can set the loss at init.
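The proposed split could look roughly like this; a minimal sketch with hypothetical names (_BaseGroupLasso, GroupLassoRegression, LogisticGroupLasso), where plain gradient descent stands in for FISTA and the group penalty is omitted for brevity:

```python
import numpy as np

class _BaseGroupLasso:
    """Sketch: the solver loop is shared; subclasses supply only the
    loss gradient. (Plain gradient descent stands in for FISTA, and
    the group penalty is left out to keep the example short.)"""

    def __init__(self, lr=0.1, n_iter=500):
        self.lr, self.n_iter = lr, n_iter

    def fit(self, X, y):
        w = np.zeros(X.shape[1])
        for _ in range(self.n_iter):
            w -= self.lr * self._grad(X, y, w)
        self.coef_ = w
        return self

class GroupLassoRegression(_BaseGroupLasso):
    def _grad(self, X, y, w):
        # gradient of 0.5 * mean squared error
        return X.T @ (X @ w - y) / len(y)

class LogisticGroupLasso(_BaseGroupLasso):
    def _grad(self, X, y, w):
        # gradient of the cross-entropy loss, y in {0, 1}
        p = 1 / (1 + np.exp(-X @ w))
        return X.T @ (p - y) / len(y)
```

Only the `_grad` methods differ, which is why setting the loss at init would also work; separate classes mainly help because classifiers and regressors expose different interfaces.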

It has been a long-term goal to implement a solver for the CE loss, but I wanted a working least-squares algorithm first because that is easier to implement. If we want to implement a group-lasso regularised CE solver, then we need to find an upper bound on the Lipschitz constant of the gradient of the cross-entropy loss. I'll do some maths on the train back from work tomorrow and see how difficult it will be to implement. Alternatively, there are line-search methods that should work, but these are more complex than simply computing the Lipschitz constant.
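For reference, a bound is easy to state in the two-class case: the Hessian of the CE loss is XᵀDX/n with D diagonal and entries p(1−p) ≤ 1/4, so σ_max(X)²/(4n) upper-bounds the Lipschitz constant of the gradient. A quick numerical sanity check of that bound:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 4)
n = X.shape[0]

# Upper bound: the CE-loss Hessian is X.T @ D @ X / n with
# D = diag(p * (1 - p)) <= I / 4, hence L <= sigma_max(X)**2 / (4 * n).
L_bound = np.linalg.norm(X, 2) ** 2 / (4 * n)

# Largest Hessian eigenvalue at an arbitrary point w; it never
# exceeds L_bound.
w = rng.randn(4)
p = 1 / (1 + np.exp(-X @ w))
D = p * (1 - p)
H = X.T @ (D[:, None] * X) / n
lam_max = np.linalg.eigvalsh(H).max()
```

This constant is what a fixed-step FISTA would use in place of a line search.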

[Fixed] The results are not reproducible for repeated runs of fit(). This is because GroupLasso uses random coefficient initialization. There are sklearn guidelines for the handling of random states.
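The sklearn convention is to store random_state untouched in __init__ and resolve it with check_random_state inside fit(). A minimal sketch (ReproducibleInit is a hypothetical name for illustration):

```python
from sklearn.utils import check_random_state

class ReproducibleInit:
    """Sketch of the sklearn random_state convention: keep the
    parameter untouched in __init__, resolve it inside fit()."""

    def __init__(self, n_features, random_state=None):
        self.n_features = n_features
        self.random_state = random_state

    def fit(self):
        rng = check_random_state(self.random_state)
        # random coefficient initialization, now reproducible
        self.coef_ = rng.standard_normal(self.n_features)
        return self
```

With an integer seed, repeated fits then produce identical initial coefficients.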

Thank you for this, it was on my todo list, but I hadn't come to it yet.

[Fixed] The labels y are not necessarily numeric. In particular, grad() relies on the assumption that y takes numerical values. In my fix, I employ LabelEncoder to convert y into a numeric vector, but I'm not entirely sure this is the canonical way to do it. Furthermore, there is no multi-class support at the moment. Have a look at LogisticRegression to see how multi-class is handled there.
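For reference, the LabelEncoder approach looks like this (note that classes_ are stored sorted, so the numeric codes follow alphabetical order):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["dog", "cat", "dog", "bird"])

le = LabelEncoder()
y_num = le.fit_transform(y)  # numeric codes; classes_ is sorted

# le.classes_         -> ['bird', 'cat', 'dog']
# inverse_transform   recovers the original string labels
y_back = le.inverse_transform(y_num)
```

This is indeed the canonical sklearn pattern for non-numeric labels; classifiers also store classes_ so that predict can return labels in the original dtype.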

This is also a purely classification feature.

[Fixed] I strongly dislike the group assignment format (a list of tuples with start/stop indices). Instead, I would simply go for a list of group identifiers. For example, if X has 5 features, a possible group assignment could be groups = [3,1,1,2,3]. If the features must be organized in a particular way (e.g. reordered so that the groups become [1,1,2,3,3]), this can be done under the hood. I added _check_and_format_groups() to convert the group assignment into the original list-of-tuples format, though I haven't checked whether this can be optimized/simplified any further.
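A possible implementation of that conversion, sketched as a hypothetical helper (mirroring what _check_and_format_groups might do under the hood):

```python
import numpy as np

def groups_to_slices(groups):
    """Convert a per-feature group-identifier list, e.g. [3, 1, 1, 2, 3],
    into (order, slices): `order` permutes the columns so that each
    group is contiguous, and `slices` holds one (start, stop) tuple
    per group in the original list-of-tuples format.
    (Hypothetical helper for illustration.)"""
    groups = np.asarray(groups)
    order = np.argsort(groups, kind="stable")  # group members made contiguous
    sorted_ids = groups[order]
    slices, start = [], 0
    for gid in np.unique(sorted_ids):
        stop = start + int(np.sum(sorted_ids == gid))
        slices.append((start, stop))
        start = stop
    return order, slices
```

The solver can then work on X[:, order] with the old tuple format, while users only ever see the identifier format.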

Your solution is much better! I don't even remember why I chose the format that I did.

[Open] No support for intercepts (as in LogisticRegression). This should not be a big deal. I introduced self.intercept_ = 0, but it is not computed at the moment.
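The usual trick is to fit the penalised coefficients on centred data, so the penalty never touches the intercept, and then recover intercept_ from the means. A sketch with plain least squares standing in for the penalised solver (data is made up):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0  # true intercept is 4

# Centre X and y, fit the (would-be penalised) coefficients on the
# centred data, then recover the intercept from the means.
X_mean, y_mean = X.mean(axis=0), y.mean()
coef = np.linalg.lstsq(X - X_mean, y - y_mean, rcond=None)[0]
intercept = y_mean - X_mean @ coef
```

The same centring works unchanged when lstsq is replaced by the group-lasso solver, since the intercept column carries no penalty.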

I agree, this should not be difficult to implement.

[Open] Support for multi-class classification (with more than two labels)

Multi-class group lasso is not entirely trivial, since it is unclear whether we want sparsity shared across the different labels (i.e. do we accept that a different set of groups is used to check whether the output is a cat than whether it is a dog?). There should probably be a flag to control this behaviour.
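The two options can be made concrete with a small numpy example: a shared penalty takes one norm per group over the whole coefficient block (so a group is kept or dropped for all classes at once), while a per-class penalty takes one norm per (group, class) pair. The coefficient matrix below is made up for illustration:

```python
import numpy as np

W = np.array([[1.0, 1.0],   # coefficient matrix: features x classes
              [2.0, 2.0],
              [0.0, 3.0]])
groups = [(0, 2), (2, 3)]   # feature rows belonging to each group

# Shared sparsity: one (Frobenius) norm per group over ALL classes,
# so each group is selected jointly for every class.
shared = sum(np.linalg.norm(W[s:t, :]) for s, t in groups)

# Per-class sparsity: a separate norm per (group, class) pair, so
# each class may select a different set of groups.
per_class = sum(np.linalg.norm(W[s:t, c])
                for s, t in groups for c in range(W.shape[1]))
```

By the norm inequality, the shared penalty is never larger than the per-class one; more importantly, their minimisers differ, which is why a flag to choose between them makes sense.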

[Open] Sparse-data support

Yes! This will, however, be much work for me as I haven't worked much with sparse matrices in Python.

[Open] Is it an option to derive from LogisticRegression and only override fit()?

There should be; we might want to override __init__ as well to set the groups. Again, I didn't do this since I used a regression model as a base.

normanius commented 5 years ago

Whoops, I was so focused on classification (I have always encountered group lasso in this context) that I misinterpreted your estimator as a (work-in-progress) classifier. Sorry about this! Some of my comments certainly don't make immediate sense, then. I should've made the effort to actually understand which optimization problem is solved in _fista() :)

I think the right thing to do here is to move everything but the gradient computation in a base class and have one subclass for GroupLassoRegression (MSE loss) and one for LogisticGroupLasso (CE loss). Alternatively, we can set the loss at init.

Splitting the functionality into two classes is probably simpler/better because of the different interfaces for classifiers and regressors.

Regarding the extra features: one thing after the other :) I don't have much experience with sparse matrices either. For the multi-class classifier, one could take inspiration from LogisticRegression. At first glance, it didn't look too involved, but I didn't really think it through. The docs distinguish between two modes:

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’.

and elsewhere it is written:

For a multi_class problem, if multi_class is set to be "multinomial" the softmax function is used to find the predicted probability of each class. Else use a one-vs-rest approach, i.e calculate the probability of each class assuming it to be positive using the logistic function and normalize these values across all the classes.
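The two quoted modes amount to the following numpy computations (the score matrix is made up for illustration):

```python
import numpy as np

# decision scores, shape (n_samples, n_classes)
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])

# 'multinomial': softmax over the class scores
# (max-subtraction only for numerical stability).
e = np.exp(scores - scores.max(axis=1, keepdims=True))
p_multinomial = e / e.sum(axis=1, keepdims=True)

# 'ovr': per-class logistic probability (each class vs the rest),
# then normalised across the classes.
sig = 1 / (1 + np.exp(-scores))
p_ovr = sig / sig.sum(axis=1, keepdims=True)
```

Both produce rows summing to one and agree on the argmax; they differ in the probability mass assigned to the non-winning classes.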

yngvem commented 5 years ago

You can actually use linear regression models for classification problems. This is done frequently in chemometrics, where you often have more features than samples. PLSR (which lies somewhere between linear regression and PCA) is therefore used to predict dummy-encoded variables.

However, when people use linear functions for classification problems, we cannot directly interpret their outputs as probabilities, but rather as scores. I'd recommend encoding your labels as +1/-1 pairs and using group lasso to predict on these. Once you've selected your variables, you can use them to fit a logistic model as well if you want probabilities. Otherwise, just set a hard threshold at 0.

This rambling made me realise that we might want to use group lasso as a transformer, returning the dataset after removing unused variables for it to fit nicely into a pipeline.
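Something close to the transformer idea can already be sketched with sklearn's SelectFromModel, here with plain Lasso standing in for group lasso as the sparsifying selector (data and parameters are made up):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only two informative features

# SelectFromModel turns any sparse linear model into a transformer
# that drops zero-coefficient features, so it slots into a Pipeline.
pipe = Pipeline([
    ("select", SelectFromModel(Lasso(alpha=0.05))),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
n_kept = pipe.named_steps["select"].get_support().sum()
```

A group-lasso estimator exposing coef_ could be dropped into the "select" step the same way, or given its own transform() for the same effect.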

After a quick google search for logistic group lasso, I found this paper. However, their algorithm is block coordinate descent which is famously slow (orders of magnitude slower than FISTA).

marketneutral commented 4 years ago

However, when people use linear functions for classification problems, then we cannot directly interpret their outputs as probabilities, but rather as a scoring. I'd recommend encoding your variables as +1/-1 pairs and using group lasso to predict on these. Once you've selected your variables, you can use them to fit a logistic model as well if you want the probabilities. Otherwise, just set a hard threshold at 0. This rambling made me realise that we might want to use group lasso as a transformer, returning the dataset after removing unused variables for it to fit nicely into a pipeline.

I am also interested in group lasso for logistic regression. Do I read this correctly that you recommend the following pipeline:

1) encode the target vector as {-1, 1} instead of {0, 1}
2) use group lasso to run a regression on this target vector
3) take the non-zero features from step 2 and then run a logistic regression on them

If that's right, why do we need step 1? Why not just run the regression with {0, 1} as target?
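For illustration, the three steps above can be sketched with plain Lasso standing in for group lasso (data and parameters are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 8)
y01 = (X[:, 0] - X[:, 2] > 0).astype(int)

# 1) encode the {0, 1} target as {-1, +1}
y_pm = 2 * y01 - 1

# 2) sparse regression on the +-1 target
#    (plain Lasso stands in for group lasso here)
lasso = Lasso(alpha=0.05).fit(X, y_pm)
kept = np.flatnonzero(lasso.coef_)

# 3) refit a logistic model on the surviving features to get
#    interpretable probabilities
logreg = LogisticRegression().fit(X[:, kept], y01)
```

With the ±1 encoding, the regression scores are centred around the hard-threshold point 0; with {0, 1} targets one would threshold at 0.5 instead.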

yngvem commented 4 years ago

Hi,

I am sorry for the late response. My suggestions above were for an old version of the codebase; I have now implemented logistic (2-class) and multinomial (multi-class) group lasso. As for the +-1 bit, it is irrelevant now that I have implemented intercepts.

Also, I aim to improve the documentation later this year (maybe next year), but that will have to wait until my teaching workload has reduced and my tendonitis is better.

I hope this was at least somewhat helpful.

yngvem commented 4 years ago

All issues posted here are moved to separate issues. The only issue left is to inherit from the corresponding class in scikit-learn (issue #4)

normanius commented 4 years ago

Great work! Thank you Yngve!!! Will have a look at it in the next couple of days. I'm reawakening a project that might benefit from your tool.

yngvem commented 4 years ago

Thank you! Just one note: if you updated within the past week, you should update once more, because I introduced a bug in the logistic regression class for two-class problems. This is fixed in v1.3.1.