zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0

How to perform orthogonal NMF and other questions. #29

Closed fenguoerbian closed 2 years ago

fenguoerbian commented 2 years ago

First of all, thanks for this great package, it's blazing fast! But when I read the paper and the code I got confused. Here are some questions:

  1. In the NMF Problem Definition section, right above equation (1), you say NMF seeks to decompose A into orthogonal lower-rank non-negative matrices w and h. How is this achieved? The algorithm seems to be based on solving linear systems, and I don't see anything that orthogonalizes the solution. Also, if w is orthogonal, does that mean w^T w = I, so that a in equation (3) is just the identity matrix?

  2. In predict.hpp, when applying the L1 penalty, the code runs if (L1 != 0) b.array() -= L1; and then solves the system just like ordinary least squares. Is this an approximate solution to the L1-penalized regression, since in general that is not how the lasso is solved? (A sketch of the update I am referring to is included after this list.)

  3. Also, equation (4) seems to contain a typo: to update the ith column of h (h_i), given the ith column of A (A_i) and w, the NMF subproblem is A_i = w h_i, and the corresponding normal equations are w^T A_i = w^T w h_i, which means that in (4) it should read b = w^T A_i.
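
For concreteness, here is a minimal Eigen sketch of the per-column update I am asking about in points 2 and 3; the function name, dimensions, and the plain coordinate descent solver are my own, not the package's actual predict.hpp code:

```cpp
// Illustrative sketch only (my own names and a plain coordinate descent
// solver, not the actual predict.hpp code). For one column A_i of A, form
// the normal equations a * h_i = b with a = w^T w and b = w^T A_i, subtract
// the L1 penalty from b, then solve subject to h_i >= 0.
#include <Eigen/Dense>
#include <algorithm>
#include <iostream>

Eigen::VectorXd update_h_col(const Eigen::MatrixXd& w,      // m x k
                             const Eigen::VectorXd& A_i,    // m
                             double L1, int iters = 100) {
  Eigen::MatrixXd a = w.transpose() * w;                    // k x k, as in eqn (3)
  Eigen::VectorXd b = w.transpose() * A_i;                  // k, eqn (4) as corrected above
  if (L1 != 0) b.array() -= L1;                             // the line quoted in point 2
  Eigen::VectorXd h_i = Eigen::VectorXd::Zero(w.cols());
  for (int it = 0; it < iters; ++it) {                      // non-negative coordinate descent
    for (int j = 0; j < h_i.size(); ++j) {
      double num = b(j) - a.row(j).dot(h_i) + a(j, j) * h_i(j);
      h_i(j) = std::max(0.0, num / a(j, j));
    }
  }
  return h_i;
}

int main() {
  Eigen::MatrixXd w = Eigen::MatrixXd::Random(100, 5).cwiseAbs();  // toy non-negative w
  Eigen::VectorXd A_i = Eigen::VectorXd::Random(100).cwiseAbs();   // toy column of A
  std::cout << update_h_col(w, A_i, 0.1).transpose() << std::endl;
}
```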

zdebruine commented 2 years ago

Thanks for the note!

  1. You caught me; no, of course NMF is not orthogonal, it's collinear. I don't know how that sentence made it this far; it's plainly wrong.
  2. I am subtracting the L1 penalty from b in ax = b, and for L2 I subtract the penalty from the diagonal of a. This is how it was done in previous implementations (NNLM). How would you implement an L1 penalty?
  3. Thank you for the detailed reading. You are correct, eqn (4) should read b = w^T A_i and this is how it is implemented in the code.

I am particularly interested in hearing from you on point 2. We will update the manuscript in the future to address points 1 and 3.

fenguoerbian commented 2 years ago

I usually just run the coordinate descent algorithm because I don't know a closed-form solution to the general lasso problem. Can you point me to the reference for NNLM? I'm very new to the NMF field. I sent you an email since it's difficult to write the math equations here; I want to make sure we're talking about the same L1 penalty.
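
For reference, this is the kind of coordinate descent I mean for the general (unconstrained) lasso, which soft-thresholds each coefficient; it is a textbook sketch with my own variable names, not anything taken from RcppML or NNLM:

```cpp
// Plain coordinate descent for min 0.5 * ||y - X b||^2 + lambda * ||b||_1,
// using the soft-thresholding update (a textbook sketch, not package code).
#include <Eigen/Dense>
#include <iostream>

double soft_threshold(double z, double lambda) {
  if (z > lambda) return z - lambda;
  if (z < -lambda) return z + lambda;
  return 0.0;
}

Eigen::VectorXd lasso_cd(const Eigen::MatrixXd& X, const Eigen::VectorXd& y,
                         double lambda, int iters = 100) {
  Eigen::VectorXd beta = Eigen::VectorXd::Zero(X.cols());
  Eigen::VectorXd col_sq = X.colwise().squaredNorm().transpose();  // ||X_j||^2
  for (int it = 0; it < iters; ++it) {
    for (int j = 0; j < beta.size(); ++j) {
      // partial residual with coordinate j removed
      Eigen::VectorXd r = y - X * beta + X.col(j) * beta(j);
      beta(j) = soft_threshold(X.col(j).dot(r), lambda) / col_sq(j);
    }
  }
  return beta;
}

int main() {
  Eigen::MatrixXd X = Eigen::MatrixXd::Random(200, 10);
  Eigen::VectorXd y = Eigen::VectorXd::Random(200);
  std::cout << lasso_cd(X, y, 5.0).transpose() << std::endl;
}
```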

zdebruine commented 2 years ago


Thanks for the note here and for the email with the mathematical rundown. Indeed, I have implemented a LASSO/L1-like penalty useful in NMF, as introduced first by Kim and Park (https://academic.oup.com/bioinformatics/article/23/12/1495/225472). This differs slightly in formulation, though not intuition, from equivalent least squares implementations.

Note particularly this statement from the Kim and Park manuscript:

"...the algorithm needs column normalization of W during iterations. However, the normalization of W changes the objective function, and this makes convergence analysis difficult."

It is unintuitive to mathematically formulate this approach, but Kim and Park do it far better than I can.

Here are references for NNLM: https://github.com/linxihui/NNLM and https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3312-5. Importantly, NNLM does not scale the factors during model fitting, so the distributions (and consequently the penalties) vary between random restarts and across factors, as Kim and Park note. It is therefore an incorrect application of the penalty.
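
To make the scaling point concrete, here is a small sketch (my own illustration, not RcppML or NNLM source) of the kind of column normalization Kim and Park describe: each column of w is rescaled to unit sum and the scale is absorbed into a diagonal vector d, so a given L1 magnitude means the same thing for every factor and across restarts:

```cpp
// Sketch of column scaling during NMF updates (illustrative only).
// Each column of w is normalized to sum to 1 and its scale is kept in d,
// so penalties applied to w and h are comparable across factors.
#include <Eigen/Dense>
#include <iostream>

void scale_w(Eigen::MatrixXd& w, Eigen::VectorXd& d) {
  for (int j = 0; j < w.cols(); ++j) {
    double s = w.col(j).sum();   // entries are non-negative, so this is the column's L1 norm
    if (s > 0) {
      w.col(j) /= s;             // normalized column
      d(j) = s;                  // the model becomes A ~ w * diag(d) * h
    }
  }
}

int main() {
  Eigen::MatrixXd w = Eigen::MatrixXd::Random(20, 4).cwiseAbs();
  Eigen::VectorXd d = Eigen::VectorXd::Ones(4);
  scale_w(w, d);
  std::cout << "column sums after scaling: " << w.colwise().sum() << "\n";
  std::cout << "scales: " << d.transpose() << "\n";
}
```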

RcppML L1 penalization does exactly what we would expect from an L1-penalized least squares solution. It increases the sparsity of every factor equally, without affecting the most important features/samples in each factor. This facilitates identification of the most meaningful features and reduces the long tail of noise in each factor. It can also stabilize the model when much of the data is masked or missing, because the number of possible solution paths is shrunk towards paths defined by sets of more collinear signals. Intuitively, this is exactly what an L1/LASSO penalty should do.

Feel free to comment further, but based on the Kim and Park publication I'm going to continue using the "L1" terminology, because it behaves like an L1 penalty and there is no better published method for convex sparsification of an NMF model.

fenguoerbian commented 2 years ago

Thanks a lot! After sending the email I also found some papers confirming that in non-negative regression the hard-thresholding implementation of L1 (as in your code) performs better than the soft thresholding usually used for the general lasso. The terminology is perfectly OK; it's just that I lack knowledge about non-negative regression. I need to study this more. Thanks a lot for your help and for the great package!