zdebruine / RcppML

Rcpp Machine Learning: Fast robust NMF, divisive clustering, and more
GNU General Public License v2.0

Upper-bounded NNLS #37

Closed · zdebruine closed 1 year ago

zdebruine commented 1 year ago

Feature request to impose an upper bound on NNLS solutions.

zdebruine commented 1 year ago

Addressed in 969c62eb4ff47ce7035bd8d534500ecc072e77df and 22953f8a4783b8d4bd4e5a098050d24d33d0de9a; an upper bound can now be imposed in the nnls, predict.nmf, and project functions.

Theoretical advantages remain to be seen.
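
A minimal sketch of the intended usage (hawaiibirds ships with RcppML; the nnls() call on crossprod(w) is just illustrative, not a documented recipe):

library(RcppML)
data(hawaiibirds)                        # example dataset shipped with RcppML
A <- hawaiibirds$counts
w <- RcppML::nmf(A, 10)@w                # rank-10 factorization; keep the W matrix
h <- project(w, A, upper_bound = 2)      # projection with coefficients capped at 2
max(h)                                   # should not exceed 2 if the bound is enforced
# bounded NNLS on the normal equations for a single column of A (illustrative)
x <- RcppML::nnls(a = crossprod(w), b = crossprod(w, as.matrix(A[, 1])), upper_bound = 2)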

ambuehll commented 1 year ago

Hi Zach, super quick implementation! Thanks!! I am compiling the package on my Mac (using clang) and I am getting:

In file included from ../inst/include/RcppML.hpp:17:
../inst/include/RcppML/nmf.hpp:100:21: error: cannot assign to variable 'upper_bound' with const-qualified type 'const double'
        upper_bound = upper_bound;

which then causes the package build to fail. This doesn't happen with the CRAN version. I'll check on a Windows server whether this persists.

ambuehll commented 1 year ago

No issues building on Windows. I'll check whether a different compiler might help on Mac.

zdebruine commented 1 year ago

I'm not on a Mac, but I tried to address this in 4f670544313e8b2c44d423c31553384d21ece9dc by removing the const specifier from the upperBound() setter for the nmf class.

Does the issue persist?

ambuehll commented 1 year ago

Works! Many thanks!!

zdebruine commented 1 year ago

OK, great. If you could confirm at some point that the results are similar to https://cran.r-project.org/web/packages/bvls/bvls.pdf, that would be awesome!

ambuehll commented 1 year ago

OK, so I played around a bit with a small system (900x5000). Here are some findings:

  1. The bound is not respected: there are values above the upper bound (500) in the solution (see the check sketched after this list).

  2. Compared to bvls, RcppML is orders of magnitude faster: 9 s vs. 256 s.

  3. Here is a plot of the bvls vs. RcppML parameters: [image]

  4. Here is the same plot after setting all values above 500 to 500: [image]
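
To make finding 1 easy to reproduce, here is a self-contained sketch of the kind of check involved, on a small random system rather than the 900x5000 data (the cap, sizes, and names are arbitrary):

library(RcppML)
set.seed(1)
A <- matrix(runif(300 * 20), 300, 20)
b <- as.vector(A %*% runif(20, 0, 1000))   # true coefficients range up to 1000
cap <- 500
x <- RcppML::nnls(a = crossprod(A), b = crossprod(A, b), cd_maxit = 1000, upper_bound = cap)
max(x)        # should be <= cap if the bound is enforced
sum(x > cap)  # number of out-of-bound coefficients; finding 1 reports a nonzero count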

ambuehll commented 1 year ago

Also, it would be amazing if the bounds could be set for each variable separately, as in the bvls case.
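
For comparison, bvls already takes the bounds as vectors with one entry per variable; a small self-contained sketch on random data (the numbers are arbitrary):

library(bvls)
set.seed(42)
A <- matrix(runif(50 * 5), 50, 5)
b <- runif(50)
lower <- rep(0, ncol(A))             # elementwise lower bounds
upper <- c(500, 500, 100, 100, 50)   # a different cap for each coefficient
fit <- bvls(A, b, bl = lower, bu = upper)
fit$x                                # each coefficient respects its own bound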

zdebruine commented 1 year ago

Reprex? I'm getting all.equal() == TRUE and a perfect 1:1 plot in this case:

library(RcppML)
library(bvls)
data(hawaiibirds)
A <- hawaiibirds$counts
w <- RcppML::nmf(A, 10)@w
h <- project(w, A)
upper_bound <- 2
h_bounded_RcppML <- project(w, A, upper_bound = upper_bound)
h_bounded_bvls <- apply(as.matrix(A), 2, function(b) {
  bvls::bvls(w, b, rep(0, ncol(w)), rep(upper_bound, ncol(w)))$x
})
df <- data.frame("RcppML" = as.vector(h_bounded_RcppML), "bvls" = as.vector(h_bounded_bvls))
library(ggplot2)
all.equal(df$RcppML, df$bvls)
ggplot(df, aes(bvls, RcppML)) + geom_point()

Hm, I'll look at how we could set bounds separately. That's getting a bit more complicated :)

ambuehll commented 1 year ago

I can confirm that it's a perfect 1:1 in your example. I'll post mine in a bit; need to re-run it...

ambuehll commented 1 year ago

library(data.table)
library(RcppML)
library(bvls)      # needed for the bvls() call below
library(ggplot2)

A <- readRDS("A_matrix.RDS")   # the 900x5000 system described above
B <- readRDS("B_vector.RDS")

bl <- rep(0, ncol(A))          # per-variable lower bounds for bvls
bu <- rep(500, ncol(A))        # per-variable upper bounds for bvls

# bounded NNLS via RcppML on the normal equations
system.time({
  resnnls <- RcppML::nnls(a = crossprod(A), b = crossprod(A, B), cd_maxit = 1000, upper_bound = 500)
})

# bvls applied to the same crossprod inputs
system.time({
  resnnls2 <- bvls(crossprod(A), crossprod(A, B), bl, bu)
})
resnnls2 <- data.table(resnnls2$x)

resnnls <- data.table(resnnls)
resnnls[, bvls := resnnls2$V1]
resnnls[, RcppML := V1]

ggplot(resnnls) + geom_point(aes(bvls, RcppML))

Data attached as Matrix.zip.

No out-of-bounds values here, but not a 1:1 match. We'll try to reproduce the out-of-bounds case.

zdebruine commented 1 year ago

At first look, this seems platform-dependent. On Linux I got out-of-bounds values; on Windows I didn't. I have no clue why this is. Will look in a bit.

zdebruine commented 1 year ago

Finally got a good look at this, and I think the code is doing what it should.

Here's what I get when I run your example:

[image: scatter plot of bvls vs. RcppML solutions]

Clearly not identical answers.

But now let's calculate the squared error of both solutions:

# residual sum of squares of each solution against the original system (crossprod(t(A), x) is A %*% x)
err_bvls_bvls <- sum((B - crossprod(t(A), as.matrix(resnnls$bvls)))^2)
err_RcppML_bvls <- sum((B - crossprod(t(A), as.matrix(resnnls$RcppML)))^2)
err_bvls_bvls
# 509807.1
err_RcppML_bvls
# 461442.1

If my math is correct, RcppML is actually doing better. I'd be curious whether this is a strange edge case or whether it holds true generally. Interested to get your feedback.
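
One way to probe whether it holds more generally would be a randomized comparison of the two objectives; a sketch on random data (sizes, cap, and names are illustrative, not the A/B above):

library(RcppML)
library(bvls)
set.seed(1)
errs <- replicate(20, {
  A <- matrix(runif(200 * 20), 200, 20)
  b <- as.vector(A %*% runif(20, 0, 2) + rnorm(200, sd = 0.1))
  cap <- 1                                   # binds for roughly half the true coefficients
  x_cd   <- RcppML::nnls(a = crossprod(A), b = crossprod(A, b), cd_maxit = 1000, upper_bound = cap)
  x_bvls <- bvls(A, b, rep(0, 20), rep(cap, 20))$x
  c(RcppML = sum((b - A %*% x_cd)^2), bvls = sum((b - A %*% x_bvls)^2))
})
rowMeans(errs)   # mean residual sum of squares for each solver across the random systems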

zdebruine commented 1 year ago

Closing due to inactivity and lack of theoretical clarity on how BVLS can be used within NMF.

I'm not even sure BVLS-NMF is weakly convex. For NNLS projections, BVLS can lead to suboptimal solutions with limited utility, and it appears that different algorithms can discover different minima (e.g. my coordinate descent BVLS vs. the 1979 Fortran BVLS implementation using active set methods). Further discussion welcome. I am tentatively planning to isolate BVLS in its own function in the next major release and remove its support from the NMF and NNLS functions to simplify the codebase.