xKDR / CRRao.jl


Integrating GLMNet and Lasso #51

Open sourish-cmi opened 1 year ago

sourish-cmi commented 1 year ago

@ajaynshah @ayushpatnaikgit @mousum-github @SusanXKDR

I am tempted to integrate GLMNet and Lasso:

https://github.com/JuliaStats/GLMNet.jl
https://github.com/JuliaStats/Lasso.jl/blob/master/docs/src/index.md
https://github.com/simonster/LARS.jl

I understand GLMNet.jl wraps the Fortran code from glmnet.

Shall we integrate them into CRRao?
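
To make the proposal concrete, here is a purely hypothetical sketch of what the user-facing call could look like. LassoRegression and its routing to GLMNet.jl/Lasso.jl do not exist in CRRao today; they are illustrative assumptions, and the actual entry point may differ:

using CRRao, RDatasets, StatsModels

mtcars = dataset("datasets", "mtcars")

# Hypothetical model type: a penalised regression delegated to GLMNet.jl/Lasso.jl.
model = fit(@formula(MPG ~ HP + WT + DRat + QSec), mtcars, LassoRegression())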

sourish-cmi commented 1 year ago

@ShouvikGhosh2048 @codetalker7

We need to do a performance check against R

sourish-cmi commented 1 year ago

Julia Code

using RDatasets
using GLMNet
using Random

# Load the mtcars data and set up the response and design matrix.
mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:, "MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

# Cross-validated elastic-net fit, with a fixed seed for reproducibility.
Random.seed!(1234)
cv = glmnetcv(x, y)
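
For reference, a small follow-up sketch (not in the original comment) of pulling the selected fit out of the cross-validation object. The field names follow the GLMNet.jl README and should be checked against the installed version:

# Index of the λ with the lowest mean cross-validated loss.
best = argmin(cv.meanloss)

cv.lambda[best]          # the selected λ
cv.path.betas[:, best]   # coefficients at that λ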
sourish-cmi commented 1 year ago

R Code

attach(mtcars)
y <- mtcars$mpg

x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

library(glmnet)
# Perform 10-fold cross-validation to find the optimal lambda value
set.seed(100)
cv_model <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
ShouvikGhosh2048 commented 1 year ago

I tried benchmarking the two programs (on WSL):

Julia code:

using RDatasets, GLMNet, Random, BenchmarkTools

mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:, "MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

@benchmark glmnetcv(x, y)

R code:

attach(mtcars)
library(microbenchmark)
library(glmnet)

y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10))

I got:

Julia:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  307.400 μs … 6.012 ms  ┊ GC (min … max): 0.00% … 78.04%
 Time  (median):     321.000 μs             ┊ GC (median):    0.00%
 Time  (mean ± σ):   349.233 μs ± 213.283 μs ┊ GC (mean ± σ): 3.52% ± 5.38%

Memory estimate: 178.95 KiB, allocs estimate: 679.

R:

Unit: milliseconds
                                     expr     min      lq     mean   median       uq      max neval
 cv.glmnet(x, y, alpha = 1, nfolds = 10) 49.2733 50.6152 53.89493 52.54825 54.24935 172.8879   100

The Julia program takes about 1/100th of the time taken by the R program. This seems wrong; I'm not sure if I made a mistake.
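
One thing worth ruling out: x and y are non-constant globals here, and the BenchmarkTools manual recommends interpolating such variables with $ so that the global-variable lookup cost is excluded from the measurement. A variant to try (my suggestion, not from the thread):

using BenchmarkTools
# $x and $y splice in the current values, so the timed expression
# does not pay for dynamic global-variable access.
@benchmark glmnetcv($x, $y)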

sourish-cmi commented 1 year ago

Let me check it on my end

sourish-cmi commented 1 year ago

Julia performance

julia> @benchmark glmnetcv(x, y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  293.541 μs …   2.972 ms  ┊ GC (min … max): 0.00% … 88.98%
 Time  (median):     305.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   313.728 μs ± 135.655 μs  ┊ GC (mean ± σ):  2.22% ±  4.59%

                  ▃▅▇████▅▃▁                                     
  ▂▁▂▂▂▂▂▂▂▂▂▃▃▄▅███████████▇▇▅▅▄▄▄▄▃▄▃▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  294 μs           Histogram: frequency by time          327 μs <

 Memory estimate: 175.67 KiB, allocs estimate: 649.

R Performance

> microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10)
+                ,times = 100)
Unit: milliseconds
                                    expr     min       lq     mean
 cv.glmnet(x, y, alpha = 1, nfolds = 10) 16.8904 17.04585 19.07382
   median       uq      max neval
 17.23759 19.43857 88.26255   100
sourish-cmi commented 1 year ago

Old style benchmarking:

R performance

attach(mtcars)
library(microbenchmark)

y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

library(glmnet)

microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10)
               ,times = 200)

start = Sys.time()
set.seed(10101)
for(i in 1:1000){
  fit = cv.glmnet(x, y, alpha = 1, nfolds = 10)
}
Sys.time()-start

Time difference of 18.86479 secs

Julia Performance

using RDatasets, GLMNet, Random, BenchmarkTools

mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

# Sys.time() is R, not Julia; Base's time() gives seconds since the epoch.
start = time()
for i in 1:1000
    Random.seed!(1234)
    cv = glmnetcv(x, y)
end
time() - start
0.35875892639160156

Performance gain

> 18.86479/0.35875892
[1] 52.58347

That is roughly a 52x performance gain.
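
As an aside, the manual start/stop timing above can also be written with Base's @elapsed macro; a minimal sketch of the same workload:

using GLMNet, Random

# @elapsed returns the wall-clock seconds the expression took to run.
t = @elapsed for i in 1:1000
    Random.seed!(1234)   # match the seeding in the loop above
    glmnetcv(x, y)
end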

ShouvikGhosh2048 commented 1 year ago

Julia code:

using RDatasets, GLMNet, BenchmarkTools

mtcars = dataset("datasets", "mtcars")

y = Vector(mtcars[:, "MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

@benchmark glmnet(x, y)

R code:

attach(mtcars)
library(microbenchmark)
library(glmnet)

y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])

microbenchmark(glmnet(x, y))

Julia:

BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.500 μs …  2.814 ms  ┊ GC (min … max): 0.00% … 98.16%
 Time  (median):     22.200 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.231 μs ± 47.710 μs  ┊ GC (mean ± σ):  3.34% ±  1.70%

 Memory estimate: 12.50 KiB, allocs estimate: 21.

R:

Unit: milliseconds
         expr    min     lq     mean  median     uq    max neval
 glmnet(x, y) 1.1302 1.1882 1.258889 1.24225 1.2888 2.5004   100

Julia glmnet output:

Least Squares GLMNet Solution Path (58 solutions for 4 predictors in 420 passes):
─────────────────────────────
      df   pct_dev          λ
─────────────────────────────
 [1]   0  0.0       5.14698
 [2]   1  0.127818  4.68974
 [3]   1  0.233934  4.27311
 [4]   1  0.322034  3.8935
 [5]   2  0.395479  3.54761
 [6]   2  0.46873   3.23245
 [7]   2  0.529522  2.94529
 [8]   2  0.579992  2.68364
 [9]   2  0.621893  2.44523
[10]   2  0.656659  2.22801
[11]   2  0.685543  2.03008
[12]   2  0.709524  1.84973
[13]   2  0.729433  1.6854
[14]   2  0.745961  1.53568
[15]   2  0.759684  1.39925
[16]   3  0.77273   1.27495
[17]   3  0.783621  1.16168
[18]   3  0.792664  1.05848
[19]   3  0.800171  0.96445
[20]   3  0.806403  0.878771
[21]   3  0.811577  0.800704
[22]   3  0.815873  0.729571
[23]   4  0.819758  0.664758
[24]   4  0.824109  0.605703
[25]   4  0.827712  0.551894
[26]   4  0.830713  0.502865
[27]   4  0.833204  0.458192
[28]   4  0.835266  0.417488
[29]   4  0.836983  0.380399
[30]   4  0.83841   0.346605
[31]   4  0.83959   0.315814
[32]   4  0.840573  0.287758
[33]   4  0.84139   0.262194
[34]   4  0.842065  0.238902
[35]   4  0.842628  0.217678
[36]   4  0.843096  0.19834
[37]   4  0.843483  0.18072
[38]   4  0.843805  0.164666
[39]   4  0.844074  0.150037
[40]   4  0.844295  0.136708
[41]   4  0.844479  0.124564
[42]   4  0.844633  0.113498
[43]   4  0.84476   0.103415
[44]   4  0.844866  0.0942278
[45]   4  0.844954  0.0858568
[46]   4  0.845026  0.0782295
[47]   4  0.845087  0.0712798
[48]   4  0.845137  0.0649475
[49]   4  0.845179  0.0591778
[50]   4  0.845214  0.0539206
[51]   4  0.845243  0.0491304
[52]   4  0.845267  0.0447658
[53]   4  0.845287  0.0407889
[54]   4  0.845303  0.0371654
[55]   4  0.845317  0.0338637
[56]   4  0.845329  0.0308553
[57]   4  0.845338  0.0281142
[58]   4  0.845346  0.0256166
─────────────────────────────

R glmnet output:

Call:  glmnet(x = x, y = y)

   Df  %Dev Lambda
1   0  0.00 5.1470
2   1 12.78 4.6900
3   1 23.39 4.2730
4   1 32.20 3.8940
5   2 39.55 3.5480
6   2 46.87 3.2320
7   2 52.95 2.9450
8   2 58.00 2.6840
9   2 62.19 2.4450
10  2 65.67 2.2280
11  2 68.55 2.0300
12  2 70.95 1.8500
13  2 72.94 1.6850
14  2 74.60 1.5360
15  2 75.97 1.3990
16  3 77.27 1.2750
17  3 78.36 1.1620
18  3 79.27 1.0580
19  3 80.02 0.9645
20  3 80.64 0.8788
21  3 81.16 0.8007
22  3 81.59 0.7296
23  4 81.98 0.6648
24  4 82.41 0.6057
25  4 82.77 0.5519
26  4 83.07 0.5029
27  4 83.32 0.4582
28  4 83.53 0.4175
29  4 83.70 0.3804
30  4 83.84 0.3466
31  4 83.96 0.3158
32  4 84.06 0.2878
33  4 84.14 0.2622
34  4 84.21 0.2389
35  4 84.26 0.2177
36  4 84.31 0.1983
37  4 84.35 0.1807
38  4 84.38 0.1647
39  4 84.41 0.1500
40  4 84.43 0.1367
41  4 84.45 0.1246
42  4 84.46 0.1135
43  4 84.48 0.1034
44  4 84.49 0.0942
45  4 84.50 0.0859
46  4 84.50 0.0782
47  4 84.51 0.0713
48  4 84.51 0.0649
49  4 84.52 0.0592
50  4 84.52 0.0539
51  4 84.52 0.0491
52  4 84.53 0.0448
53  4 84.53 0.0408
54  4 84.53 0.0372
55  4 84.53 0.0339
56  4 84.53 0.0309
57  4 84.53 0.0281
58  4 84.53 0.0256
sourish-cmi commented 1 year ago

Okay, this looks like only 4.5 times faster. That makes sense: in the cross-validated version (glmnetcv), the sampling is done using some clever technique, which is where the much larger performance gain comes from.

sourish-cmi commented 1 year ago

As per our discussion and debate at the Goa conference, it appears it is better to have a native Julia development of GLMNet.jl. That is outside the scope of CRRao, so we are putting it on the back burner.

However, Lasso.jl is a native Julia package, so we will integrate Lasso.jl into CRRao.

https://www.jstatsoft.org/article/view/v033i01

sourish-cmi commented 1 year ago

@ShouvikGhosh2048 @mousum-github @ajaynshah @codetalker7 @ayushpatnaikgit

Look at this doc: https://juliastats.org/Lasso.jl/stable/lasso/

Lasso.jl depends completely on GLM.jl.

That means Lasso.jl will work for all GLM.jl models. On top of that, Lasso.jl is a native Julia development.

Hence we should integrate Lasso.jl as a top priority.
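
To make the integration target concrete, here is a minimal usage sketch based on the linked docs; the call shapes follow the Lasso.jl documentation, but keyword options should be verified against the installed version:

using Lasso, RDatasets

mtcars = dataset("datasets", "mtcars")
y = Vector(mtcars[:, "MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])

# Fit the full L1 regularisation path (Gaussian / identity link by default).
path = fit(LassoPath, x, y)

# Coefficient matrix along the path; the docs describe selection criteria
# (e.g. minimum AICc) for picking a single segment.
coef(path)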