sourish-cmi opened 1 year ago
@ShouvikGhosh2048 @codetalker7
We need to do a performance check against R.
Julia Code
using RDatasets, GLMNet, Random
mtcars = dataset("datasets", "mtcars")
y = Vector(mtcars[:, "MPG"])                         # response: miles per gallon
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])  # predictor matrix
Random.seed!(1234)   # fix the random CV fold assignment
cv = glmnetcv(x, y)  # 10-fold cross-validated lasso path
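To mirror the R step of extracting the optimal lambda afterwards, something like the following should work (the field names lambda, meanloss, and path are taken from the GLMNet.jl README; worth verifying against the installed version):

best = argmin(cv.meanloss)      # index of the λ with the lowest mean CV loss
λmin = cv.lambda[best]          # analogue of cv_model$lambda.min in R
β = cv.path.betas[:, best]      # coefficients at that λ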
R code
attach(mtcars)
y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])
library(glmnet)
# perform 10-fold cross-validation to find the optimal lambda value
set.seed(100)
cv_model <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
I tried benchmarking the two programs (on WSL):
using RDatasets, GLMNet, Random, BenchmarkTools
mtcars = dataset("datasets", "mtcars")
y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])
@benchmark glmnetcv(x, y)
attach(mtcars)
library(microbenchmark)
y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])
library(glmnet)
microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10))
I got:
Julia:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 307.400 μs … 6.012 ms ┊ GC (min … max): 0.00% … 78.04%
Time (median): 321.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 349.233 μs ± 213.283 μs ┊ GC (mean ± σ): 3.52% ± 5.38%
Memory estimate: 178.95 KiB, allocs estimate: 679.

R:
Unit: milliseconds
expr min lq mean median uq max neval
cv.glmnet(x, y, alpha = 1, nfolds = 10) 49.2733 50.6152 53.89493 52.54825 54.24935 172.8879 100
The Julia program takes less than 1/100th of the time taken by the R program. This seems wrong; I'm not sure if I made a mistake.
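One thing worth ruling out on the Julia side: x and y are non-constant globals here, and the BenchmarkTools manual recommends interpolating such variables with $ so that overhead from untyped globals doesn't distort the measurement. For a call in the hundreds of microseconds the difference should be negligible, but it's a cheap check:

@benchmark glmnetcv($x, $y)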
Let me check it on my end
Julia performance
julia> @benchmark glmnetcv(x, y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 293.541 μs … 2.972 ms ┊ GC (min … max): 0.00% … 88.98%
Time (median): 305.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 313.728 μs ± 135.655 μs ┊ GC (mean ± σ): 2.22% ± 4.59%
▃▅▇████▅▃▁
▂▁▂▂▂▂▂▂▂▂▂▃▃▄▅███████████▇▇▅▅▄▄▄▄▃▄▃▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
294 μs Histogram: frequency by time 327 μs <
Memory estimate: 175.67 KiB, allocs estimate: 649.
R Performance
> microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10), times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
cv.glmnet(x, y, alpha = 1, nfolds = 10) 16.8904 17.04585 19.07382 17.23759 19.43857 88.26255 100
Old-style benchmarking (wall-clock time over 1000 fits):
R performance
attach(mtcars)
library(microbenchmark)
y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])
library(glmnet)
microbenchmark(cv.glmnet(x, y, alpha = 1, nfolds = 10), times = 200)
# wall-clock time for 1000 cross-validated fits
start <- Sys.time()
set.seed(10101)
for (i in 1:1000) {
  fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
}
Sys.time() - start
Time difference of 18.86479 secs
Julia Performance
using RDatasets, GLMNet, Random, BenchmarkTools
mtcars = dataset("datasets", "mtcars")
y = Vector(mtcars[:, "MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])
# wall-clock time for 1000 cross-validated fits;
# time() is the Julia analogue of R's Sys.time()
start = time()
for i in 1:1000
    Random.seed!(1234)
    cv = glmnetcv(x, y)
end
time() - start
0.35875892639160156
Performance gain
> 18.86479/0.35875892
[1] 52.58347
52 times performance gain
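As an aside, a more idiomatic way to express the Julia loop timing above is the @elapsed macro, which returns the elapsed wall-clock seconds directly (same x and y as above):

t = @elapsed for i in 1:1000
    Random.seed!(1234)
    glmnetcv(x, y)
end
t / 1000   # average seconds per cross-validated fit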
using RDatasets, GLMNet, BenchmarkTools
mtcars = dataset("datasets", "mtcars")
y = Vector(mtcars[:,"MPG"])
x = Matrix(mtcars[:, ["HP", "WT", "DRat", "QSec"]])
@benchmark glmnet(x, y)
attach(mtcars)
library(microbenchmark)
y <- mtcars$mpg
x <- data.matrix(mtcars[, c('hp', 'wt', 'drat', 'qsec')])
library(glmnet)
microbenchmark(glmnet(x, y))
Julia:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.500 μs … 2.814 ms ┊ GC (min … max): 0.00% … 98.16%
Time (median): 22.200 μs ┊ GC (median): 0.00%
Time (mean ± σ): 24.231 μs ± 47.710 μs ┊ GC (mean ± σ): 3.34% ± 1.70%
Memory estimate: 12.50 KiB, allocs estimate: 21.
R:
Unit: milliseconds
expr min lq mean median uq max neval
glmnet(x, y) 1.1302 1.1882 1.258889 1.24225 1.2888 2.5004 100
Julia glmnet output:
Least Squares GLMNet Solution Path (58 solutions for 4 predictors in 420 passes):
─────────────────────────────
df pct_dev λ
─────────────────────────────
[1] 0 0.0 5.14698
[2] 1 0.127818 4.68974
[3] 1 0.233934 4.27311
[4] 1 0.322034 3.8935
[5] 2 0.395479 3.54761
[6] 2 0.46873 3.23245
[7] 2 0.529522 2.94529
[8] 2 0.579992 2.68364
[9] 2 0.621893 2.44523
[10] 2 0.656659 2.22801
[11] 2 0.685543 2.03008
[12] 2 0.709524 1.84973
[13] 2 0.729433 1.6854
[14] 2 0.745961 1.53568
[15] 2 0.759684 1.39925
[16] 3 0.77273 1.27495
[17] 3 0.783621 1.16168
[18] 3 0.792664 1.05848
[19] 3 0.800171 0.96445
[20] 3 0.806403 0.878771
[21] 3 0.811577 0.800704
[22] 3 0.815873 0.729571
[23] 4 0.819758 0.664758
[24] 4 0.824109 0.605703
[25] 4 0.827712 0.551894
[26] 4 0.830713 0.502865
[27] 4 0.833204 0.458192
[28] 4 0.835266 0.417488
[29] 4 0.836983 0.380399
[30] 4 0.83841 0.346605
[31] 4 0.83959 0.315814
[32] 4 0.840573 0.287758
[33] 4 0.84139 0.262194
[34] 4 0.842065 0.238902
[35] 4 0.842628 0.217678
[36] 4 0.843096 0.19834
[37] 4 0.843483 0.18072
[38] 4 0.843805 0.164666
[39] 4 0.844074 0.150037
[40] 4 0.844295 0.136708
[41] 4 0.844479 0.124564
[42] 4 0.844633 0.113498
[43] 4 0.84476 0.103415
[44] 4 0.844866 0.0942278
[45] 4 0.844954 0.0858568
[46] 4 0.845026 0.0782295
[47] 4 0.845087 0.0712798
[48] 4 0.845137 0.0649475
[49] 4 0.845179 0.0591778
[50] 4 0.845214 0.0539206
[51] 4 0.845243 0.0491304
[52] 4 0.845267 0.0447658
[53] 4 0.845287 0.0407889
[54] 4 0.845303 0.0371654
[55] 4 0.845317 0.0338637
[56] 4 0.845329 0.0308553
[57] 4 0.845338 0.0281142
[58] 4 0.845346 0.0256166
─────────────────────────────
R glmnet output:
Call: glmnet(x = x, y = y)
Df %Dev Lambda
1 0 0.00 5.1470
2 1 12.78 4.6900
3 1 23.39 4.2730
4 1 32.20 3.8940
5 2 39.55 3.5480
6 2 46.87 3.2320
7 2 52.95 2.9450
8 2 58.00 2.6840
9 2 62.19 2.4450
10 2 65.67 2.2280
11 2 68.55 2.0300
12 2 70.95 1.8500
13 2 72.94 1.6850
14 2 74.60 1.5360
15 2 75.97 1.3990
16 3 77.27 1.2750
17 3 78.36 1.1620
18 3 79.27 1.0580
19 3 80.02 0.9645
20 3 80.64 0.8788
21 3 81.16 0.8007
22 3 81.59 0.7296
23 4 81.98 0.6648
24 4 82.41 0.6057
25 4 82.77 0.5519
26 4 83.07 0.5029
27 4 83.32 0.4582
28 4 83.53 0.4175
29 4 83.70 0.3804
30 4 83.84 0.3466
31 4 83.96 0.3158
32 4 84.06 0.2878
33 4 84.14 0.2622
34 4 84.21 0.2389
35 4 84.26 0.2177
36 4 84.31 0.1983
37 4 84.35 0.1807
38 4 84.38 0.1647
39 4 84.41 0.1500
40 4 84.43 0.1367
41 4 84.45 0.1246
42 4 84.46 0.1135
43 4 84.48 0.1034
44 4 84.49 0.0942
45 4 84.50 0.0859
46 4 84.50 0.0782
47 4 84.51 0.0713
48 4 84.51 0.0649
49 4 84.52 0.0592
50 4 84.52 0.0539
51 4 84.52 0.0491
52 4 84.53 0.0448
53 4 84.53 0.0408
54 4 84.53 0.0372
55 4 84.53 0.0339
56 4 84.53 0.0309
57 4 84.53 0.0281
58 4 84.53 0.0256
Okay, the plain glmnet fit looks like only about 4.5 times faster, which makes sense. In glmnetcv, the function does the cross-validation sampling using some clever technique, and that is where the outsized performance gain comes from.
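For intuition, the per-fold work in cross-validation is just an index split plus a path fit. A minimal sketch of what 10-fold CV has to do, using the x and y from above (this is illustrative, not GLMNet.jl's actual internals):

using Random
Random.seed!(1234)
n, nfolds = length(y), 10
folds = shuffle!(repeat(1:nfolds, outer = ceil(Int, n / nfolds))[1:n])  # random fold label per row
for k in 1:nfolds
    train = folds .!= k                   # fit on the other nine folds
    path = glmnet(x[train, :], y[train])  # one lasso path per fold
    # ... evaluate prediction loss on x[.!train, :], y[.!train] for each λ,
    # then average losses across folds to pick the best λ ...
end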
As per our discussion and debate at the Goa conference, it appears it would be better to have a native Julia development of GLMNet.jl. That is outside the scope of CRRao, so we are putting it back on the back burner.
However, Lasso.jl is a native Julia package, so we will integrate Lasso.jl into CRRao.
@ShouvikGhosh2048 @mousum-github @ajaynshah @codetalker7 @ayushpatnaikgit
Have a look at this doc: https://juliastats.org/Lasso.jl/stable/lasso/
Lasso.jl is built entirely on GLM.jl. That means Lasso.jl should work for all GLM.jl models. On top of that, Lasso.jl is a native Julia development; see the sketch below.
Hence we should integrate Lasso.jl as a top priority.
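As a first look at what that integration would call, a hedged sketch based on the linked docs (the fit(LassoPath, ...) signature is from the Lasso.jl documentation; ybin and the Binomial example are illustrative, not tested here):

using Lasso
path = fit(LassoPath, x, y)   # Gaussian lasso path, the analogue of glmnet(x, y)
# Because Lasso.jl builds on GLM.jl, other families should be one argument away, e.g.
# fit(LassoPath, x, ybin, Binomial()) for a logistic lasso (ybin a hypothetical 0/1 response)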
@ajaynshah @ayushpatnaikgit @mousum-github @SusanXKDR
I am tempted to integrate both GLMNet and Lasso.
https://github.com/JuliaStats/GLMNet.jl
https://github.com/JuliaStats/Lasso.jl/blob/master/docs/src/index.md
https://github.com/simonster/LARS.jl
I understand GLMNet.jl wraps the Fortran code from glmnet. Shall we integrate it into CRRao?