ycroissant / plm

Panel Data Econometrics with R
GNU General Public License v2.0

Suggestion: use collapse::rsplit/gsplit to speed up other code #33

Closed SebKrantz closed 1 year ago

SebKrantz commented 1 year ago

Hi Kevin, having {collapse} as a hard dependency opens up many opportunities for further drop-in performance improvements. One particularly significant one that came to mind is to use its faster split functions. rsplit() is a recursive version of split() that, with arguments drop = FALSE and flatten = TRUE, works just like split(). There is also a barebones function called gsplit(), which is even faster because it only works for vectors and by default does not save the names.

library(collapse)
library(microbenchmark)

v = wlddev$PCGDP
f = wlddev$iso3c
g = GRP(f)

# Vector
microbenchmark(gsplit(v, g), rsplit(v, g), rsplit(v, f), split(v, f))
#> Unit: microseconds
#>          expr     min       lq     mean   median       uq      max neval cld
#>  gsplit(v, g)  85.680 116.0250 151.1944 126.5120 166.6745  537.730   100 a  
#>  rsplit(v, g)  94.160 124.9500 165.5100 136.9995 189.4330  580.123   100 a  
#>  rsplit(v, f) 195.458 227.3640 322.0670 242.9830 270.2040 4041.218   100  b 
#>   split(v, f) 346.289 390.9145 475.0852 403.4090 489.3115 1404.788   100   c

# Data Frame
microbenchmark(rsplit(wlddev, g), rsplit(wlddev, f), split(wlddev, f))
#> Unit: milliseconds
#>               expr       min        lq      mean    median        uq        max neval cld
#>  rsplit(wlddev, g)  1.242354  1.395417  1.999074  1.533084  2.075499  11.572538   100  a 
#>  rsplit(wlddev, f)  1.318662  1.480204  1.915902  1.617872  2.362436   4.278622   100  a 
#>   split(wlddev, f) 36.952416 42.953552 52.031342 50.696393 58.052336 106.491162   100   b

Created on 2022-09-13 by the reprex package (v0.3.0)
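Before swapping split() for one of the faster functions, it may be worth checking on a small input that all three return the same pieces. The following is only a sketch: the toy vector and factor are made up for illustration (they are not from the benchmark above), and the comparison deliberately looks at values only, since names and other attributes may be handled differently.

```r
library(collapse)

# Hypothetical toy data: plain numeric vector, factor without unused levels
v <- c(10, 20, 30, 40, 50, 60)
f <- factor(c("a", "b", "a", "c", "b", "a"))
g <- GRP(f)  # grouping object, computed once and reusable

res_base   <- split(v, f)                       # base R: named list, one element per level
res_gsplit <- gsplit(v, g, use.g.names = TRUE)  # names are off by default in gsplit()
res_rsplit <- rsplit(v, f)

# Compare the split values only, ignoring names and other attributes
same_values <- function(a, b) {
  isTRUE(all.equal(lapply(unname(a), as.vector), lapply(unname(b), as.vector)))
}

same_values(res_base, res_gsplit)
same_values(res_base, res_rsplit)
```

Note that rsplit() defaults to drop = TRUE while base split() defaults to drop = FALSE, so results can differ in length when the factor has unused levels.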

tappek commented 1 year ago

Hi Sebastian, great, thank you for the hint! For the latest CRAN release, we changed a lot of code from for-loops to a much more efficient split approach, so this looks like a further easy speed-up!

Unlike base R's split() (via its default method), rsplit() does not seem to support matrices?

What we do a lot is this:

# split matrix X by individual and store in list
X.col <- NCOL(X)
tX.list <- split(X, ind)  # gives list of vectors
# transform list of vectors to list of matrices
tX.list <- lapply(tX.list, function(m) matrix(m, ncol = X.col))

The last line is somewhat annoying but necessary, since split() returns vectors rather than matrices.

SebKrantz commented 1 year ago

Hi Kevin, that's right, rsplit() has no matrix method. I could add one for the next release though. It is missing mainly because I never needed it. The good news is that gsplit() lets you leave the first argument empty (NULL), in which case it returns indices that can be used to subset the matrix:

library(collapse)

X = qM(mtcars)
f = qF(mtcars$cyl)

X_spl = lapply(gsplit(g = f), function(i) X[i, ])
str(X_spl)
#> List of 3
#>  $ : num [1:11, 1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
#>   .. ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
#>  $ : num [1:7, 1:11] 21 21 21.4 18.1 19.2 17.8 19.7 6 6 6 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
#>   .. ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
#>  $ : num [1:14, 1:11] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
#>   .. ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...

Created on 2022-09-13 by the reprex package (v2.0.1)
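Applied to the split-then-rebuild idiom quoted earlier in the thread, the index approach could look roughly like the sketch below. The toy matrix and grouping are hypothetical. One detail worth noting: plain X[i, ] collapses to a vector when a group has a single row, so the sketch uses drop = FALSE to always get matrices.

```r
library(collapse)

# Hypothetical toy matrix: 4 rows, groups "b" and "c" have exactly one row each
X   <- matrix(1:12, ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
ind <- factor(c("a", "a", "b", "c"))

# Index-based split: row indices per group, then subset the rows
idx    <- gsplit(g = GRP(ind))
X_list <- lapply(idx, function(i) X[i, , drop = FALSE])  # drop = FALSE keeps 1-row groups as matrices

# Base R idiom from the thread, for comparison
X.col     <- NCOL(X)
base_list <- lapply(split(X, ind), function(m) matrix(m, ncol = X.col))

str(X_list)
```

The two lists should hold the same values; the index-based version additionally keeps the column names, which the matrix() rebuild loses.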

SebKrantz commented 1 year ago

Also note that 'GRP' objects are more efficient inputs for all of these functions (factors will be converted to 'GRP' objects). So if you can, use g = GRP(mtcars$cyl) instead of qF(mtcars$cyl).
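A minimal sketch of that advice, assuming the same mtcars example as above: the grouping is computed once and then reused for several splits. The object names (idx, mpg_l, X_l) are made up for illustration.

```r
library(collapse)

X <- qM(mtcars)
g <- GRP(mtcars$cyl)  # compute the grouping once

idx   <- gsplit(g = g)           # row indices per group (x left as NULL)
mpg_l <- gsplit(mtcars$mpg, g)   # reuse g to split a vector
X_l   <- lapply(idx, function(i) X[i, , drop = FALSE])  # reuse the indices for the matrix

lengths(idx)
```

How much this saves depends on how often the same grouping is reused; in panel code that splits many objects by the same individual index, the grouping cost is paid only once.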

tappek commented 1 year ago

A matrix method would be convenient! Shall I file an issue in collapse's repository?

SebKrantz commented 1 year ago

A minor update of collapse just hit CRAN, which includes rsplit.matrix. I had to update earlier than planned due to issues with newer C compilers now used on R-devel checks. So you can move ahead with this.
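With a collapse version that ships the matrix method, the earlier index workaround should reduce to a single call. This is a sketch under the assumption that rsplit() on a matrix splits it by rows and returns a list of matrices; the exact defaults of the method are best checked in the package documentation.

```r
library(collapse)

X <- qM(mtcars)
f <- qF(mtcars$cyl)

# Requires a collapse release that includes rsplit.matrix
X_spl <- rsplit(X, f)

str(X_spl, give.attr = FALSE)
```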

tappek commented 1 year ago

I hope to have the few remaining uses of base R's split() replaced before long and will then release.

Btw: Since plm 2.6-2, we already make use of collapse::qtable (with commit 82092f6).

SebKrantz commented 1 year ago

Great, happy to see things moving forward.