xtensor-stack / xtensor-r

R bindings for xtensor
BSD 3-Clause "New" or "Revised" License

Repeated calls to an xtensor operation slow everything down #66

Closed DavisVaughan closed 5 years ago

DavisVaughan commented 5 years ago

Consider the following cpp file:

// [[Rcpp::depends(xtensor)]]
// [[Rcpp::plugins(cpp14)]]

#include <xtensor-r/rarray.hpp>
#include <xtensor-r/rtensor.hpp>
#include <Rcpp.h>

// [[Rcpp::export]]
SEXP rray_add_cpp(const xt::rarray<double>& x, const xt::rarray<double>& y) {
  xt::rarray<double> res = x + y;
  return res;
}

// [[Rcpp::export]]
SEXP rray_subtract_cpp(const xt::rarray<double>& x, const xt::rarray<double>& y) {
  xt::rarray<double> res = x - y;
  return res;
}

While benchmarking, I was surprised to see that repeated calls to this function get progressively slower.

library(tictoc)

Rcpp::sourceCpp("~/Desktop/test.cpp")

tic()
for(i in 1:2000) rray_add_cpp(1, 2)
toc()
#> 0.081 sec elapsed

tic()
for(i in 1:2000) rray_add_cpp(1, 2)
toc()
#> 0.201 sec elapsed

tic()
for(i in 1:2000) rray_add_cpp(1, 2)
toc()
#> 0.312 sec elapsed

tic()
for(i in 1:2000) rray_add_cpp(1, 2)
toc()
#> 0.371 sec elapsed

tic()
for(i in 1:2000) rray_add_cpp(1, 2)
toc()
#> 0.498 sec elapsed

tic()
for(i in 1:2000) rray_subtract_cpp(1, 2)
toc()
#> 0.61 sec elapsed

Eventually this gets out of control enough that it crashes. As you can see, the slowdown also spreads to other functions, like the subtraction one. It happens both when using sourceCpp() as I did here and inside a built package. Any idea what is happening? Interestingly, it doesn't seem to happen with rtensors:

// [[Rcpp::depends(xtensor)]]
// [[Rcpp::plugins(cpp14)]]

#include <xtensor-r/rtensor.hpp>
#include <Rcpp.h>

// [[Rcpp::export]]
SEXP rray_add_cpp_tensor(const xt::rtensor<double, 1>& x, const xt::rtensor<double, 1>& y) {
  xt::rtensor<double, 1> res = x + y;
  return res;
}

library(tictoc)

Rcpp::sourceCpp("~/Desktop/test.cpp")

tic()
for(i in 1:5000) rray_add_cpp_tensor(1, 2)
toc()
#> 0.06 sec elapsed

tic()
for(i in 1:5000) rray_add_cpp_tensor(1, 2)
toc()
#> 0.045 sec elapsed

tic()
for(i in 1:5000) rray_add_cpp_tensor(1, 2)
toc()
#> 0.022 sec elapsed

tic()
for(i in 1:5000) rray_add_cpp_tensor(1, 2)
toc()
#> 0.023 sec elapsed

tic()
for(i in 1:5000) rray_add_cpp_tensor(1, 2)
toc()
#> 0.023 sec elapsed

wolfv commented 5 years ago

Is there a way to check R's garbage collector? We might not free memory correctly, or perform some other weird side effect (that's the first thing that comes to my head). Since this behavior is different between tensor & array, it might be related to wrapping of the shape / stride storage in xtensor-R.

DavisVaughan commented 5 years ago

I think you might be right on the freeing memory idea.

Notice that here I still create the rarray<double> result, but return a plain integer 1 instead.

// [[Rcpp::depends(xtensor)]]
// [[Rcpp::plugins(cpp14)]]

#include <xtensor-r/rarray.hpp>
#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::IntegerVector add_cpp_noreturn(const xt::rarray<double>& x, const xt::rarray<double>& y) {
  xt::rarray<double> res = x + y;
  res = res + 1;
  Rcpp::IntegerVector z = Rcpp::IntegerVector::create(1);
  return z;
}

Here are some interesting memory results. Look especially at what happens if you just pass in vectors that aren't variables.

Rcpp::sourceCpp("~/Desktop/test.cpp")

library(tictoc)
library(pryr)

x <- as.double(1:1000)
y <- as.double(1:1000)

mem_used()
#> 33 MB

for(i in 1:2000) add_cpp_noreturn(x, y)
mem_used()
#> 37.8 MB

for(i in 1:2000) add_cpp_noreturn(x, y)
mem_used()
#> 38 MB

for(i in 1:2000) add_cpp_noreturn(x, y)
mem_used()
#> 38.2 MB

for(i in 1:2000) add_cpp_noreturn(x, y)
mem_used()
#> 38.4 MB

for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
mem_used()
#> 72 MB

for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
mem_used()
#> 105 MB

# R of course doesn't have this issue
for(i in 1:2000) x + y
mem_used()
#> 105 MB

# call an explicit gc. but the memory is still
# being held onto!
gc()
#>           used (Mb) gc trigger  (Mb) limit (Mb) max used (Mb)
#> Ncells  586906 31.4    1003967  53.7         NA   637081 34.1
#> Vcells 9077036 69.3   15662778 119.5      32768 11102034 84.8
mem_used()
#> 105 MB

DavisVaughan commented 5 years ago

With or without variables, it always seems to slow down. I would guess the two issues are related, though.

Rcpp::sourceCpp("~/Desktop/test.cpp")

library(tictoc)
library(pryr)

x <- as.double(1:1000)
y <- as.double(1:1000)

tic()
for(i in 1:2000) add_cpp_noreturn(x, y)
toc()
#> 0.181 sec elapsed

tic()
for(i in 1:2000) add_cpp_noreturn(x, y)
toc()
#> 0.362 sec elapsed

tic()
for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
toc()
#> 0.545 sec elapsed

tic()
for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
toc()
#> 0.78 sec elapsed

wolfv commented 5 years ago

Thanks for diving deeper into this. This means we need to decrement the reference count (or equivalent) somewhere. I think it might be the integer vector from the shape...

DavisVaughan commented 5 years ago

Is this wrong? https://github.com/QuantStack/xtensor-r/blob/master/include/xtensor-r/rcontainer.hpp#L190

I removed the if (m_owned) check, reran the above example, and got:

Rcpp::sourceCpp("~/Desktop/test.cpp")

library(tictoc)
library(pryr)

x <- as.double(1:1000)
y <- as.double(1:1000)

mem_used()
#> 33 MB

for(i in 1:2000) add_cpp_noreturn(x, y)
mem_used()
#> 37.5 MB

for(i in 1:2000) add_cpp_noreturn(x, y)
mem_used()
#> 37.5 MB

for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
mem_used()
#> 37.6 MB

for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
mem_used()
#> 37.6 MB

and

Rcpp::sourceCpp("~/Desktop/test.cpp")

library(tictoc)
library(pryr)

x <- as.double(1:1000)
y <- as.double(1:1000)

tic()
for(i in 1:2000) add_cpp_noreturn(x, y)
toc()
#> 0.048 sec elapsed

tic()
for(i in 1:2000) add_cpp_noreturn(x, y)
toc()
#> 0.056 sec elapsed

tic()
for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
toc()
#> 0.074 sec elapsed

tic()
for(i in 1:2000) add_cpp_noreturn(as.double(1:1000), as.double(1:1000))
toc()
#> 0.031 sec elapsed

DavisVaughan commented 5 years ago

This lets me show off why I like this library: a broadcast sum of a 10000x1 and a 1x10000 array. The timings are competitive with what you'd probably do in base R, and look at that memory usage!

(screenshot, 2018-11-21: benchmark timings and memory usage for the broadcast sum)
wolfv commented 5 years ago

Regarding m_owned: not sure. There are some issues around move constructors and garbage collection which I think are taken care of by m_owned. I need to investigate that more deeply.

The timings look extremely promising! Ideally it should be even faster ;) I think this special case of outer-sum broadcasting is something we haven't fully optimized (yet).

Btw, would you be interested in having a video chat about rray at some point? Some interesting ideas regarding JIT compilation have come up lately that you / the R community might be interested in.

DavisVaughan commented 5 years ago

Re m_owned: okay, I'll happily hand the investigation off to you if that's alright; I'm shooting in the dark at this point, to be honest. Happy to test anything, though.

I've got a branch of xtensorrr that has xsimd in it, but haven't really seen a speed increase. Maybe it's just me. More investigation required. I'm using it like:

// [[Rcpp::depends(xtensorrr)]]
// [[Rcpp::plugins(cpp14)]]

#define XTENSOR_USE_XSIMD
#include <xtensor-r/rarray.hpp>
#include <Rcpp.h>

// [[Rcpp::export]]
SEXP rray_add_cpp_simd(xt::rarray<double> x, xt::rarray<double> y) {
  xt::rarray<double> res = x + y;
  return res;
}

And sure! My email is on my profile page here if you want to reach out and set something up https://github.com/DavisVaughan

wolfv commented 5 years ago

Yep, there is definitely a problem with the owned variable. It needs to be true in many more cases. I'll have a fix by tomorrow.

DavisVaughan commented 5 years ago

Awesome thanks a lot!

wolfv commented 5 years ago

In my tests this is resolved with current master; the issue was the creation of new objects in R on each call, as we debugged together. Thanks again!