znmeb / edgyR

R on the Edge: NVIDIAⓇ Jetson™ tools for R developers
https://znmeb.github.io/edgyR/
GNU Affero General Public License v3.0

link accelerated BLAS/LAPACK #2

Open thorek1 opened 4 years ago

thorek1 commented 4 years ago

What about using a more recent version of OpenBLAS (for compilation see e.g. https://github.com/prdm0/ropenblas) plus NVBLAS as the standard BLAS for the R session in RStudio Server?
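Something like this, assuming ropenblas builds on the Jetson's aarch64 toolchain (untested there, sketch only):

```r
# ropenblas downloads, compiles, and links OpenBLAS against the
# running R installation (needs the usual build tools installed)
install.packages("ropenblas")
ropenblas::ropenblas()   # latest stable OpenBLAS by default
```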

znmeb commented 4 years ago

That was actually on the original roadmap for this project: I was planning to link R functions to all of the CUDA C math libraries via Rcpp. Once I get the Docker pieces done I'll be doing that.

There are some really good libraries in the CUDA math offering - see https://developer.nvidia.com/gpu-accelerated-libraries

thorek1 commented 4 years ago

Isn't that what NVBLAS does for you? My understanding is that it routes BLAS calls to the GPU whenever it has a CUDA implementation for them (e.g. GEMM), and falls back to whatever CPU BLAS you give it otherwise (I would suggest OpenBLAS).

https://docs.nvidia.com/cuda/nvblas/index.html#Usage
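For reference, a minimal nvblas.conf along the lines of that documentation would be (the CPU BLAS path is a placeholder):

```
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /path/to/libopenblas.so
NVBLAS_GPU_LIST ALL
```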

znmeb commented 4 years ago

I might be able to do that with the Ubuntu "update-alternatives" infrastructure. See https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Linear-algebra and https://wiki.debian.org/DebianScience/LinearAlgebraLibraries. If I can get that working, it could go into v0.5.0.
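Something like this, if the alternative groups exist on L4T's Ubuntu (group names taken from the Debian wiki page above; untested on the Jetson):

```sh
# list the installed BLAS/LAPACK implementations and switch system-wide
sudo update-alternatives --config libblas.so.3-aarch64-linux-gnu
sudo update-alternatives --config liblapack.so.3-aarch64-linux-gnu
```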

thorek1 commented 4 years ago

That could be one way, but I found it neither in the NVBLAS documentation nor when searching for it.

Other alternatives: a manual symlink to libblas.so.3 (see the ropenblas link_openblas function; a sketch follows below).

Or something along these lines: https://clint.id.au/?p=1900
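The symlink route would be something along these lines (the OpenBLAS install path is a placeholder; this is roughly what ropenblas' link_openblas automates):

```sh
# point the system BLAS soname at the freshly built OpenBLAS
sudo ln -sf /opt/OpenBLAS/lib/libopenblas.so \
     /usr/lib/aarch64-linux-gnu/libblas.so.3
sudo ldconfig   # refresh the dynamic linker cache
```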

thorek1 commented 4 years ago

Further research showed that NVBLAS is nowhere to be found in the Nano's CUDA Toolkit.

It seems that this needs to be done by calling cuBLAS directly, or by using libraries like MAGMA or TensorFlow on top to make it a bit more user-friendly.
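Calling cuBLAS directly from Rcpp would look roughly like this (a sketch only, untested on the Nano; it needs a Makevars pointing at the CUDA toolkit headers and linking -lcublas -lcudart):

```cpp
// gemm.cpp: C = A %*% B on the GPU via cuBLAS
#include <Rcpp.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix gpu_gemm(Rcpp::NumericMatrix A, Rcpp::NumericMatrix B) {
  const int m = A.nrow(), k = A.ncol(), n = B.ncol();
  Rcpp::NumericMatrix C(m, n);
  double *dA, *dB, *dC;
  cudaMalloc(&dA, sizeof(double) * m * k);
  cudaMalloc(&dB, sizeof(double) * k * n);
  cudaMalloc(&dC, sizeof(double) * m * n);
  cudaMemcpy(dA, A.begin(), sizeof(double) * m * k, cudaMemcpyHostToDevice);
  cudaMemcpy(dB, B.begin(), sizeof(double) * k * n, cudaMemcpyHostToDevice);
  cublasHandle_t handle;
  cublasCreate(&handle);
  const double one = 1.0, zero = 0.0;
  // R stores matrices column-major, which is exactly what cuBLAS expects
  cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
              &one, dA, m, dB, k, &zero, dC, m);
  cudaMemcpy(C.begin(), dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
  cublasDestroy(handle);
  cudaFree(dA); cudaFree(dB); cudaFree(dC);
  return C;
}
```

The per-call cudaMemcpy traffic is part of why a higher-level wrapper like MAGMA or ArrayFire is attractive.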

znmeb commented 4 years ago

Yeah, OpenCL isn't there either, which means gpuR won't work.

thorek1 commented 4 years ago

Julia (CUDA.jl) and Python (scikit-cuda) are a bit more advanced in that respect.

Nonetheless, for R there is https://github.com/gpuRcore, which sets up gpuR with CUDA. And there are bindings to ArrayFire (https://gallery.rcpp.org/articles/introducing-rcpparrayfire/), which relies on CUDA.
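Usage would be along these lines, if the CUDA build of gpuR from gpuRcore actually compiles on the Jetson (sketch, untested):

```r
library(gpuR)
A <- matrix(rnorm(1e6), nrow = 1000)
B <- matrix(rnorm(1e6), nrow = 1000)
gA <- gpuMatrix(A, type = "double")  # copy the operands to the GPU
gB <- gpuMatrix(B, type = "double")
gC <- gA %*% gB                      # multiply on the GPU
C <- as.matrix(gC)                   # copy the result back to the host
```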

znmeb commented 4 years ago

I'll check it out - I think I looked at it a couple of months ago when I was trying to find alternatives to OpenCL. Some of those libraries aren't available for the Jetson from NVIDIA, but they may just need to be compiled from source, like RStudio.

thorek1 commented 3 years ago

I finally got NVBLAS to run.

First, the Jetson forum was kind enough to point me to libnvblas.so; you can find it at /usr/lib/aarch64-linux-gnu/libnvblas.so.

Second, following this guide, you can have the Dockerfile write /etc/nvblas.conf:

```dockerfile
RUN printf '%s\n' \
      "NVBLAS_LOGFILE nvblas.log" \
      "NVBLAS_CPU_BLAS_LIB /usr/lib/aarch64-linux-gnu/openblas-pthread/libblas.so.3" \
      "NVBLAS_GPU_LIST ALL" \
      > /etc/nvblas.conf
```

and start R with NVBLAS preloaded from the command line:

```sh
NVBLAS_CONFIG_FILE=/etc/nvblas.conf \
  LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libnvblas.so \
  R
```

I used debian:sid-slim as a base image.

On a side note, I ran the R-benchmark-25 on the Nano, and the GPU results are actually worse than the CPU:

```
GPU: 2800x2800 cross-product matrix (b = a' a) (sec): 3.84666666666666
CPU: 2800x2800 cross-product matrix (b = a' a) (sec): 1.596
```

I guess this highlights that you need to keep the GPU very busy in order to benefit from its power.

znmeb commented 3 years ago

Yeah, Amdahl's Law is a harsh reality. I doubt this could be done for the small-ish GPUs in the Jetson modules, but it's perfectly possible to compile all of R to run on a GPU that has dedicated RAM, like a 2080 Super, 2080 Ti, or Titan RTX. But I think there are better ways to program a GPU:

http://tensor-compiler.org/
https://github.com/plaidml/plaidml

thorek1 commented 3 years ago

But how would you compile all of the BLAS/LAPACK routines? Eigen has a CUDA interface that seems reasonably easy to work with, but rewriting OpenBLAS in CUDA seems unreasonable. Speed gains would only come from translating the threading from OpenMP to CUDA, and if you want to cut kernel launch times (by kernel fusion) you would need to go with something like CUDA.jl or ArrayFire.

What do you think?

Re: dedicated memory: I think the Jetsons have a shared address space (you can use zero copy), which cuts down on transfer times between CPU and GPU.
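Zero copy is the standard CUDA mapped-memory pattern; a minimal sketch (untested on the Nano):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // allow mapping host allocations into the device address space
  cudaSetDeviceFlags(cudaDeviceMapHost);
  double *h = nullptr, *d = nullptr;
  cudaHostAlloc(&h, 1024 * sizeof(double), cudaHostAllocMapped);
  cudaHostGetDevicePointer(&d, h, 0);
  // on a Jetson both pointers refer to the same physical DRAM, so
  // kernels launched with d see host writes without any cudaMemcpy
  std::printf("host %p, device %p\n", (void *)h, (void *)d);
  cudaFreeHost(h);
  return 0;
}
```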

Interesting links you posted. I wasn't aware of those.