wright-group / WrightSim

A simulation package for multidimensional spectroscopy.
MIT License
3 stars 0 forks source link

WS-specific Low-level matrix multiplication #40

Open kameyer226 opened 4 years ago

kameyer226 commented 4 years ago

Because model WrightSim Hamiltonians have a specific size and number of non-zero elements, it is possible to create and compile a low-level matrix-vector multiplication method to be used in WrightSim propagate.py at near run-time, or more likely well in advance. The elements would be analyzed and a specific multiplication code compiled, bypassing numpy's dot and or matmul methods. The result could be a run time decrease by the sparseness of the Hamiltonian ( up to nearly 1/ length of density elements , for example leading to a time 90% faster for a 10-length density vector).

Here is a discussion of whether such a concept would be worthy and implementation be viable and how it might be accomplished.

ksunden commented 4 years ago

Technically yes, however, it's not clear to me that what we would implement would actually be faster...

On the Python side, multiplication is implemented by numpy (in C) and is likely to be faster than a python implementation, even one only using the nonzero elements.

On the CUDA side, the computation is basically free already, with most of the time transferring data back and forth from the computer.

More practically, we could look closer at using sparse matrices (e.g. scipy.sparse or the newer sparse pypi package. These will have optimized impleemntations for sparse multiplication. I will note that sparse matricies take ~3x the amount of nonzero elements to store, so if we are above 30%, which is pretty close to where we are for some that we have implemented already, we don't save in terms of data contents.

kameyer226 commented 4 years ago

I hope this section to be a thorough dissemination and testing of ideas, even if common sense tells us it shouldn't help. (Although some just won't work, like using a python script to do the multiplication!) Ultimately the Hamiltonian may be significantly populated and home-built, index-by-index multiplications be foolish. But the Hamiltonians are triangular, and it might be possible to achieve a 50% speed boost just for that, as it appears (at first look) that numpy is not applying specialized LAPACK/BLAS functions for triangular matrices at this time. (See CTRMV vs CGEMV LAPACK.)

kameyer226 commented 2 years ago

link https://github.com/kameyer226/WrightSim-Cython