zlatko-minev / pyEPR

Powerful, automated analysis and design of quantum microwave chips & devices [Energy-Participation Ratio and more]
https://pyepr-docs.readthedocs.io

Optimizing quantum analysis #36

Open DanielCohenHillel opened 4 years ago

DanielCohenHillel commented 4 years ago

TL;DR: This program should've been written on a quantum computer... My PC is really starting to struggle :\

After converting the code from using qutip to using numpy (see the remove_qutip branch) I started playing around with it. One thing I noticed is that it takes an extremely long time to run for many modes. Just as an example, on the computer in our lab, running epra.analyze_all_variations(cos_trunc=8, fock_trunc=15) takes:

- for 2 modes: 3 seconds
- for 3 modes: 15 seconds
- for 4 modes: literally so long I never saw it finish (more than 20 minutes, after which I stopped it); moreover, qutip crashed due to memory problems

This could make the code impractical for some complex applications (I don't really know if such applications exist, so it might be a non-issue). This is when I fell into the optimization rabbit hole. I'm posting some of what I found here so other people can look into it and try to optimize it; I think I'll do so myself in a few days. I believe that with a couple of hours' work a 10x-20x improvement is very possible (maybe even much more).

First, here is where that time is distributed (for 3 modes):

- Basically all (99.5%) of the time is spent in the function epr_numerical_diagonalization in back_box_numeric.py, of which:
  - 30% is in the function black_box_hamiltonian, and almost 100% of that is spent on a single line: nonlinear_part = dot(-fjs, map(cos, cos_interiors))
  - 70% is in the function make_dispersive; I haven't looked further into it because, for 4 modes, black_box_hamiltonian takes so long I never got past it.

That single line takes tens of minutes to complete and accounts for almost half of the total time, so I'll explain it a bit here. The dot() refers to MatrixOps.dot, which is an implementation of the dot product that uses no external libraries. My guess is that it was done this way, instead of with np.dot or something similar, to deal with the odd shapes involved (lists of Qobj, etc.).

fjs is just a number defined a few lines up. cos is a function defined a few lines up; it uses MatrixOps.cos_approx to approximate the cosine of cos_interiors. The line is equivalent to nonlinear_part = -fjs * cos(cos_interiors), where cos is not np.cos but a Taylor approximation of the cosine.
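Just to make that concrete, here is a minimal sketch of a truncated Taylor-series cosine of an operator (this shows the idea only; the actual MatrixOps.cos_approx may use a different signature and may drop the constant and quadratic terms, since those belong to the linear part of the Hamiltonian):

```python
import numpy as np
from math import factorial

def cos_taylor(op, order=8):
    """Truncated Taylor series for cos(op) of a square matrix:
    sum_{k=0..order} (-1)^k op^(2k) / (2k)!."""
    d = op.shape[0]
    result = np.zeros((d, d), dtype=complex)
    term = np.eye(d, dtype=complex)   # op^(2k), starting at op^0 = I
    op2 = op @ op
    for k in range(order + 1):
        result += ((-1) ** k) / factorial(2 * k) * term
        term = term @ op2             # advance to op^(2(k+1))
    return result
```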

If anyone wants to give it a go, I'd start by replacing every list with an ndarray. Right now there are many lists of numpy matrices where all the matrices have the same shape, so it is pretty inefficient to store them as a list instead of a 3D ndarray. This would also simplify things, since it would let us use numpy functions instead of loops: you could just write cos(cos_interiors), for example, and it would work as expected (cos would have to be defined a bit differently, though).
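To illustrate the stacking idea: numpy's @ broadcasts over leading axes, so a list of same-shape operators can be stacked into one (n, d, d) array and the Taylor cosine applied to all of them in one shot. A rough sketch under that assumption (the variable names are just illustrative):

```python
import numpy as np
from math import factorial

def cos_taylor_batched(ops, order=8):
    """Truncated Taylor cosine of a stack of operators, ops.shape == (n, d, d)."""
    n, d, _ = ops.shape
    result = np.zeros((n, d, d), dtype=complex)
    term = np.tile(np.eye(d, dtype=complex), (n, 1, 1))   # op^0 for every operator
    ops2 = ops @ ops                                       # batched matrix product
    for k in range(order + 1):
        result += ((-1) ** k) / factorial(2 * k) * term
        term = term @ ops2
    return result

# Usage sketch, assuming fjs is a 1-D array with one junction energy per junction:
# cos_interiors = np.stack(list_of_matrices)               # shape (n_junctions, d, d)
# nonlinear_part = np.einsum('j,jab->ab', -fjs, cos_taylor_batched(cos_interiors))
```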

Note: I was writing this hoping that, after some optimization, we could do analysis on 5, 6, or even 7 modes. But as I was writing it I did some checks and found that, because of memory constraints, 4 modes is probably the limit. The memory usage is proportional to fock_levels^(2*num_modes); for 4 modes and 15 fock levels that's around 19 GB! I do believe 4 modes is possible in reasonable time with the right optimization.
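A quick back-of-the-envelope check of that scaling (8 bytes per matrix entry for float64, 16 for complex128):

```python
fock_trunc, n_modes = 15, 4
dim = fock_trunc ** n_modes      # Hilbert-space dimension: 15**4 = 50_625
entries = dim ** 2               # a dense operator has fock_trunc**(2*n_modes) entries
print(entries * 8 / 1e9)         # ~20.5 GB as float64
print(entries * 16 / 1e9)        # ~41 GB as complex128
```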

zlatko-minev commented 4 years ago

Yes

That’s very good profiling of the code

It’s inefficient

Speed can mostly be handled with numpy vectorization, as you say. For anything more advanced, the remaining for loops and so on could use Cython. Not sure it's really needed; numpy should be fine.

Memory can also be reduced by allowing a different number of Fock levels for each mode. Some modes need more than others.
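Rough sketch of that idea with plain Kronecker products (the per-mode dimensions below are made up; this is not the current pyEPR API):

```python
import numpy as np

def destroy(n):
    """Annihilation operator on an n-level truncated Fock space."""
    return np.diag(np.sqrt(np.arange(1, n)), k=1)

def mode_operator(dims, m):
    """Annihilation operator of mode m embedded in the full space,
    allowing a different truncation dims[k] for each mode k."""
    op = np.array([[1.0]])
    for k, d in enumerate(dims):
        op = np.kron(op, destroy(d) if k == m else np.eye(d))
    return op

dims = [15, 10, 5]            # e.g. only the most anharmonic mode needs many levels
a0 = mode_operator(dims, 0)   # total dimension 15*10*5 = 750 instead of 15**3 = 3375
```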

For more modes, I was working with some people on tensor network representations, but that’s probably not anytime soon. Have to see. There was a paper that came out.


DanielCohenHillel commented 4 years ago

I'd love to read that paper, if you can send it :)

Allowing a different number of Fock levels for different modes could go a long way, and it could probably be done without changing too much. We should put it on a TODO list or something.

Cython could give a nice performance boost, but it would make the code harder to work with; I think that kind of optimization should be reserved for when we run out of other ideas (maybe before a version 1.0 or something).

Now that you mention tensor networks: I haven't looked too much into how all the calculations are done, but if we can represent them as a computational graph we might be able to use a library such as TensorFlow to handle them. That would open up a sea of possibilities, and it would let a graphics card do the heavy lifting; this seems like the perfect job for a GPU. Plus it would make it easier to send the computation to a remote server, although I'm not sure anyone really wants that feature 😕.
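Just to sketch what I mean (not a concrete proposal): the same truncated-Taylor cosine written with TensorFlow ops, which would run on a GPU if one is available:

```python
import tensorflow as tf
from math import factorial

def cos_taylor_tf(ops, order=8):
    """Truncated Taylor cosine of a batch of operators, ops.shape == (n, d, d)."""
    n, d = ops.shape[0], ops.shape[1]
    result = tf.zeros_like(ops)
    term = tf.eye(d, batch_shape=[n], dtype=ops.dtype)   # identity for each operator
    ops2 = tf.linalg.matmul(ops, ops)
    for k in range(order + 1):
        coeff = tf.cast(((-1) ** k) / factorial(2 * k), ops.dtype)
        result += coeff * term
        term = tf.linalg.matmul(term, ops2)
    return result
```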

zlatko-minev commented 4 years ago

Yes, the paper is https://arxiv.org/abs/1912.01018

I agree, we can put tensorflow and cython on the backburner for now. There are a few other things to sort out first.


nonhermitian commented 4 years ago

For 4 modes at 15 levels each, the total Hilbert-space dimension is 50,625. For a dense complex array you're looking at something like 38 GB of data. Sparse storage will help here, although taking the cosine truncation to 8th order basically makes things dense anyway.

DanielCohenHillel commented 4 years ago

Yeah, fucking exponentials, man... 😒 Still, there's a lot we can do even though we're limited. Also, many things don't require the full quantum analysis (the Kerr matrix, for example, if I remember correctly, and it's one of the most important results), but I don't think there's currently a way to get them without doing the full analysis; this should probably be added.
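For example, the leading-order EPR estimate of the Kerr matrix needs no numerical diagonalization at all. A rough sketch (sign and factor conventions should be checked against the EPR paper and against what pyEPR itself reports; the argument names are just illustrative):

```python
import numpy as np

def chi_first_order(freqs, fjs, P):
    """Leading-order EPR estimate of the Kerr matrix, in the same frequency
    units as the inputs.
    freqs: (M,) mode frequencies; fjs: (J,) junction energies as frequencies
    (E_J / h); P: (M, J) energy-participation matrix."""
    freqs, fjs, P = map(np.asarray, (freqs, fjs, P))
    # chi_mn = f_m * f_n * sum_j p_mj * p_nj / (4 * f_Jj)
    chi = np.einsum('mj,nj,j->mn', P, P, 1.0 / (4.0 * fjs))
    return chi * np.outer(freqs, freqs)   # diagonal ~ self-Kerr (alpha ~ chi_mm / 2)
```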