yardstiq / quantum-benchmarks

benchmarking quantum circuit emulators for your daily research usage

Request of qulacs update and question about precision of simulation #6

Closed by corryvrequan 4 years ago

corryvrequan commented 4 years ago

Hi, I'm a contributor to qulacs. First of all, thanks for adding our library qulacs to this nice benchmark project. I've checked the qulacs benchmark script and confirmed it is implemented in an efficient way.

On the other hand, I have the following two requests/questions about the benchmarks.

Though the previous version of our library was incompatible with the latest gcc, we now believe "pip install qulacs" works with all recent gcc versions (and our SIMD code has been merged). Can I ask you to try it and replace the build script that uses the forked repository with the PyPI package install ("qulacs==0.1.8" in requirements.txt)? If you will do GPU benchmarking within the same project, "qulacs-gpu==0.1.8" might be better; it enables both CPU and GPU simulation but fails to build without CUDA.
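For reference, the requirements.txt change would roughly look like this (a sketch only; pick one of the two packages depending on whether CUDA is available):

# CPU-only benchmark
qulacs==0.1.8
# or, for GPU benchmarking (requires CUDA at build time):
# qulacs-gpu==0.1.8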

As far as I know, Cirq, for example, performs simulation with complex64 by default (https://cirq.readthedocs.io/en/stable/generated/cirq.Simulator.html), but qulacs computes with complex128. Is there any policy about precision? I think the benchmarks should be run at the same precision if possible.
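For example, the difference can be checked directly; a minimal sketch, assuming Cirq's Simulator dtype argument and qulacs's QuantumState.get_vector() (exact names may differ between versions):

import numpy as np
import cirq
from qulacs import QuantumState

# Cirq simulates in single precision unless dtype is overridden.
sim_default = cirq.Simulator()                     # dtype defaults to np.complex64
sim_double = cirq.Simulator(dtype=np.complex128)   # explicit double precision

# qulacs always stores the state vector in double precision.
state = QuantumState(5)
print(state.get_vector().dtype)                    # complex128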

Thanks,

Roger-luo commented 4 years ago

we now believe "pip install qulacs" works with all recent gcc versions (and our SIMD code has been merged).

nice, I'll update it this weekend.

Cirq performs simulation with complex64 by default

Thanks, I didn't notice that; the Cirq reviewer didn't mention it either. Yes, I aimed to use double-precision complex numbers in all benchmarks. In principle we need to make sure every package uses it, so I'll need to go through this again.

Roger-luo commented 4 years ago

@corryvrequan Hi, I just tried qulacs/qulacs-gpu==0.1.8 and it seems to fail on my machine with the previous problem (the C++14 requirement). Can you also merge the patch into the GPU package, or is this due to some other reason? Error message:

/tmp/pip-install-mmba7267/qulacs-gpu/build/temp.linux-x86_64-3.7/_deps/pybind11_fetch-src/include/pybind11/detail/common.h: In instantiation of ‘struct pybind11::overload_cast<const QuantumStateGpu*, const QuantumStateGpu*>’:
  /tmp/pip-install-mmba7267/qulacs-gpu/python/cppsim_wrapper.cpp:220:120:   required from here
  /tmp/pip-install-mmba7267/qulacs-gpu/build/temp.linux-x86_64-3.7/_deps/pybind11_fetch-src/include/pybind11/detail/common.h:755:19: error: static assertion failed: pybind11::overload_cast<...> requires compiling in C++14 mode
       static_assert(detail::deferred_t<std::false_type, Args...>::value,
                     ^~~~~~
  /tmp/pip-install-mmba7267/qulacs-gpu/python/cppsim_wrapper.cpp: In function ‘void pybind11_init_qulacs(pybind11::module&)’:
  /tmp/pip-install-mmba7267/qulacs-gpu/python/cppsim_wrapper.cpp:220:120: error: no matching function for call to ‘pybind11::overload_cast<const QuantumStateGpu*, const QuantumStateGpu*>::overload_cast(<unresolved overloaded function type>)’
       mstate.def("inner_product", py::overload_cast<const QuantumStateGpu*, const QuantumStateGpu*>(&state::inner_product));
                                                                                                                          ^
  In file included from /tmp/pip-install-mmba7267/qulacs-gpu/build/temp.linux-x86_64-3.7/_deps/pybind11_fetch-src/include/pybind11/pytypes.h:12,
                   from /tmp/pip-install-mmba7267/qulacs-gpu/build/temp.linux-x86_64-3.7/_deps/pybind11_fetch-src/include/pybind11/cast.h:13,
Roger-luo commented 4 years ago

Oh, it works on our test machine! I don't know why it still fails on my machine with a Titan XP card, however.

P.S. @corryvrequan, let me know if you have a preferred name for the acknowledgement (instead of your GitHub account name).

corryvrequan commented 4 years ago

We are very sorry for bothering you with the same error. We found that the C++14 error does not occur with g++-7 or g++-9, but it does occur with some g++-8 builds. Since g++-9 is not supported by the latest CUDA, I guess your GPU machine uses the latest supported version, i.e. g++-8. This will be fixed in the next update (possibly during this weekend); I'll let you know after the update.

instead of your GitHub account name

If it is okay, I would like to use my GitHub account name as a reviewer. (If you prefer a reviewer with a real name, I can introduce the account of another qulacs core contributor who uses their real name.)

Roger-luo commented 4 years ago

@corryvrequan I meant your preferred name for the acknowledgement in our paper. I'm not sure whether you still want to use the GitHub name in the paper, but I'm OK with either.

I've updated the benchmark with the latest version; the new SIMD updates in qulacs did improve the performance a lot. Three of the single-gate benchmarks are a bit faster than ours (as expected), and the CUDA benchmark is a bit slower than ours, but not by much. Let me know if you think there's anything I missed here.

Regarding Cirq, I've changed the simulation precision to double; it doesn't seem to affect the benchmark much. However, this is something we should keep in mind. Thanks a lot for this.

corryvrequan commented 4 years ago

I'm sorry for the late reply while you are preparing your paper.

acknowledgement

Thank you for considering adding my name to your acknowledgement. I'm sorry, but I want to use this GitHub name for this activity for several reasons. In my view it is probably strange to use a GitHub name in the acknowledgement of a published paper, so I don't mind if you don't mention me in your paper. Of course, adding my GitHub name to the acknowledgement is also welcome.

SIMD and GPU

Yes. As you expected, our speed-up seems to come mainly from the merged SIMD code. (We also did some additional tuning for CPUs that don't support AVX2, but that is not relevant to this benchmark.)

As for the GPU, interestingly, we observed that which GPU simulator is faster depends on the environment. I think this is because our CUDA code is optimized only for our own specific GPU. Anyway, your current benchmarking code for qulacs-GPU is written in the most efficient way, so I agree that the current results and your discussion are fair and sound for both CPU and GPU.

cirq

Did you forget to update the precision of the Simulator instance in test_QCBM? https://github.com/Roger-luo/quantum-benchmarks/blob/master/cirq/benchmarks.py#L91

I think it is possible that the simulation time does not become much longer with this change of precision, though.
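The kind of one-line change I mean would be something like the following; a minimal sketch, assuming the simulator is constructed inside test_QCBM (the surrounding benchmark code is omitted):

import numpy as np
import cirq

# The QCBM benchmark should also build its simulator in double precision,
# matching the other benchmark cases:
simulator = cirq.Simulator(dtype=np.complex128)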

Roger-luo commented 4 years ago

@corryvrequan yes, I forgot that! Thanks!

Of course, adding my GitHub name to the acknowledgement is also welcome.

No problem, we will use your GitHub name in the paper as well.

I think it is possible that the simulation time does not become much longer with this change of precision.

I guess there is not much BLAS involved anyway, so the performance is mainly limited by swapping. But I'll run this again.
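For a rough sense of the memory involved (back-of-the-envelope only, assuming a dense state vector):

# Approximate state-vector sizes for n = 25 qubits
n = 25
mib_complex128 = (2 ** n) * 16 / 2 ** 20   # 512 MiB
mib_complex64 = (2 ** n) * 8 / 2 ** 20     # 256 MiB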

As for the GPU, interestingly, we observed that which GPU simulator is faster depends on the environment. I think this is because our CUDA code is optimized only for our own specific GPU.

I listed the machine info as well; what is your GPU card?

Thanks for the comments, I'll let you know when the paper is public!

corryvrequan commented 4 years ago

What is your GPU card?

Sorry, I was mistaken about our benchmark environment. We are also using a Tesla V100. Though qulacs-gpu's times are almost the same as the values listed in this repository, we observed about 1.7x slower results in CuYao's benchmarks. I've attached our environment and scores.

Environment: Azure Cloud NC6s_v3
OS: Ubuntu 18.04.3 LTS
Compiler: gcc 7.4.0
CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
GPU: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB]
 driver: 430.26 
 CUDA: 10.2
python: 3.6.9
 numpy 1.17.3
 qulacs-GPU 0.1.8

> julia --project
julia> versioninfo()
Julia Version 1.0.5
Commit 3af96bcefc (2019-09-09 19:06 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)
julia> using Pkg
julia> Pkg.installed()
Dict{String,Union{Nothing, VersionNumber}} with 10 entries:
  "CSV"              => v"0.5.13"
  "YaoArrayRegister" => v"0.5.0"
  "Pkg"              => nothing
  "CuArrays"         => v"1.2.1"
  "YaoBlocks"        => v"0.7.0"
  "DataFrames"       => v"0.19.4"
  "LinearAlgebra"    => nothing
  "BenchmarkTools"   => v"0.4.3"
  "CuYao"            => v"0.1.3"
  "Yao"              => v"0.6.0"

qulacs.txt

yao_qcbm.txt

pcircuit

P.S. I also found that some lines in the scripts are outdated.

(Update 2019/11/8)

I've updated Julia from 1.0.5 to 1.2.0, and I can now reproduce the same performance relation observed in this repository. (If my memory is correct, v1.2.0 was not recommended due to a package-system conflict when I first tried to install this.) Thus, the slowdown of CuYao was due to the old Julia. I've attached the latest scores.

yao_qcbm_v1.2.0.txt

Roger-luo commented 4 years ago

Interesting, I didn't compare different Julia versions before, but yes, I think there were some updates to the PTX backend for Julia afterwards. So could you try Julia 1.2 or 1.3-rc4 and see if that works? The CPU benchmark seems to match our results, so I guess there are some compiler improvements in the later versions. (CUDAnative is not CUDA C/C++; it is a Julia transpiler to PTX, so it relies heavily on the Julia compiler.)

Yes, we have our own Tesla V100 server, but I don't think that should make so much of a difference, given that I believe there is a constant overhead from Python while running the parameterized circuit benchmark. (That benchmark is meant to evaluate the overhead of abstraction etc., while the single-gate benchmark measures the implementation of the simulated gate instructions.)

I also found that some lines in the scripts are outdated.

Thanks for letting me know. I'm actually planning to update the scripts and run all the benchmarks again later, since I also got some comments from the ProjectQ authors.

If my memory is correct, v1.2.0 was not recommended due to a package-system conflict when I first tried to install this

I think it should already be fixed with the current development version on the master branch (since we had some hacks on the Julia internals), but feel free to let me know if you hit anything.

corryvrequan commented 4 years ago

So could you try Julia 1.2 or 1.3-rc4 and see if that works?

I've created benchmark plots with qulacs, Yao with Julia v1.0.5, and Yao with Julia v1.2.0.

result

The Julia update not only improves GPU performance at large qubit counts, it also improves CPU performance at small qubit counts. However, even with Julia v1.2.0, I couldn't reproduce Yao's smaller GPU-simulation overhead at fewer than 18 qubits shown in RESULT.md. Probably that acceleration is due to other reasons.

Roger-luo commented 4 years ago

@corryvrequan this is an interesting result; I didn't expect there to be such a difference.

But our backend is implemented in a very generic way; the CPU and GPU backends actually share a lot of code, so I won't be surprised that a well-tuned C++/CUDA C implementation is faster (especially since it is not significantly faster). I will add a comment in the results pointing to this issue.

Roger-luo commented 4 years ago

I think I'll close this issue since we fixed this a while ago.