ymd-h opened 1 year ago
Yes, it's expected because path finding overhead can be large. It's best that you compute/cache the contraction path if you know in advance that a certain tensor network topology would be reused in your simulation (it can be nontrivial depending on your workload).
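To illustrate the caching idea, here is a minimal sketch using NumPy's `einsum_path` as a stand-in for cuTensorNet's path finder (the cuQuantum API differs, but the principle is the same: pay the path-finding cost once, then reuse the path while only the tensor values change):

```python
import numpy as np

# Toy tensor network: the topology (the einsum expression) is fixed,
# while the tensor values change between contractions.
expr = "ab,bc,cd->ad"
ops = [np.random.rand(8, 8) for _ in range(3)]

# Pay the path-finding cost once...
path, info = np.einsum_path(expr, *ops, optimize="optimal")

# ...then reuse the cached path for every new set of parameters.
for _ in range(5):
    ops = [np.random.rand(8, 8) for _ in range(3)]
    result = np.einsum(expr, *ops, optimize=path)
```

In cuQuantum's Python API the analogous pattern is to keep a network object alive, find the path once, and swap in new operands for each contraction.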
Thank you @leofang
I'm still a beginner, so your experienced comment is very helpful. I will try to investigate further.
What I am interested in is variational quantum algorithms (VQA) using parameterized quantum circuits (PQC), especially QCL[1]. (Ref: work in my other repo.)
In this setting, the parameters of the circuit change at every execution.
Do you think the caching strategy still works?
I'd say so. For parametrized circuits (like the one you had, with fixed arguments), the circuit topology is fixed, so the same path can be reused even though the tensors inside the circuit change.
If you provide more details, we could try to give a more accurate answer.
Performance depends on the number of tensors and the size of each tensor. If your tensors are not tiny, the path-finding overhead is minimal even for a small network/circuit.
What would be helpful is an idea of the size of your circuit (number of tensors), the approximate size (extent) of the tensors, how many instances of the same circuit you would like to contract, the average contraction time of your example, and which cuTensorNet version you are using.
@haidarazzam Thank you.
The tensor size is not clear yet. I want to find a "good" circuit, and my work is still at an early stage. (A smaller circuit is preferable for computation, as long as it achieves sufficient results.)
Observed Issue

With a relatively small circuit, I observed it took about 10 times longer than `default.qubit`. According to `%prun` profiling on Google Colab, the bottleneck is `cuquantum.cutensornet.cutensornet.contraction_optimize`.

Assumption

The main target of cuTensorNet is large circuits, so for relatively small circuits its overhead is probably more significant than its speed-up.
Future Work

In the default implementation, `batch_execute()` calls `execute()` serially, one circuit at a time. If we could pass a batch of circuits to cuTensorNet at once, they might run in parallel on the GPU.
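The batching idea can be sketched with plain NumPy (a conceptual stand-in, not the cuTensorNet API): stacking the operands of several same-topology circuits and adding a shared batch index lets one contraction call process all of them, instead of looping circuit by circuit.

```python
import numpy as np

# Serial baseline: contract each circuit's network one at a time.
expr = "ab,bc->ac"
batch = [(np.random.rand(8, 8), np.random.rand(8, 8)) for _ in range(4)]
serial = [np.einsum(expr, a, b) for a, b in batch]

# Batched alternative: stack operands along a new leading axis and add
# a shared batch index "x", so one call contracts all circuits at once.
A = np.stack([a for a, _ in batch])
B = np.stack([b for _, b in batch])
batched = np.einsum("xab,xbc->xac", A, B)

# Both approaches produce the same results.
assert np.allclose(batched, np.stack(serial))
```

On a GPU backend, the batched form gives the library one large, regular workload to parallelize, which is the intuition behind the proposal above.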