andreasg123 closed this pull request 5 years ago
The updated version uses small matrix slices instead of single columns. That improves the performance such that at a non-zero fraction of 0.01, the parallel version only takes 20% longer than the sequential version. Slices larger than the selected 16 columns at a time would reduce the parallelism, and with it the benefit for less sparse matrices.
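The partitioning scheme is roughly the following (a dense stand-in only, to illustrate the column-slice idea with Parallel.For; the PR itself operates on CSC storage and reuses the existing sparse kernel per slice):

using System;
using System.Threading.Tasks;

static class ColumnSliceDemo
{
    const int BlockSize = 16;

    // Computes c = a * b, assigning each Parallel.For task a contiguous
    // block of 16 result columns instead of a single column.
    public static double[,] MultiplyBlocked(double[,] a, double[,] b)
    {
        int m = a.GetLength(0), k = a.GetLength(1), n = b.GetLength(1);
        var c = new double[m, n];
        int blocks = (n + BlockSize - 1) / BlockSize;

        Parallel.For(0, blocks, block =>
        {
            int jEnd = Math.Min((block + 1) * BlockSize, n);
            for (int j = block * BlockSize; j < jEnd; j++)
            {
                for (int i = 0; i < m; i++)
                {
                    double sum = 0.0;
                    for (int p = 0; p < k; p++) sum += a[i, p] * b[p, j];
                    c[i, j] = sum; // each task writes a disjoint set of columns
                }
            }
        });

        return c;
    }
}

Grouping 16 columns per task keeps the scheduling overhead per task small while still leaving enough independent blocks to occupy all cores.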
Thanks. This looks good!
With your latest commit https://github.com/wo80/CSparse.NET/pull/18/commits/4e401e9de95e7640ad5ade1d639d562bf01aca9b, the unit test does nothing but call the usual Multiply method. To have n > block_size, you'd better use
var A = ResourceLoader.Get<double>("general-40x40.mat");
var B = ResourceLoader.Get<double>("general-40x20.mat");
and then compare the results of the serial and parallel multiplication. Or feel free to add your own test data.
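Something along these lines (a sketch only: the NUnit attributes and assertions, the ColumnPointers/RowIndices/Values member names, and Multiply/ParallelMultiply returning the product matrix are assumptions based on this discussion, not verified against the test project):

using NUnit.Framework;

public class ParallelMultiplyTest
{
    [Test]
    public void TestParallelMultiplyMatchesMultiply()
    {
        var A = ResourceLoader.Get<double>("general-40x40.mat");
        var B = ResourceLoader.Get<double>("general-40x20.mat");

        var expected = A.Multiply(B);
        var actual = A.ParallelMultiply(B); // method added by this PR

        // Compare the raw CSC arrays of both products (member names assumed).
        CollectionAssert.AreEqual(expected.ColumnPointers, actual.ColumnPointers);
        CollectionAssert.AreEqual(expected.RowIndices, actual.RowIndices);
        Assert.That(actual.Values, Is.EqualTo(expected.Values).Within(1e-12));
    }
}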
Do you have some benchmark code ready to run? Publishing a gist would save me some time!
Hello friends, just wanted to say it would be wonderful to have parallel multiplication and parallel transpose multiplication for matrix-vector products too. Since matrix-vector multiplication is much simpler than matrix * matrix, I am able to help with it. I could also write a class for precise benchmark comparisons.
Thanks
@epsi1on Parallel (sparse) matrix-vector multiplication is not simpler, because for CSC storage you can't efficiently parallelize over the rows (computing A*x), and for CSR storage you can't parallelize over the columns (computing A^t*x).
To compute both products efficiently, you usually introduce other storage formats (see http://people.cs.georgetown.edu/~jfineman/papers/csb.pdf). For symmetric matrices, the approach can be simplified (see https://www.kkourt.io/papers/ipdps13.pdf).
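For reference, this is the kind of kernel involved (a generic sketch with standard CSC arrays, not CSparse.NET's actual internals):

// Sequential CSC matrix-vector product y += A*x.
// ap = column pointers, ai = row indices, ax = values of A.
static void CscMatVec(int n, int[] ap, int[] ai, double[] ax, double[] x, double[] y)
{
    for (int j = 0; j < n; j++)                  // loop over the columns of A
    {
        for (int p = ap[j]; p < ap[j + 1]; p++)
        {
            // Different columns scatter into the same entries of y, so running
            // the outer loop with Parallel.For would race on y[ai[p]]; and the
            // rows of A are not directly addressable in CSC, so the loop cannot
            // simply be reorganized to parallelize over rows.
            y[ai[p]] += ax[p] * x[j];
        }
    }
}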
EDIT: if you'd like to discuss this subject further, please open a separate issue.
I determined heuristics for deciding whether parallel matrix multiplication may be beneficial. An estimate of the number of operations in Scatter is used as a threshold.
I used this benchmark to determine a reasonable threshold: https://gist.github.com/andreasg123/442ec96cfe7fe24d8b17dc134e69f700
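In outline, the estimate amounts to something like this (a plausible reconstruction of the described heuristic, not the committed code; OperationThreshold is a hypothetical name for the tuned constant):

// For every nonzero b(k, j), column k of A is scattered into result column j,
// so the work is roughly the sum of nnz(A(:, k)) over all nonzeros of B.
static long EstimateOperations(int[] columnPointersA, int[] columnPointersB, int[] rowIndicesB, int columnsB)
{
    long total = 0;
    for (int j = 0; j < columnsB; j++)
    {
        for (int p = columnPointersB[j]; p < columnPointersB[j + 1]; p++)
        {
            int k = rowIndicesB[p]; // row index in B selects a column of A
            total += columnPointersA[k + 1] - columnPointersA[k];
        }
    }
    return total;
}

// Hypothetical use: fall back to sequential multiplication below the tuned threshold.
// bool useParallel = EstimateOperations(...) >= OperationThreshold;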
I believe this should be the final version. I determined the threshold for deciding on parallel vs. sequential multiplication by trying many different candidate thresholds on the benchmark output: for each candidate, the times for parallel multiplication were replaced with those for sequential multiplication whenever the total operations fell below the candidate, and the candidate with the least total elapsed time was kept. I selected the overall best threshold by combining the results from benchmark runs in which the number of threads was restricted to 2, 4, and 48.
Please check issue #20 and tell me if you see an improvement running your benchmark.
I will add the change as a separate commit, since it's not directly connected to this pull request.
The method ParallelMultiply uses Parallel.For to speed up matrix multiplication by using multiple CPU cores. I tested it by multiplying randomly initialized 800x600 and 600x800 matrices with different fractions of non-zeros. With a non-zero fraction of 0.5 and above, the parallel approach is about 6 times faster. The total CPU time overhead is about 60% at 0.7 (about 10 parallel threads) and goes up for smaller non-zero fractions, e.g., 95% at 0.4. The break-even point is at a non-zero fraction of 0.03, where both approaches take the same time. With a non-zero fraction of 0.01, the parallel approach takes about 4 times longer.
As the speed-up is larger with less sparse matrices, this would probably also benefit dense matrices.
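The measurements above come from a setup roughly like the following (a sketch under assumptions: CreateRandomSparse is a hypothetical helper standing in for however the 800x600 and 600x800 test matrices with a given non-zero fraction are generated, and Multiply/ParallelMultiply are assumed to return the product; the actual benchmark is in the gist linked earlier in this thread):

using System;
using System.Diagnostics;

static class MultiplyBenchmark
{
    static void Run(double nonZeroFraction)
    {
        var random = new Random(42);
        // Hypothetical helper: build random sparse test matrices.
        var A = CreateRandomSparse(800, 600, nonZeroFraction, random);
        var B = CreateRandomSparse(600, 800, nonZeroFraction, random);

        var sw = Stopwatch.StartNew();
        var sequential = A.Multiply(B);
        sw.Stop();
        Console.WriteLine($"fraction {nonZeroFraction}: sequential {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        var parallel = A.ParallelMultiply(B);
        sw.Stop();
        Console.WriteLine($"fraction {nonZeroFraction}: parallel   {sw.ElapsedMilliseconds} ms");
    }
}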