andreasg123 closed this pull request 5 years ago
The updated version uses small matrix slices instead of single columns. That improves the performance such that at a non-zero fraction of 0.01, the parallel version only takes 20% longer than the sequential version. Slices larger than the selected 16 columns at a time would reduce the parallelism, and with it the benefit for less sparse matrices.
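The partitioning scheme is roughly the following (a dense stand-in only, to illustrate the column-slice idea with Parallel.For; the PR itself operates on CSC storage and reuses the existing sparse kernel per slice):

using System;
using System.Threading.Tasks;

static class ColumnSliceDemo
{
    const int BlockSize = 16;

    // Computes c = a * b, assigning each Parallel.For task a contiguous
    // block of 16 result columns instead of a single column.
    public static double[,] MultiplyBlocked(double[,] a, double[,] b)
    {
        int m = a.GetLength(0), k = a.GetLength(1), n = b.GetLength(1);
        var c = new double[m, n];
        int blocks = (n + BlockSize - 1) / BlockSize;

        Parallel.For(0, blocks, block =>
        {
            int jEnd = Math.Min((block + 1) * BlockSize, n);
            for (int j = block * BlockSize; j < jEnd; j++)
            {
                for (int i = 0; i < m; i++)
                {
                    double sum = 0.0;
                    for (int p = 0; p < k; p++) sum += a[i, p] * b[p, j];
                    c[i, j] = sum; // each task writes a disjoint set of columns
                }
            }
        });

        return c;
    }
}

Grouping 16 columns per task keeps the scheduling overhead per task small while still leaving enough independent blocks to occupy all cores.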
Thanks. This looks good!
With your latest commit https://github.com/wo80/CSparse.NET/pull/18/commits/4e401e9de95e7640ad5ade1d639d562bf01aca9b, the unit test does nothing but call the usual Multiply method. To have n > block_size, you'd better use
var A = ResourceLoader.Get<double>("general-40x40.mat");
var B = ResourceLoader.Get<double>("general-40x20.mat");
and then compare the results of the serial and parallel multiplication. Or feel free to add your own test data.
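Something along these lines (a sketch only: the NUnit attributes and assertions, the ColumnPointers/RowIndices/Values member names, and Multiply/ParallelMultiply returning the product matrix are assumptions based on this discussion, not verified against the test project):

using NUnit.Framework;

public class ParallelMultiplyTest
{
    [Test]
    public void TestParallelMultiplyMatchesMultiply()
    {
        var A = ResourceLoader.Get<double>("general-40x40.mat");
        var B = ResourceLoader.Get<double>("general-40x20.mat");

        var expected = A.Multiply(B);
        var actual = A.ParallelMultiply(B); // method added by this PR

        // Compare the raw CSC arrays of both products (member names assumed).
        CollectionAssert.AreEqual(expected.ColumnPointers, actual.ColumnPointers);
        CollectionAssert.AreEqual(expected.RowIndices, actual.RowIndices);
        Assert.That(actual.Values, Is.EqualTo(expected.Values).Within(1e-12));
    }
}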
Do you have some benchmark code ready to run? Publishing a gist would save me some time!
Hello friends, just wanted to say it would be wonderful to have parallel multiplication and parallel transpose multiplication for matrix-vector products too. Since matrix-vector multiplication is much simpler than matrix * matrix, I am able to help with it. I could also write a class for precise benchmark comparisons.
Thanks
@epsi1on Parallel (sparse) matrix-vector multiplication is not simpler, because for CSC storage you can't efficiently parallelize over the rows (computing A*x), and for CSR storage you can't parallelize over the columns (computing A^t*x).
To compute both products efficiently, you usually introduce other storage formats (see http://people.cs.georgetown.edu/~jfineman/papers/csb.pdf). For symmetric matrices, the approach can be simplified (see https://www.kkourt.io/papers/ipdps13.pdf).
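For reference, this is the kind of kernel involved (a generic sketch with standard CSC arrays, not CSparse.NET's actual internals):

// Sequential CSC matrix-vector product y += A*x.
// ap = column pointers, ai = row indices, ax = values of A.
static void CscMatVec(int n, int[] ap, int[] ai, double[] ax, double[] x, double[] y)
{
    for (int j = 0; j < n; j++)                  // loop over the columns of A
    {
        for (int p = ap[j]; p < ap[j + 1]; p++)
        {
            // Different columns scatter into the same entries of y, so running
            // the outer loop with Parallel.For would race on y[ai[p]]; and the
            // rows of A are not directly addressable in CSC, so the loop cannot
            // simply be reorganized to parallelize over rows.
            y[ai[p]] += ax[p] * x[j];
        }
    }
}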
EDIT: if you'd like to discuss this subject further, please open a separate issue.
I determined heuristics for deciding whether parallel matrix multiplication may be beneficial. An estimate of the number of operations in Scatter is used as a threshold.
I used this benchmark to determine a reasonable threshold: https://gist.github.com/andreasg123/442ec96cfe7fe24d8b17dc134e69f700
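In outline, the estimate amounts to something like this (a plausible reconstruction of the described heuristic, not the committed code; OperationThreshold is a hypothetical name for the tuned constant):

// For every nonzero b(k, j), column k of A is scattered into result column j,
// so the work is roughly the sum of nnz(A(:, k)) over all nonzeros of B.
static long EstimateOperations(int[] columnPointersA, int[] columnPointersB, int[] rowIndicesB, int columnsB)
{
    long total = 0;
    for (int j = 0; j < columnsB; j++)
    {
        for (int p = columnPointersB[j]; p < columnPointersB[j + 1]; p++)
        {
            int k = rowIndicesB[p]; // row index in B selects a column of A
            total += columnPointersA[k + 1] - columnPointersA[k];
        }
    }
    return total;
}

// Hypothetical use: fall back to sequential multiplication below the tuned threshold.
// bool useParallel = EstimateOperations(...) >= OperationThreshold;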
I believe this should be the final version. I determined the threshold for deciding on parallel vs. sequential multiplication by trying many different candidate thresholds on the benchmark output: for each candidate, the times for parallel multiplication were replaced with those for sequential multiplication whenever the total operations fell below the candidate, and the candidate with the least total elapsed time was kept. I selected the overall best threshold by combining the results from benchmark runs in which the number of threads was restricted to 2, 4, and 48.
Please check issue #20 and tell me if you see an improvement running your benchmark.
I will add the change as a separate commit, since it's not directly connected to this pull request.
The method ParallelMultiply uses Parallel.For to speed up matrix multiplication by using multiple CPU cores. I tested it by multiplying randomly initialized 800x600 and 600x800 matrices with different fractions of non-zeros. With a non-zero fraction of 0.5 and above, the parallel approach is about 6 times faster. The total CPU time overhead is about 60% at 0.7 (about 10 parallel threads) and goes up for smaller non-zero fractions, e.g., 95% at 0.4. The break-even point is at a non-zero fraction of 0.03, where both approaches take the same time. With a non-zero fraction of 0.01, the parallel approach takes about 4 times longer.
As the speed-up is larger with less sparse matrices, this would probably also benefit dense matrices.
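The measurements above come from a setup roughly like the following (a sketch under assumptions: CreateRandomSparse is a hypothetical helper standing in for however the 800x600 and 600x800 test matrices with a given non-zero fraction are generated, and Multiply/ParallelMultiply are assumed to return the product; the actual benchmark is in the gist linked earlier in this thread):

using System;
using System.Diagnostics;

static class MultiplyBenchmark
{
    static void Run(double nonZeroFraction)
    {
        var random = new Random(42);
        // Hypothetical helper: build random sparse test matrices.
        var A = CreateRandomSparse(800, 600, nonZeroFraction, random);
        var B = CreateRandomSparse(600, 800, nonZeroFraction, random);

        var sw = Stopwatch.StartNew();
        var sequential = A.Multiply(B);
        sw.Stop();
        Console.WriteLine($"fraction {nonZeroFraction}: sequential {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        var parallel = A.ParallelMultiply(B);
        sw.Stop();
        Console.WriteLine($"fraction {nonZeroFraction}: parallel   {sw.ElapsedMilliseconds} ms");
    }
}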