device::spmm_coo requires huge anount of memory

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Apply the smoothed_aggregation preconditioner to a big enough matrix.

What is the expected output? What do you see instead?
The solver fails with bad::alloc because the GPU memory is exhausted by 
cusp::detail::device::spmm_coo.

What version of the product are you using? On what operating system?
cusp v0.1.1
thrust v1.3.0
nvidia cuda 3.0
windows xp professional x64 edition

Please provide any additional information below.
I think, the problem is general, but if you need, I can upload my particular 
matrix. It only has 8M of nonzeros and fails on a GTX285 card which has 1GB RAM.

I wonder if the spmm algorithm could be optimized or avoided in 
smoothed_aggregation by keeping the original matrices, let's say A and B, and 
computing y=A*(B*x) instead of y=(A*B)*x?

Thanks!

Original issue reported on code.google.com by agnonc...@gmail.com on 11 Oct 2010 at 9:22

GoogleCodeExporter commented 8 years ago

Thanks for the report.

Yes, this is a known problem with the current implementation and we intend to 
fix it for Cusp v0.2.

In the context of smoothed aggregation it is necessary to compute the sparse 
matrix-matrix products explicitly since we must have a sparse matrix 
representation of the coarse level operator A_coarse = R * A * P.  It's 
possible that some reformulation of the operation is cheaper though.

Original comment by wnbell on 11 Oct 2010 at 10:16

Added labels: Milestone-Release0.2

GoogleCodeExporter commented 8 years ago

Thanks for the reply!

Right, one cannot avoid matrix-matrix multiplication in a smoothed aggregation 
algorithm. Though, for its non-smoothed counterpart, there could be a way to 
compute P^T * A * P in a more efficient manner than a general matrix-matrix 
product. I mean, one can take advantage of a specific sparsity pattern of P. 
When P is a tentative prolongator, P^T * A * P can be easily computed through 
aggregation of corresponding entries of the matrix A.

But, I know, a non-smoothed approach may degrade the convergence rate of MG.
Anyway, looking forward to v0.2! And, of cause, a multi-GPU version could solve 
the memory issue, but a very long way to go, I think.

Original comment by agnonc...@gmail.com on 13 Oct 2010 at 2:02

GoogleCodeExporter commented 8 years ago

Revision 7244fa8d6c improves this problem.  If you have a GPU with less than 1 
GB of memory you can reduce this constant [1] from 16M to something like 2M or 
4M to conserve memory.

As you say, you can cheat a little when P is unsmoothed.  This routine uses a 
more efficient matrix-matrix multiplication scheme for the (unsmoothed) 
tentative prolongator [2].

[1] 
http://code.google.com/p/cusp-library/source/browse/cusp/detail/device/spmm/coo.
h#208
[2] 
http://code.google.com/p/cusp-library/source/browse/cusp/precond/detail/smoothed
_aggregation.inl#133

Original comment by wnbell on 9 Nov 2010 at 6:14

Changed state: Started

GoogleCodeExporter commented 8 years ago

Yep! Thanks for the improvement. I've tested the latest version of cusp on my 
matrix which has 8M nonzeros and it worked just fine even with 16М of 
workspace. A bigger (12M nonzeros) matrix required the workspace to be reduced 
to 8M. Decreasing the amount of workspace slightly degrades the spmm 
performance - not a big deal from my point of view.

Just for your info, I couldn't compile the fresh ritz routines with VS8 (see 
the attached log-file). Switching to the old estimate_spectral_radius resolved 
the issue.

Original comment by agnonc...@gmail.com on 11 Nov 2010 at 4:35

Attachments:

compile.log

GoogleCodeExporter commented 8 years ago

This issue was closed by revision 8f5fdb8195.

Original comment by wnbell on 14 Feb 2011 at 8:31

Changed state: Fixed

xiaotaiye / cusp-library

device::spmm_coo requires huge anount of memory #31