yambo-code / yambo

This is the official GPL repository of the yambo code
http://www.yambo-code.eu/
GNU General Public License v2.0

Memory in the SLEPC solver #125

Open attacc opened 3 years ago

attacc commented 3 years ago

Dear all

I noticed that the SLEPC solver usually needs more memory than the BSE construction. When I run a big BSE calculation the code crashes at the beginning of the solver; then I start it again with more memory and it works.

It would be nice to have the possibility to define a different number of processors for the SLEPC solver. What do you think?

sangallidavide commented 3 years ago

Notice that, when running in parallel, you can choose between the faster algorithm, which duplicates memory, and the slower one, which distributes it. The latter should be the default.

http://www.yambo-code.org/wiki/index.php?title=File:Yambo-Cheatsheet-5.0_P20.png

Were you running with the distributed-memory algorithm?

With the distributed algorithm, the BS_kernel should be re-distributed onto the shell matrix, going from the yambo to the slepc parallelization. I think it currently does a duplication. In the file https://github.com/yambo-code/yambo-devel/blob/develop/src/bse/K_shell_matrix.F there is this comment by @henriquemiranda

 !
 ! Allocate slepc shell matrix
 !
 ! We let petsc decide which part of the matrix goes to each core.
 ! TODO: In the future it should be done according to the BS parallelization
 ! to avoid the scattering vi (distributed) -> x (local) in K_multiply_by_V_slepc

Maybe working on that could improve the memory distribution (?)
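
For reference, here is a minimal SLEPc shell-matrix sketch in C (assuming a recent PETSc/SLEPc with the PetscCall error macros; this is not yambo's implementation, and MyKernelMult is just a placeholder for the role played by K_multiply_by_V). It shows the choice the TODO refers to: either let PETSc decide the local row counts (PETSC_DECIDE) or prescribe them to follow an existing distribution such as the BSE one.

 #include <slepceps.h>

 /* Illustrative matrix-vector product; in yambo this role is played by
    K_multiply_by_V (here just y = 2*x as a placeholder). */
 static PetscErrorCode MyKernelMult(Mat A, Vec x, Vec y)
 {
   PetscFunctionBeginUser;
   PetscCall(VecCopy(x, y));
   PetscCall(VecScale(y, 2.0));
   PetscFunctionReturn(PETSC_SUCCESS);
 }

 int main(int argc, char **argv)
 {
   Mat      A;
   EPS      eps;
   PetscInt N      = 1000;          /* global matrix size (illustrative)        */
   PetscInt nlocal = PETSC_DECIDE;  /* let PETSc choose the local row count; set
                                       it explicitly to follow an existing
                                       (e.g. BSE) distribution instead          */

   PetscCall(SlepcInitialize(&argc, &argv, NULL, NULL));

   /* Shell matrix: no storage, only the user-provided matvec */
   PetscCall(MatCreateShell(PETSC_COMM_WORLD, nlocal, nlocal, N, N, NULL, &A));
   PetscCall(MatShellSetOperation(A, MATOP_MULT, (void (*)(void))MyKernelMult));

   /* Krylov eigensolver acting only through the matvec */
   PetscCall(EPSCreate(PETSC_COMM_WORLD, &eps));
   PetscCall(EPSSetOperators(eps, A, NULL));
   PetscCall(EPSSetProblemType(eps, EPS_HEP));
   PetscCall(EPSSetFromOptions(eps));
   PetscCall(EPSSolve(eps));

   PetscCall(EPSDestroy(&eps));
   PetscCall(MatDestroy(&A));
   PetscCall(SlepcFinalize());
   return 0;
 }

The vectors passed to the matvec inherit the row layout of the shell matrix, which is why prescribing the local sizes according to the BS parallelization would remove the distributed-to-local scatter mentioned in the comment.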

attacc commented 3 years ago

Dear Davide

I use the slower algorithm that uses less memory; I get the message: "Slower algorithm but BSE matrix distributed over MPI tasks"

My typical run is on 120 cores, with parallelization BS_CPU= "24 2 2".

Then, at the SLEPC step, it crashes, and I run it again using only 12 cores with 3 threads... and SLEPC is very fast.

I will have a look at that part.

best Claudio

sangallidavide commented 3 years ago

I see. I think there is no easy workaround besides interrupting the run and restarting with a different number of MPI tasks (thanks to parallel I/O this is possible!). The same issue exists with Haydock, for example. I do not see easy alternative solutions... yambopy is possibly a good tool to handle this kind of situation.

attacc commented 3 years ago

Davide, I think a possible workaround is to reuse the number of processors of the diagonalization/inversion for SLEPC and Haydock too:

BS_nCPU_LinAlg_INV=-1    # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)
BS_nCPU_LinAlg_DIAGO=-1  # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)

or define a new variable. The matrix can be written to disk and then loaded again with this new parallelization. (After vacation)
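
As a hedged illustration of the write-to-disk idea (independent of yambo's own netCDF databases), PETSc itself can dump an assembled matrix to a binary file and reload it in a later run with a different number of MPI ranks, redistributing the rows on load. The file name and the MATAIJ choice below are just examples.

 #include <petscmat.h>

 /* Write an assembled matrix A to disk, then reload it (possibly in a
    separate run with a different number of MPI ranks); PETSc picks a new
    row distribution on load. "bse_mat.bin" is just an example file name. */
 static PetscErrorCode save_matrix(Mat A, const char *fname)
 {
   PetscViewer viewer;
   PetscFunctionBeginUser;
   PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, fname, FILE_MODE_WRITE, &viewer));
   PetscCall(MatView(A, viewer));
   PetscCall(PetscViewerDestroy(&viewer));
   PetscFunctionReturn(PETSC_SUCCESS);
 }

 static PetscErrorCode load_matrix(const char *fname, Mat *A)
 {
   PetscViewer viewer;
   PetscFunctionBeginUser;
   PetscCall(MatCreate(PETSC_COMM_WORLD, A));
   PetscCall(MatSetType(*A, MATAIJ));   /* stored (non-shell) format          */
   PetscCall(PetscViewerBinaryOpen(PETSC_COMM_WORLD, fname, FILE_MODE_READ, &viewer));
   PetscCall(MatLoad(*A, viewer));      /* rows redistributed here            */
   PetscCall(PetscViewerDestroy(&viewer));
   PetscFunctionReturn(PETSC_SUCCESS);
 }

Note that this only applies if the matrix is actually stored in a PETSc format; a pure shell matrix has nothing to dump, which ties in with the slepc_mat question below.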

sangallidavide commented 3 years ago

As written, it would be the same for the Haydock solver; indeed slepc uses the Haydock subroutine K_multiply_by_V. At present the parallelization scheme of Haydock is defined based on the BSE parallelization scheme, and I do not see an easy way to redefine it based on BS_nCPU_LinAlg_DIAGO or similar.

The memory issue may be due to the definition of the slepc_mat, and this is something I do not quite understand. In the shell case it is defined but never really used (?). This is also related to the idea of using I/O: I think the slepc_mat is never actually loaded ...

sangallidavide commented 1 month ago

I recently tested the memory use in slepc.

It indeed grows with the number of MPI tasks. This might be the reason why, at present, we are not able to check issue #88 on GPUs. As discussed by the developers of SLEPc, this was made worse by the GPU porting and very recently improved again in a development version. Beyond this change, the scaling becomes much better for larger matrices.

Results:

BSE matrix ~1.5 GB (N ~ 10^4):
a) Very bad memory scaling after the GPU porting. [plot attachment]
b) Improved memory scaling (still growing). [plot attachment]

The memory scaling improves significantly for larger matrices.

BSE matrix ~10 GB (N ~ 3x10^4):
b') Improved memory scaling (still growing). [plot attachment]
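
For completeness, a sketch of one way to instrument per-rank memory from PETSc (an illustration, not necessarily how the numbers above were obtained):

 #include <petscsys.h>

 /* Print the resident-set size seen by each MPI rank; a sketch of how one
    might track the per-task memory growth discussed above. */
 static PetscErrorCode report_memory(const char *label)
 {
   PetscLogDouble rss;
   PetscMPIInt    rank;
   PetscFunctionBeginUser;
   PetscCall(PetscMemoryGetCurrentUsage(&rss));
   PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
   PetscCall(PetscSynchronizedPrintf(PETSC_COMM_WORLD, "[%d] %s: %.1f MB\n",
                                     (int)rank, label, rss / 1048576.0));
   PetscCall(PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT));
   PetscFunctionReturn(PETSC_SUCCESS);
 }

Running with the -memory_view command-line option also prints a memory summary at the end of the run.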