attacc opened this issue 3 years ago
Notice that, when running in parallel, you can decide whether to use the faster algorithm, which duplicates memory, or the slower one, which distributes it. The latter should be the default.
http://www.yambo-code.org/wiki/index.php?title=File:Yambo-Cheatsheet-5.0_P20.png
Were you running with the distribute memory algorithm?
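To make the trade-off between the two algorithms concrete, here is a back-of-envelope memory estimate for an n x n complex double-precision (16 bytes/element) BSE matrix. This is illustrative arithmetic only, not yambo's actual bookkeeping, and the function name is hypothetical:

```python
def bse_matrix_gb_per_rank(n, n_ranks, distributed):
    """Per-MPI-rank memory in GB for an n x n complex(16-byte) BSE matrix."""
    total_bytes = n * n * 16  # complex double precision elements
    # duplicated: every rank holds the full matrix
    # distributed: each rank holds roughly 1/n_ranks of it
    per_rank = total_bytes / n_ranks if distributed else total_bytes
    return per_rank / 1024**3

# For N ~ 10^4 the full matrix is ~1.5 GB: duplicated, each of 120 ranks
# pays the full ~1.5 GB; distributed, each holds only ~13 MB.
```

The point is that the duplicated algorithm's per-rank cost is independent of the number of ranks, so total memory grows linearly with the MPI task count.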
With the distributed algorithm, it should redistribute the BS_kernel on the shell matrix from the yambo to the slepc parallelization. I think it does a duplication instead. In the file https://github.com/yambo-code/yambo-devel/blob/develop/src/bse/K_shell_matrix.F there is this comment by @henriquemiranda:
!
! Allocate slepc shell matrix
!
! We let petsc decide which part of the matrix is in each core.
! TODO: In the future it should be done according to the BS parallelization
! to avoid the scattering vi (distributed) -> x (local) in K_multiply_by_V_slepc
Maybe working on that could improve the memory distribution (?)
Dear Davide
I use the slower algorithm that uses less memory; I get the message "Slower alogorithm but BSE matrix distributed over MPI tasks".
My typical run is on 120 cores, with parallelization BS_CPU= "24 2 2".
Then, at the SLEPC step, it crashes, so I run it again using only 12 cores with 3 threads... and SLEPC is very fast.
I will have a look to that part
best Claudio
I see. I think there is no easy workaround besides interrupting the run and restarting with a different number of MPI tasks (thanks to parallel I/O this is possible!). The same issue exists with Haydock, for example. I do not see easy alternative solutions... yambopy is possibly a good tool to handle this kind of situation.
Davide, I think a possible workaround is to use the number of processors of the diagonalization/inversion for SLEPC and Haydock too:

BS_nCPU_LinAlg_INV=-1    # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)
BS_nCPU_LinAlg_DIAGO=-1  # [PARALLEL] CPUs for Linear Algebra (if -1 it is automatically set)

or define a new variable. The matrix can be written to disk and then loaded again with this new parallelization. (After vacation)
As written, it would be the same for the Haydock solver; indeed SLEPC uses the Haydock subroutine K_multiply_by_V. At present the parallelization scheme of Haydock is defined based on the BSE parallelization scheme, and I do not see an easy way to redefine it based on BS_nCPU_LinAlg_DIAGO or similar.
The memory issue may be due to the definition of slepc_mat, and this is something I do not quite understand: in the shell case it is defined but never really used (?). This is also related to the idea of using I/O. I think slepc_mat is never really loaded...
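The scattering discussed in the TODO above can be sketched in serial Python as a stand-in for the MPI pattern. This is a simplified, hypothetical emulation of what K_multiply_by_V_slepc has to do when the SLEPc and yambo distributions disagree:

```python
def shell_matvec(v_chunks, kernel_rows):
    """Emulate one shell-matrix matvec across len(v_chunks) 'ranks'.

    v_chunks[r]    : slice of the Krylov vector owned by rank r
    kernel_rows[r] : rows of the BSE kernel owned by rank r
    """
    # vi (distributed) -> x (local): every rank rebuilds the full
    # vector; in the real MPI code this is a gather/allgather step
    x = [val for chunk in v_chunks for val in chunk]
    # each rank then applies its local kernel rows to the full vector
    return [[sum(a * b for a, b in zip(row, x)) for row in rows]
            for rows in kernel_rows]
```

Note that every rank holds a full copy of the vector during the matvec, so this part of the working set does not shrink as ranks are added, which is consistent with the memory growth reported below.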
I recently tested the memory use in slepc.
It indeed grows with the number of MPI tasks. This might be the reason why, at present, we are not able to check issue #88 on GPUs. As discussed by the SLEPC developers, this was made worse by the GPU porting and very recently improved again in a development version. Beyond this change, the scaling becomes much better for larger matrices.
Results (memory-scaling plots were attached in the original comment):

BSE matrix of ~1.5 GB (N ~ 10^4):
a) Very bad memory scaling after the GPU porting.
b) Improved memory scaling (still growing).

The memory scaling improves significantly for larger matrices.

BSE matrix of ~10 GB (N ~ 3x10^4):
b') Improved memory scaling (still growing).
Dear all
I noticed that the SLEPC solver usually needs more memory than the BSE construction. When I run a big BSE calculation, the code crashes at the beginning of the solver; then I start it again with more memory and it works.
It would be nice to have the possibility to define a different number of processors for SLEPC. What do you think?