Open sangallidavide opened 5 months ago
Bug fixed by moving the contained subroutine in X_redux.F to an independent subroutine
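For readers unfamiliar with the distinction, here is a minimal generic sketch of what "moving a contained subroutine to an independent subroutine" means in Fortran. It is not the code from X_redux.F: the names `host_before`, `scale_local`, `scale_external` and `host_after` are hypothetical and only illustrate the pattern, under the assumption that the helper originally used the host's variables via host association.

```fortran
! Before: the helper is a contained (internal) subroutine and reaches
! the host's variables through host association.
subroutine host_before(n, a)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n)
  call scale_local()
contains
  subroutine scale_local()
    a = 2.0 * a            ! 'a' is taken from the host scope
  end subroutine scale_local
end subroutine host_before

! After: the helper is an independent subroutine and every piece of
! data it needs is passed explicitly as an argument.
subroutine scale_external(n, a)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n)
  a = 2.0 * a
end subroutine scale_external

subroutine host_after(n, a)
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: a(n)
  call scale_external(n, a)
end subroutine host_after

! Small driver (hypothetical) checking that both variants agree.
program demo
  implicit none
  real :: x(3), y(3)
  x = (/ 1.0, 2.0, 3.0 /)
  y = x
  call host_before(3, x)
  call host_after(3, y)
  print *, 'variants agree: ', all(x == y)
end program demo
```

The two variants are numerically equivalent; the only difference is whether the helper reaches its data through the host scope or through its argument list.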
I am having the same problem. In which branch did you split X_redux?
The original branch is https://github.com/yambo-code/yambo-devel/tree/tech/devel-gpu However, that branch is quite far ahead of develop. Probably the best option is to look at the gpl master.
This is the commit: https://github.com/yambo-code/yambo/commit/7197a330399a9542d4178a5899b2ddbecbaec023
I realized that all past runs on eliud and mo with CUDA failed not because of a buggy compilation but precisely because of a cuSolver crash.
https://media.yambo-code.eu/robots/develop/eliud.kipchoge.2_develop_1_error.php
If these failures are connected to this bug, then the fix should be introduced ASAP in the bug-fixes.
The cusolver error does not affect tests like Al111/04_HF
So the situation on eliud is different.
Here the failures were likely due to cuSolver: https://media.yambo-code.eu/robots/develop/mo.farah.4_develop_1_error.php
As you can see, for Al111, 02_eels fails, while 04_HF is ok.
The bug happens when running with GPU support (CUDAF)
Detected on my desktop (nvfortran 24.3, cuda 12.3) and on Leonardo (nvfortran 23.11, cuda 11.8 and 12.3)
Error message
Error code is CUSOLVER_STATUS_EXECUTION_FAILED (see https://docs.nvidia.com/cuda/cusolver/index.html)
Sometimes it also fails earlier, at cusolverDnCreate.