yambo-code / yambo

This is the official GPL repository of the yambo code
http://www.yambo-code.eu/
GNU General Public License v2.0

cuSolver fails with nvfortran >= 23.11 #76

Open sangallidavide opened 5 months ago

sangallidavide commented 5 months ago

The bug happens when running with GPU support (CUDAF)

Detected on my desktop (nvfortran 24.3, cuda 12.3) and on Leonardo (nvfortran 23.11, cuda 11.8 and 12.3)

Error message

[ERROR] STOP signal received while in[04] Optics
[ERROR] LINEAR ALGEBRA driver [SERIAL_lin_system_gpu]cusolverDnCgetrs failed

The error code is CUSOLVER_STATUS_EXECUTION_FAILED (see https://docs.nvidia.com/cuda/cusolver/index.html)

(Sometimes it also fails earlier, at cusolverDnCreate)

sangallidavide commented 5 months ago

Bug fixed by moving the contained subroutine in X_redux.F to an independent subroutine
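
For readers unfamiliar with the distinction, the kind of refactor described above can be sketched roughly as follows. This is only an illustration in plain Fortran: the names `solver_sketch`, `scale_block`, and `scale_contained` are hypothetical, the arithmetic is a stand-in, and none of it is taken from X_redux.F or from the actual commit. It shows a contained (internal) subroutine, which reaches the host's variables through host association, next to an independent subroutine that receives the same data through its argument list.

```fortran
! Illustrative only: hypothetical names, not the actual X_redux.F code.
module solver_sketch
  implicit none
contains
  ! "After": an independent (module-level) subroutine.
  ! Everything it needs is passed explicitly through the argument list.
  subroutine scale_block(n, factor, a)
    integer, intent(in)    :: n
    real,    intent(in)    :: factor
    real,    intent(inout) :: a(n)
    a = factor * a
  end subroutine scale_block
end module solver_sketch

program refactor_sketch
  use solver_sketch, only: scale_block
  implicit none
  integer, parameter :: n = 4
  real :: a(n), b(n)
  real :: factor

  factor = 2.0
  a = 1.0
  b = 1.0

  call scale_contained()          ! "before": internal procedure
  call scale_block(n, factor, b)  ! "after": independent subroutine

  ! Both variants perform the same operation, so the results must match.
  if (any(a /= b)) error stop "results differ"
  print *, "both variants agree"

contains
  ! "Before": a contained (internal) subroutine that accesses the host's
  ! variables a and factor through host association.
  subroutine scale_contained()
    a = factor * a
  end subroutine scale_contained
end program refactor_sketch
```

The two variants are functionally equivalent here; the thread does not state why the contained form triggers the cuSolver failure with nvfortran >= 23.11, so the sketch makes no claim about the cause.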

andreamarini commented 5 months ago

I am having the same problem. In which branch did you split X_redux?

sangallidavide commented 5 months ago

The original branch is https://github.com/yambo-code/yambo-devel/tree/tech/devel-gpu However, that branch is quite far ahead of develop, so it is probably best to look at the GPL master.

This is the commit: https://github.com/yambo-code/yambo/commit/7197a330399a9542d4178a5899b2ddbecbaec023

andreamarini commented 5 months ago

I realized that all the past runs on eliud and mo with CUDA failed not because of a buggy compilation but precisely because of a cuSolver crash.

https://media.yambo-code.eu/robots/develop/eliud.kipchoge.2_develop_1_error.php

If these failures are connected to this bug, then the fix should be introduced into bug-fixes ASAP.

sangallidavide commented 5 months ago

The cuSolver error does not affect tests like Al111/04_HF, so the situation on eliud is different.

sangallidavide commented 5 months ago

Here the failures were likely due to cuSolver: https://media.yambo-code.eu/robots/develop/mo.farah.4_develop_1_error.php

As you can see, for Al111, 02_eels fails, while 04_HF is ok