yambo-code / yambo

This is the official GPL repository of the yambo code
http://www.yambo-code.eu/
GNU General Public License v2.0

Test 02_QP_PPA in LiF/GW-OPTICS fails with random parallelization using nvfortran #107

Open sangallidavide opened 4 months ago

sangallidavide commented 4 months ago

The error message is:

 <02s> P4: [ERROR] Allocation of X_par%blc_d failed with code 1
P4: [ERROR] STOP signal received while in[05] Dynamic Dielectric Matrix (PPA)
P4: [ERROR] Not enough memory to allocate 0 bytes

mikeatm commented 3 months ago

I ran into a similar problem when running BSE calculations with nvsdk 24.3 and CUDA 12.3. When compiled with the slightly older (24.3) nvfortran, the zero-sized allocation error came from a failure of the kernel at the location below, which produced -1,-1,-1 for ln:

src/wf_and_fft/fft_setup.F:99

That seems to be resolved with 24.5 and CUDA 12.4. Instead, a new error occurs on bug-fixes (f859a7f490ce5fe4b9ec20f0f45ac086b207c594) and maintenance-master (3d7b25df8ee6d4d924f9efc68bdb4a16af2be51a):

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x1554c5287540

Thread 1 "yambo" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 3080883, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x00001554c52875f0 in x_redux_x_redux_build_kernel_367_gpu
   <<<(2,11,1),(32,4,1)>>> ()

I expect this comes from the following line:

0x00001554c52a04b0 in x_redux_x_redux_build_kernel_367_gpu
   <<<(2,11,1),(32,4,1)>>> (iq=9)
    at /home/max/applications/yambo-5.2.1/src/pol_function/X_redux.F:368

This seems related to #120, and loosely to #76.