Open ShiminZhang21 opened 3 months ago
@ShiminZhang21 We believe this is a problem with the NVIDIA Fortran compiler, or its optimizer to be precise. We have implemented a workaround to be released in the next version of WEST. For now you can recompile the code with reduced optimization, -O1. In our tests, the segmentation fault only occurs with -O2 or higher.
Thank you so much for the suggestion!
If I understand correctly, I should add -O1 to CUDA_F90FLAGS in QE's make.inc? Should I also add this option to LDFLAGS?
Just search for -fast, then replace it with -O1.
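The search-and-replace suggested above could be scripted roughly as follows. This is only a sketch: the function name `lower_optimization` and the sample text are hypothetical, and the sample lines are not the actual contents of a QE make.inc. It replaces -fast only where it appears as a standalone flag.

```python
import re

def lower_optimization(make_inc_text, old_flag="-fast", new_flag="-O1"):
    """Replace every standalone occurrence of old_flag with new_flag."""
    # Match the flag only as a whole whitespace-delimited token, so a
    # longer flag that merely starts with "-fast" is left untouched.
    pattern = r"(?<!\S)" + re.escape(old_flag) + r"(?!\S)"
    return re.sub(pattern, new_flag, make_inc_text)

# Made-up sample lines for illustration (assumption, not a real make.inc).
sample = "CUDA_F90FLAGS = -fast -g\nF90FLAGS = -fast\n"
patched = lower_optimization(sample)
print(patched)
```

To apply it to a real file, one would read make.inc, pass its text through the function, and write the result back after keeping a backup copy, then rebuild.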
Thank you! I have tested the setting, and the large system now runs to completion. It is significantly slower at the lower optimization level, but that is expected.
@vyu16
Hi Victor,
I can now run wstat.x for the dielectric screening using the compilation you suggested. However, when I try to run GW with wfreq.x, I get this error message related to the FFT:
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
File: /global/common/software/m4507/szhan213/WEST_GPU/qe-7.3-west6.0_compile2/West/FFT_kernel/fft_at_k.f90
Function: single_invfft_k:32
Line: 64
I use the same parallel settings as for wstat.x.
Do you have any suggestions that could help me figure out what the problem is?
Thanks, Shimin
Can you share the input and output files and the job script, please?
Sure, I have attached the job script, input, and outputs. I just realized that my qp_bandrange is outside the range of bands computed in the SCF calculation. That could be the problem, so I'm retesting it.
NV.zip
Yes, that would have led to a crash. In the next release the code will catch such errors.
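Until such a check exists in the code, the input can be validated by hand before submitting the job. The snippet below is a minimal sketch, not WEST's implementation: the function name `check_band_range` is hypothetical, and it assumes qp_bandrange is a pair of 1-based band indices (first, last) and that the number of bands from the SCF calculation is known.

```python
def check_band_range(qp_bandrange, n_scf_bands):
    """Return True if the requested bands all exist in the SCF run.

    Assumes qp_bandrange = (first, last) with 1-based band indices.
    """
    first, last = qp_bandrange
    if first < 1 or last < first:
        raise ValueError(f"invalid qp_bandrange {list(qp_bandrange)}")
    if last > n_scf_bands:
        raise ValueError(
            f"qp_bandrange ends at band {last}, but the SCF calculation "
            f"only computed {n_scf_bands} bands"
        )
    return True

# Example with made-up numbers: bands 1..10 requested, 20 bands available.
print(check_band_range([1, 10], 20))
```

If the range is too large, either reduce qp_bandrange or rerun the SCF step with more bands.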
Hi WEST team,
I have been having a problem running the GPU version of WEST's wstat.x on NERSC Perlmutter. Small systems run without problems, but I cannot complete a single calculation for large systems. No matter which parallel settings I try, it crashes at some point with a segmentation fault.
I have attached my test files for a 192-atom ZnO supercell, run with different parallel settings using npdep=2496:
"Compile_west_gpu_v1.sh" and "Compile_west_gpu_v2.sh" are the two compilation scripts I tried.
"ZnO_wstat2496/Nni_" are the parallel tests, where N is the number of GPUs and ni is the -ni parallel setting for wstat.x.
"ZnO_wstat_2496/slurm.out.reports" is a report of the Slurm error messages. Whenever there is no memory issue, the segmentation fault always appears.
"ZnO_wstat_2496/wstat.out.reports" is a report of where each wstat.out ends. Some end right at the start, some at 70%.
Besides the ZnO 4x4x3 supercell, I also tested other systems, such as VB- in hBN with 161 atoms. A similar issue appears.
Do you have any idea how to solve this problem?
Seg_fault.zip