Open jeanlucf22 opened 1 year ago
How many MPI ranks are you using?
1
I can reproduce the issue on Summit at OLCF. SuperLU build:
module load gcc/9.3.0
module load parmetis/4.0.3
module load metis
module load cuda
module load cmake
module load essl

rm -rf build
mkdir build
cd build

export PARMETIS_ROOT=$OLCF_PARMETIS_ROOT
export METIS_DIR=$OLCF_METIS_ROOT
export CUDA_BIN_PATH=$CUDA_PATH
export CUDAToolkit_ROOT=$CUDA_PATH
export CMAKE_PREFIX_PATH=${CMAKE_PREFIX_PATH}:${OPENMPI_ROOT}

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DTPL_ENABLE_LAPACKLIB=on \
  -DCMAKE_C_FLAGS="-std=c99 -fPIC -DPRNTlevel=1 -DPROFlevel=1 -DGPU_SOLVE" \
  -DTPL_ENABLE_CUDALIB=on \
  -DCMAKE_C_COMPILER=mpicc \
  -DCMAKE_CXX_COMPILER=mpicxx \
  -DCMAKE_Fortran_COMPILER=mpifort \
  -DCMAKE_CUDA_COMPILER=${CUDA_BIN_PATH}/bin/nvcc \
  -D TPL_ENABLE_CUDALIB:BOOL=ON \
  -D CUDA_CUBLAS_LIBRARIES="${CUDA_BIN_PATH}/lib64/libcublas.so" \
  -D CMAKE_CUDA_ARCHITECTURES="70" \
  -D CMAKE_CUDA_HOST_COMPILER=mpicxx \
  -D CMAKE_CUDA_FLAGS:STRING="-ccbin mpicxx" \
  -DTPL_ENABLE_PARMETISLIB=on \
  -DTPL_PARMETIS_INCLUDE_DIRS="${PARMETIS_ROOT}/include;${METIS_DIR}/include" \
  -DTPL_PARMETIS_LIBRARIES="${PARMETIS_ROOT}/lib/libparmetis.so;${METIS_DIR}/lib/libmetis.so" \
  -DTPL_ENABLE_INTERNAL_BLASLIB=OFF \
  -DXSDK_ENABLE_Fortran=OFF \
  -DBUILD_SHARED_LIBS=on \
  -DCMAKE_INSTALL_PREFIX=.
Run script:
export SUPERLU_ACC_OFFLOAD=1
export OMP_NUM_THREADS=1

cd build
make test
Result:
Test project /ccs/home/jeanluc/GIT/superlu_dist/build
    Start 1: pdtest_1x1_1_2_8_20_SP
1/27 Test #1: pdtest_1x1_1_2_8_20_SP ...........***Timeout 1500.16 sec
Thanks for providing these helpful instructions. I can reproduce the issue now. The problem is that calling pdgssvx with nrhs=0 skips some of the setup for GPU solves, which causes a hang when pdgssvx is called later with nrhs>0 and options->Fact=FACTORED. This commit should fix the problem: https://github.com/xiaoyeli/superlu_dist/commit/1aa8e658586abed61bed519aeee2a20bec99e0d7
However, the GPU solve in the master branch supports only nmpi=1. You will still see failures reported by "make test" for the tests run with mpirun -np >1. I recommend not enabling GPU solve for the smoke/regression tests.
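One way to follow that recommendation is to keep -DGPU_SOLVE out of the C flags unless it is asked for explicitly. The sketch below is only an illustration: superlu_c_flags and its 0/1 argument are hypothetical names made up for this example, not part of SuperLU_DIST or CMake, and the remaining flags are copied from the build above.

```shell
#!/bin/sh
# Hypothetical helper (not part of SuperLU_DIST): print the C flags for the
# configure step, adding -DGPU_SOLVE only when the first argument is 1.
superlu_c_flags() {
    # Base flags taken from the build reported in this thread.
    flags="-std=c99 -fPIC -DPRNTlevel=1 -DPROFlevel=1"
    if [ "${1:-0}" = "1" ]; then
        flags="$flags -DGPU_SOLVE"
    fi
    printf '%s\n' "$flags"
}

# Usage for a build meant for "make test" (GPU solve left out):
#   cmake .. -DCMAKE_C_FLAGS="$(superlu_c_flags 0)" <other options as above>
# Usage for a single-rank GPU solve build:
#   cmake .. -DCMAKE_C_FLAGS="$(superlu_c_flags 1)" <other options as above>
```

Because CMAKE_C_FLAGS is cached, changing the flag set is safest from a fresh build directory, as in the rm -rf build step above.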
When I build superlu_dist with -DGPU_SOLVE in the C flags, the test suite seems to fail after printing:
.. B to X redistribute time 0.0001
.. Setup L-solve time 0.0000
.. L-solve time 0.0003
.. L-solve time (MAX) 0.0003
.. Setup U-solve time 0.0000