Open jiangyinzuo opened 1 week ago
Hi, @jiangyinzuo,
Thank you for pointing out this issue. I have verified it, and it does exist. I will investigate the root cause; however, it may take some time as this code was written about three years ago. I truly appreciate your feedback and bringing this to my attention!
Yuanhang
Hi, @jiangyinzuo,
I have identified the root cause that leads to recall = 0
. Initially, I tested my code on an NVIDIA A2 GPU and encountered the same issue as you described. During the process of building the HNSW graph, the following warnings were output:
warning: Cuda API error detected: cudaLaunchCooperativeKernel returned (0x2d0)
warning: Cuda API error detected: cudaLaunchKernel returned (0x9)
The issue arises because the cooperative kernel launch exceeded certain hardware or configuration limits. I suspect that the parameters used in your test caused excessive shared memory usage within the kernel, leading to this problem.
Subsequently, I tested the same code on an NVIDIA L4 GPU. Not only the parameters mentioned in the README, i.e., ./build_128_l2 ../dataset/sift/base.fvecs hnsw 64 16
worked as expected to build the index and return correct query results, but also the parameters you use, i.e., ./build_128_l2 ../dataset/sift/base.fvecs hnsw 128 64
also worked successfully, returning accurate query results.
In summary, this issue likely stems from the resource requirements of the code exceeding the hardware capabilities of the GPU, rather than any fundamental problem with the code itself.
Thanks, my GPU is A100-PCIE-40GB
Mon Nov 18 12:01:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:82:00.0 Off | 0 |
| N/A 27C P0 31W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
and there is no warning when I build the HNSW graph on NVIDIA A100 GPU.
Hi, @jiangyinzuo,
This issue is related to shared memory
, not global memory
. You can try recompiling the program by changing FLAG_DEBUG=-O3
to FLAG_DEBUG=-g -G
in the Makefile
located in the template
folder. This will enable debug mode, allowing you to check if the same warning messages are displayed during execution.
I have tried FLAG_DEBUG=-g -G
, but there is still no warning:
./build_128_l2 ../sift1M/sift_base.fvecs hnsw 64 16
Load data points...
Construct proximity graph hnsw...
Parameters:
d_min = 16
d_max = 32
l_n = 64
the number of layers = 3
Running time: 263.403 seconds
Save proximity graph hnsw...
Done
I wrote a program to get cudaDeviceProp
of my GPU.
Tthe total amount of shared memory per block is 49152 bytes.
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <driver_types.h>
#include <stdio.h>
int main(int argc, char **argv) {
printf("%s Starting ...\n", argv[0]);
int deviceCount = 0;
cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
if (error_id != cudaSuccess) {
printf("cudaGetDeviceCount returned %d\n ->%s\n", (int)error_id,
cudaGetErrorString(error_id));
printf("Result = FAIL\n");
exit(EXIT_FAILURE);
}
if (deviceCount == 0) {
printf("There are no available device(s) that support CUDA\n");
} else {
printf("Detected %d CUDA Capable device(s)\n", deviceCount);
}
int dev = 0, driverVersion = 0, runtimeVersion = 0;
cudaSetDevice(dev);
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
printf("Device %d:\"%s\"\n", dev, deviceProp.name);
cudaDriverGetVersion(&driverVersion);
cudaRuntimeGetVersion(&runtimeVersion);
printf(" CUDA Driver Version / Runtime Version %d.%d / %d.%d\n",
driverVersion / 1000, (driverVersion % 100) / 10,
runtimeVersion / 1000, (runtimeVersion % 100) / 10);
printf(" CUDA Capability Major/Minor version number: %d.%d\n",
deviceProp.major, deviceProp.minor);
printf(" Total amount of global memory: %.2f GBytes\n",
(float)deviceProp.totalGlobalMem / pow(1024.0, 3));
printf(
" GPU Clock rate: %.0f MHz (%0.2f GHz)\n",
deviceProp.clockRate * 1e-3f, deviceProp.clockRate * 1e-6f);
printf(" CUDA cores: %dx%d=%d\n",
deviceProp.multiProcessorCount, 0, 0);
printf(" Memory Bus width: %d-bits\n",
deviceProp.memoryBusWidth);
if (deviceProp.l2CacheSize) {
printf(" L2 Cache Size: %d bytes\n",
deviceProp.l2CacheSize);
}
printf(" Max Texture Dimension Size (x,y,z) "
"1D=(%d),2D=(%d,%d),3D=(%d,%d,%d)\n",
deviceProp.maxTexture1D, deviceProp.maxTexture2D[0],
deviceProp.maxTexture2D[1], deviceProp.maxTexture3D[0],
deviceProp.maxTexture3D[1], deviceProp.maxTexture3D[2]);
printf(" Max Layered Texture Size (dim) x layers 1D=(%d) x "
"%d,2D=(%d,%d) x %d\n",
deviceProp.maxTexture1DLayered[0], deviceProp.maxTexture1DLayered[1],
deviceProp.maxTexture2DLayered[0], deviceProp.maxTexture2DLayered[1],
deviceProp.maxTexture2DLayered[2]);
printf(" Total amount of constant memory %lu bytes\n",
deviceProp.totalConstMem);
printf(" Total amount of shared memory per block: %lu bytes\n",
deviceProp.sharedMemPerBlock);
printf(" Total number of registers available per block:%d\n",
deviceProp.regsPerBlock);
printf(" Warp size: %d\n",
deviceProp.warpSize);
printf(" maximum number of thread per multiprocesser: %d\n",
deviceProp.maxThreadsPerMultiProcessor);
printf(" maximum number of thread per block: %d\n",
deviceProp.maxThreadsPerBlock);
printf(" maximum size of each dimension of a block: %d x %d x %d\n",
deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1],
deviceProp.maxThreadsDim[2]);
printf(" maximum size of each dimension of a grid: %d x %d x %d\n",
deviceProp.maxGridSize[0], deviceProp.maxGridSize[1],
deviceProp.maxGridSize[2]);
printf(" maximum memory pitch %lu bytes\n",
deviceProp.memPitch);
int nDevices;
cudaGetDeviceCount(&nDevices);
for (int i = 0; i < nDevices; i++) {
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, i);
printf("Device Number: %d\n", i);
printf(" Device name: %s\n", prop.name);
printf(" Memory Clock Rate (KHz): %d\n", prop.memoryClockRate);
printf(" Memory Bus Width (bits): %d\n", prop.memoryBusWidth);
printf(" Peak Memory Bandwidth (GB/s): %f\n\n",
2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
printf(" Compute Capability: %d.%d\n", prop.major, prop.minor);
}
int deviceId = -1;
cudaGetDevice(&deviceId);
int smCount = 0;
cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, deviceId);
printf("Number of Streaming Multiprocessors: %d\n", smCount);
return 0;
}
Output
./gpudevice Starting ...
Detected 1 CUDA Capable device(s)
Device 0:"NVIDIA A100-PCIE-40GB"
CUDA Driver Version / Runtime Version 12.2 / 12.1
CUDA Capability Major/Minor version number: 8.0
Total amount of global memory: 39.39 GBytes
GPU Clock rate: 1410 MHz (1.41 GHz)
CUDA cores: 108x0=0
Memory Bus width: 5120-bits
L2 Cache Size: 41943040 bytes
Max Texture Dimension Size (x,y,z) 1D=(131072),2D=(131072,65536),3D=(16384,16384,16384)
Max Layered Texture Size (dim) x layers 1D=(32768) x 2048,2D=(32768,32768) x 2048
Total amount of constant memory 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block:65536
Warp size: 32
maximum number of thread per multiprocesser: 2048
maximum number of thread per block: 1024
maximum size of each dimension of a block: 1024 x 1024 x 64
maximum size of each dimension of a grid: 2147483647 x 65535 x 65535
maximum memory pitch 2147483647 bytes
Device Number: 0
Device name: NVIDIA A100-PCIE-40GB
Memory Clock Rate (KHz): 1215000
Memory Bus Width (bits): 5120
Peak Memory Bandwidth (GB/s): 1555.200000
Compute Capability: 8.0
Number of Streaming Multiprocessors: 108
Hi, @jiangyinzuo,
Could you please share the output of the command ./generate_build_instances.sh [dim] [metric]
?
jiangyinzuo@kiwi:/usr3/jiangyinzuo_data/GANNS$ ./generate_build_instances.sh 128 l2
dim=$1
dis=$2
DIR="build_instance"
if [ -d "$DIR" ]; then
rm -rf ${DIR}
fi
mkdir build_instance || true
cp template/*.h build_instance
cp template/*.cu build_instance
cp -R template/macro build_instance
cp -R template/graph_index build_instance
cp template/Makefile build_instance
cd build_instance
sed -i "s/PLACE_HOLDER_DIM/${dim}/g" kernel_local_graph_construction.h
if [ "${dis}" = "l2" ]; then
make build
elif [ "${dis}" = "cos" ]; then
make build DISTTYPE=USE_COS_DIST_
elif [ "${dis}" = "ip" ]; then
make build DISTTYPE=USE_IP_DIST_
fi
nvcc -ccbin g++ -I../../Common -rdc=true -m64 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75 \
-std=c++11 build.cu -g -G -o build -Xptxas -v \
-DUSE_L2_DIST_
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
ptxas info : 16 bytes gmem
ptxas info : Function properties for _ZSt4sqrtf
8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _ZN10KernelPairIfiEC1Ev
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_60'
ptxas info : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 53 registers, 344 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_60'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 52 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_60'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 150 registers, 400 bytes cmem[0]
ptxas info : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_60'
ptxas info : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 79 registers, 356 bytes cmem[0]
ptxas info : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_60'
ptxas info : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 356 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_60'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 144 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_60'
ptxas info : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 27 registers, 348 bytes cmem[0]
ptxas info : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_60'
ptxas info : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 340 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_60'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 48 registers, 368 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_60'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 160 registers, 408 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_60'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 150 registers, 392 bytes cmem[0]
ptxas info : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __threadfence
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for clock64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __trap
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for sqrtf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __ffs
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __uAtomicAdd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d9atomicAddEPjj
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : 16 bytes gmem
ptxas info : Function properties for _ZSt4sqrtf
8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info : Function properties for _ZN10KernelPairIfiEC1Ev
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_61'
ptxas info : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 53 registers, 344 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_61'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 52 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_61'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 150 registers, 400 bytes cmem[0]
ptxas info : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_61'
ptxas info : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 79 registers, 356 bytes cmem[0]
ptxas info : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_61'
ptxas info : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 32 registers, 356 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_61'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 144 registers, 384 bytes cmem[0]
ptxas info : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_61'
ptxas info : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 27 registers, 348 bytes cmem[0]
ptxas info : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_61'
ptxas info : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 12 registers, 340 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_61'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 48 registers, 368 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_61'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 160 registers, 408 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_61'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 150 registers, 392 bytes cmem[0]
ptxas info : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __threadfence
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for clock64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __trap
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for sqrtf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __ffs
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __uAtomicAdd
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d9atomicAddEPjj
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : 16 bytes gmem
ptxas info : Function properties for _ZSt4sqrtf
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for _ZN10KernelPairIfiEC1Ev
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
16 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_70'
ptxas info : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 57 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_70'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 54 registers, 408 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_70'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 154 registers, 432 bytes cmem[0]
ptxas info : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_70'
ptxas info : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 87 registers, 388 bytes cmem[0]
ptxas info : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_70'
ptxas info : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 388 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_70'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 152 registers, 416 bytes cmem[0]
ptxas info : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_70'
ptxas info : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, 380 bytes cmem[0]
ptxas info : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_70'
ptxas info : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 372 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_70'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 50 registers, 400 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_70'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 168 registers, 440 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_70'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 154 registers, 424 bytes cmem[0]
ptxas info : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for clock64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __trap
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for sqrtf
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for __ffs
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
ptxas info : 16 bytes gmem
ptxas info : Function properties for _ZSt4sqrtf
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for _ZN10KernelPairIfiEC1Ev
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
16 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_75'
ptxas info : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 57 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_75'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 55 registers, 408 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_75'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 154 registers, 432 bytes cmem[0]
ptxas info : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_75'
ptxas info : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 87 registers, 388 bytes cmem[0]
ptxas info : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_75'
ptxas info : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 40 registers, 388 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_75'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 152 registers, 416 bytes cmem[0]
ptxas info : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_75'
ptxas info : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, 380 bytes cmem[0]
ptxas info : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_75'
ptxas info : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 372 bytes cmem[0]
ptxas info : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_75'
ptxas info : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 51 registers, 400 bytes cmem[0]
ptxas info : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_75'
ptxas info : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 168 registers, 440 bytes cmem[0]
ptxas info : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_75'
ptxas info : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 154 registers, 424 bytes cmem[0]
ptxas info : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for clock64
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __trap
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for sqrtf
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for __ffs
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
instance_name="build_${dim}_${dis}"
mv build ${instance_name}
cp ${instance_name} ..
rm ${instance_name}
my nvcc version: nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Hi, @jiangyinzuo,
It seems that the compiler g++
is defined multiple times. You can resolve this issue by explicitly setting the compiler with --compiler-bindir=/usr/bin/gcc-x
. By the way, could you let me know the version of g++ you are using?
jiangyinzuo@kiwi:/usr3/jiangyinzuo_data/GANNS$ g++ --version
g++ (conda-forge gcc 11.4.0-13) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Some Cuda APIs like cudaMallocHost(...)
return cudaError_t
, but the code ignores their return value. Maybe we can check these error?
Hi, @jiangyinzuo,
I'm sorry, I am unsure why this code does not run correctly on your server. I believe it is most likely due to hardware limitations rather than an issue with the code itself. Unfortunately, I cannot provide further suggestions as I am unable to reproduce the issue you are encountering.
Thank you for your suggestions! I will try it on my GPU server again, and when I gather more useful debugging information, I will come back for your assistance.
Hi, @jiangyinzuo,
Okay.
Here are my command line inputs: