yuyuanhang / GANNS

Article: GPU-accelerated Proximity Graph Approximate Nearest Neighbor Search and Construction by Authors Yuanhang Yu, Dong Wen, Ying Zhang, Lu Qin, Wenjie Zhang and Xuemin Lin
15 stars 3 forks source link

0 Recall when executing HNSW query #2

Open jiangyinzuo opened 1 week ago

jiangyinzuo commented 1 week ago

Here are my command line inputs:

./generate_build_instances.sh 128 l2
./generate_query_instances.sh 128 l2
./build_128_l2 ../sift1M/sift_base.fvecs hnsw 128 64
Load data points...

Construct proximity graph hnsw...

Parameters:
           d_min = 64
           d_max = 128
           l_n = 128
           the number of layers = 2

Running time: 58.1736 seconds
Save proximity graph hnsw...

Done
./query_128_l2 ../sift1M/sift_base.fvecs ../sift1M/sift_query.fvecs hnsw ../sift1M/base.fbin_128_64.hnsw ../sift1M/sift_groundtruth.ivecs 64 10
Load groundtruth...

Load data points and query points...

Load proximity graph...

Search...

Parameters:
           the number of topk = 16
           e = 64

Query speed: 64701.9 queries per second
Recall: 0
./query_128_l2 ../sift1M/sift_base.fvecs ../sift1M/sift_query.fvecs hnsw ../sift1M/base.fbin_128_64.hnsw ../sift1M/sift_groundtruth.ivecs 128 10
Load groundtruth...

Load data points and query points...

Load proximity graph...

Search...

Parameters:
           the number of topk = 16
           e = 128

Query speed: 64744.6 queries per second
Recall: 0
yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

Thank you for pointing out this issue. I have verified it, and it does exist. I will investigate the root cause; however, it may take some time as this code was written about three years ago. I truly appreciate your feedback and bringing this to my attention!

Yuanhang

yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

I have identified the root cause that leads to recall = 0. Initially, I tested my code on an NVIDIA A2 GPU and encountered the same issue as you described. During the process of building the HNSW graph, the following warnings were output:

warning: Cuda API error detected: cudaLaunchCooperativeKernel returned (0x2d0)
warning: Cuda API error detected: cudaLaunchKernel returned (0x9)

The issue arises because the cooperative kernel launch exceeded certain hardware or configuration limits. I suspect that the parameters used in your test caused excessive shared memory usage within the kernel, leading to this problem.

Subsequently, I tested the same code on an NVIDIA L4 GPU. Not only the parameters mentioned in the README, i.e., ./build_128_l2 ../dataset/sift/base.fvecs hnsw 64 16 worked as expected to build the index and return correct query results, but also the parameters you use, i.e., ./build_128_l2 ../dataset/sift/base.fvecs hnsw 128 64 also worked successfully, returning accurate query results.

In summary, this issue likely stems from the resource requirements of the code exceeding the hardware capabilities of the GPU, rather than any fundamental problem with the code itself.

jiangyinzuo commented 1 week ago

Thanks, my GPU is A100-PCIE-40GB

Mon Nov 18 12:01:38 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:82:00.0 Off |                    0 |
| N/A   27C    P0              31W / 250W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

and there is no warning when I build the HNSW graph on NVIDIA A100 GPU.

yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

This issue is related to shared memory, not global memory. You can try recompiling the program by changing FLAG_DEBUG=-O3 to FLAG_DEBUG=-g -G in the Makefile located in the template folder. This will enable debug mode, allowing you to check if the same warning messages are displayed during execution.

jiangyinzuo commented 1 week ago

I have tried FLAG_DEBUG=-g -G, but there is still no warning:

./build_128_l2 ../sift1M/sift_base.fvecs hnsw 64 16
Load data points...

Construct proximity graph hnsw...

Parameters:
           d_min = 16
           d_max = 32
           l_n = 64
           the number of layers = 3

Running time: 263.403 seconds
Save proximity graph hnsw...

Done

I wrote a program to get cudaDeviceProp of my GPU. Tthe total amount of shared memory per block is 49152 bytes.

#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <driver_types.h>
#include <stdio.h>

int main(int argc, char **argv) {
  printf("%s Starting ...\n", argv[0]);
  int deviceCount = 0;
  cudaError_t error_id = cudaGetDeviceCount(&deviceCount);
  if (error_id != cudaSuccess) {
    printf("cudaGetDeviceCount returned %d\n ->%s\n", (int)error_id,
           cudaGetErrorString(error_id));
    printf("Result = FAIL\n");
    exit(EXIT_FAILURE);
  }
  if (deviceCount == 0) {
    printf("There are no available device(s) that support CUDA\n");
  } else {
    printf("Detected %d CUDA Capable device(s)\n", deviceCount);
  }
  int dev = 0, driverVersion = 0, runtimeVersion = 0;
  cudaSetDevice(dev);
  cudaDeviceProp deviceProp;
  cudaGetDeviceProperties(&deviceProp, dev);
  printf("Device %d:\"%s\"\n", dev, deviceProp.name);
  cudaDriverGetVersion(&driverVersion);
  cudaRuntimeGetVersion(&runtimeVersion);
  printf("  CUDA Driver Version / Runtime Version         %d.%d  /  %d.%d\n",
         driverVersion / 1000, (driverVersion % 100) / 10,
         runtimeVersion / 1000, (runtimeVersion % 100) / 10);
  printf("  CUDA Capability Major/Minor version number:   %d.%d\n",
         deviceProp.major, deviceProp.minor);
  printf("  Total amount of global memory:                %.2f GBytes\n",
         (float)deviceProp.totalGlobalMem / pow(1024.0, 3));
  printf(
      "  GPU Clock rate:                               %.0f MHz (%0.2f GHz)\n",
      deviceProp.clockRate * 1e-3f, deviceProp.clockRate * 1e-6f);
  printf("  CUDA cores:                                   %dx%d=%d\n",
         deviceProp.multiProcessorCount, 0, 0);
  printf("  Memory Bus width:                             %d-bits\n",
         deviceProp.memoryBusWidth);
  if (deviceProp.l2CacheSize) {
    printf("  L2 Cache Size:                                    %d bytes\n",
           deviceProp.l2CacheSize);
  }
  printf("  Max Texture Dimension Size (x,y,z)            "
         "1D=(%d),2D=(%d,%d),3D=(%d,%d,%d)\n",
         deviceProp.maxTexture1D, deviceProp.maxTexture2D[0],
         deviceProp.maxTexture2D[1], deviceProp.maxTexture3D[0],
         deviceProp.maxTexture3D[1], deviceProp.maxTexture3D[2]);
  printf("  Max Layered Texture Size (dim) x layers       1D=(%d) x "
         "%d,2D=(%d,%d) x %d\n",
         deviceProp.maxTexture1DLayered[0], deviceProp.maxTexture1DLayered[1],
         deviceProp.maxTexture2DLayered[0], deviceProp.maxTexture2DLayered[1],
         deviceProp.maxTexture2DLayered[2]);
  printf("  Total amount of constant memory               %lu bytes\n",
         deviceProp.totalConstMem);
  printf("  Total amount of shared memory per block:      %lu bytes\n",
         deviceProp.sharedMemPerBlock);
  printf("  Total number of registers available per block:%d\n",
         deviceProp.regsPerBlock);
  printf("  Warp size:                                    %d\n",
         deviceProp.warpSize);
  printf("  maximum number of thread per multiprocesser:  %d\n",
         deviceProp.maxThreadsPerMultiProcessor);
  printf("  maximum number of thread per block:           %d\n",
         deviceProp.maxThreadsPerBlock);
  printf("  maximum size of each dimension of a block:    %d x %d x %d\n",
         deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1],
         deviceProp.maxThreadsDim[2]);
  printf("  maximum size of each dimension of a grid:     %d x %d x %d\n",
         deviceProp.maxGridSize[0], deviceProp.maxGridSize[1],
         deviceProp.maxGridSize[2]);
  printf("  maximum memory pitch                           %lu bytes\n",
         deviceProp.memPitch);

  int nDevices;
  cudaGetDeviceCount(&nDevices);
  for (int i = 0; i < nDevices; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device Number: %d\n", i);
    printf("  Device name: %s\n", prop.name);
    printf("  Memory Clock Rate (KHz): %d\n", prop.memoryClockRate);
    printf("  Memory Bus Width (bits): %d\n", prop.memoryBusWidth);
    printf("  Peak Memory Bandwidth (GB/s): %f\n\n",
           2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8) / 1.0e6);
    printf("  Compute Capability: %d.%d\n", prop.major, prop.minor);
  }

  int deviceId = -1;
  cudaGetDevice(&deviceId);
  int smCount = 0;
  cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, deviceId);

  printf("Number of Streaming Multiprocessors: %d\n", smCount);

  return 0;
}

Output

./gpudevice Starting ...
Detected 1 CUDA Capable device(s)
Device 0:"NVIDIA A100-PCIE-40GB"
  CUDA Driver Version / Runtime Version         12.2  /  12.1
  CUDA Capability Major/Minor version number:   8.0
  Total amount of global memory:                39.39 GBytes
  GPU Clock rate:                               1410 MHz (1.41 GHz)
  CUDA cores:                                   108x0=0
  Memory Bus width:                             5120-bits
  L2 Cache Size:                                41943040 bytes
  Max Texture Dimension Size (x,y,z)            1D=(131072),2D=(131072,65536),3D=(16384,16384,16384)
  Max Layered Texture Size (dim) x layers       1D=(32768) x 2048,2D=(32768,32768) x 2048
  Total amount of constant memory               65536 bytes
  Total amount of shared memory per block:      49152 bytes
  Total number of registers available per block:65536
  Warp size:                                    32
  maximum number of thread per multiprocesser:  2048
  maximum number of thread per block:           1024
  maximum size of each dimension of a block:    1024 x 1024 x 64
  maximum size of each dimension of a grid:     2147483647 x 65535 x 65535
  maximum memory pitch                           2147483647 bytes
Device Number: 0
  Device name: NVIDIA A100-PCIE-40GB
  Memory Clock Rate (KHz): 1215000
  Memory Bus Width (bits): 5120
  Peak Memory Bandwidth (GB/s): 1555.200000

  Compute Capability: 8.0
Number of Streaming Multiprocessors: 108
yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

Could you please share the output of the command ./generate_build_instances.sh [dim] [metric]?

jiangyinzuo commented 1 week ago
jiangyinzuo@kiwi:/usr3/jiangyinzuo_data/GANNS$ ./generate_build_instances.sh 128 l2
dim=$1
dis=$2

DIR="build_instance"

if [ -d "$DIR" ]; then
    rm -rf ${DIR}
fi

mkdir build_instance || true

cp template/*.h build_instance
cp template/*.cu build_instance
cp -R template/macro build_instance
cp -R template/graph_index build_instance
cp template/Makefile build_instance

cd build_instance

sed -i "s/PLACE_HOLDER_DIM/${dim}/g" kernel_local_graph_construction.h

if [ "${dis}" = "l2" ]; then
        make build
elif [ "${dis}" = "cos" ]; then
        make build DISTTYPE=USE_COS_DIST_
elif [ "${dis}" = "ip" ]; then
        make build DISTTYPE=USE_IP_DIST_
fi
nvcc -ccbin g++ -I../../Common  -rdc=true -m64 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_70,code=sm_70 \
-gencode=arch=compute_75,code=sm_75 \
-gencode=arch=compute_75,code=compute_75 \
-std=c++11 build.cu -g -G -o build -Xptxas -v \
-DUSE_L2_DIST_
nvcc warning : incompatible redefinition for option 'compiler-bindir', the last value of this option was used
ptxas info    : 16 bytes gmem
ptxas info    : Function properties for _ZSt4sqrtf
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info    : Function properties for _ZN10KernelPairIfiEC1Ev
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_60'
ptxas info    : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 53 registers, 344 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_60'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 52 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_60'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 150 registers, 400 bytes cmem[0]
ptxas info    : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_60'
ptxas info    : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
    288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 79 registers, 356 bytes cmem[0]
ptxas info    : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_60'
ptxas info    : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 356 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_60'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 144 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_60'
ptxas info    : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 27 registers, 348 bytes cmem[0]
ptxas info    : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_60'
ptxas info    : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 340 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_60'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 368 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_60'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 160 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_60'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 150 registers, 392 bytes cmem[0]
ptxas info    : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __threadfence
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for clock64
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __trap
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for sqrtf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __ffs
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __uAtomicAdd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d9atomicAddEPjj
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : 16 bytes gmem
ptxas info    : Function properties for _ZSt4sqrtf
    8 bytes stack frame, 4 bytes spill stores, 4 bytes spill loads
ptxas info    : Function properties for _ZN10KernelPairIfiEC1Ev
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_61'
ptxas info    : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 53 registers, 344 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_61'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 52 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_61'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 150 registers, 400 bytes cmem[0]
ptxas info    : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_61'
ptxas info    : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
    288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 79 registers, 356 bytes cmem[0]
ptxas info    : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_61'
ptxas info    : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 32 registers, 356 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_61'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 144 registers, 384 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_61'
ptxas info    : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 27 registers, 348 bytes cmem[0]
ptxas info    : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_61'
ptxas info    : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 340 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_61'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 368 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_61'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 160 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_61'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 150 registers, 392 bytes cmem[0]
ptxas info    : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __threadfence
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for clock64
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __trap
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for sqrtf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __ffs
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __uAtomicAdd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d9atomicAddEPjj
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : 16 bytes gmem
ptxas info    : Function properties for _ZSt4sqrtf
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for _ZN10KernelPairIfiEC1Ev
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
    16 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info    : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_70'
ptxas info    : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 57 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_70'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 54 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_70'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 154 registers, 432 bytes cmem[0]
ptxas info    : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_70'
ptxas info    : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
    288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 87 registers, 388 bytes cmem[0]
ptxas info    : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_70'
ptxas info    : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 388 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_70'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 152 registers, 416 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_70'
ptxas info    : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 380 bytes cmem[0]
ptxas info    : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_70'
ptxas info    : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 372 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_70'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 50 registers, 400 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_70'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 440 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_70'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 154 registers, 424 bytes cmem[0]
ptxas info    : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for clock64
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __trap
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for sqrtf
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for __ffs
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
    32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads
ptxas info    : 16 bytes gmem
ptxas info    : Function properties for _ZSt4sqrtf
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for _ZN10KernelPairIfiEC1Ev
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN18cooperative_groups4__v117thread_group_baseILj3EEC2Ev
    16 bytes stack frame, 8 bytes spill stores, 8 bytes spill loads
ptxas info    : Compiling entry function '_Z25DistanceMatrixComputationPfiiP10KernelPairIfiE' for 'sm_75'
ptxas info    : Function properties for _Z25DistanceMatrixComputationPfiiP10KernelPairIfiE
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 57 registers, 376 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_' for 'sm_75'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiES1_iPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 55 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy' for 'sm_75'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 154 registers, 432 bytes cmem[0]
ptxas info    : Compiling entry function '_Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii' for 'sm_75'
ptxas info    : Function properties for _Z15GlobalEdgesSortP10KernelPairIfiEP4EdgePiiii
    288 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 87 registers, 388 bytes cmem[0]
ptxas info    : Compiling entry function '_Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii' for 'sm_75'
ptxas info    : Function properties for _Z21AggragateForwardEdgesP10KernelPairIfiEP4EdgePiiii
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 388 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiPy' for 'sm_75'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiPy
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 152 registers, 416 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii' for 'sm_75'
ptxas info    : Function properties for _Z23ConvertNeighborstoGraphPiP10KernelPairIfiEiii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 29 registers, 380 bytes cmem[0]
ptxas info    : Compiling entry function '_Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi' for 'sm_75'
ptxas info    : Function properties for _Z17LoadFirstSubgraphPSt4pairIfiEP10KernelPairIfiEi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 14 registers, 372 bytes cmem[0]
ptxas info    : Compiling entry function '_Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_' for 'sm_75'
ptxas info    : Function properties for _Z25SortNeighborsonLocalGraphP10KernelPairIfiEiPfiiiS1_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 51 registers, 400 bytes cmem[0]
ptxas info    : Compiling entry function '_Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy' for 'sm_75'
ptxas info    : Function properties for _Z18LocalGraphMergenceP10KernelPairIfiES1_iPfP4EdgeiiiiiiPiiiPy
    24 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 168 registers, 440 bytes cmem[0]
ptxas info    : Compiling entry function '_Z12SearchDevicePfS_PiS0_iiiiiiiS0_' for 'sm_75'
ptxas info    : Function properties for _Z12SearchDevicePfS_PiS0_iiiiiiiS0_
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 154 registers, 424 bytes cmem[0]
ptxas info    : Function properties for _ZN18cooperative_groups4__v112thread_groupC2Ej
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for clock64
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for __trap
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for sqrtf
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for __ffs
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d14__barrier_syncEj
    16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d13__ballot_syncEji
    16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads
ptxas info    : Function properties for _ZN38_INTERNAL_ad40485a_8_build_cu_8d750a5d16__shfl_down_syncEjfji
    32 bytes stack frame, 28 bytes spill stores, 28 bytes spill loads

instance_name="build_${dim}_${dis}"

mv build ${instance_name}
cp ${instance_name} ..
rm ${instance_name}

my nvcc version: nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

It seems that the compiler g++ is defined multiple times. You can resolve this issue by explicitly setting the compiler with --compiler-bindir=/usr/bin/gcc-x. By the way, could you let me know the version of g++ you are using?

jiangyinzuo commented 1 week ago
jiangyinzuo@kiwi:/usr3/jiangyinzuo_data/GANNS$ g++ --version
g++ (conda-forge gcc 11.4.0-13) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
jiangyinzuo commented 1 week ago

https://github.com/yuyuanhang/GANNS/blob/a40c36db303a7da44ca44e2610a1a02ec2ba95c3/graph_index/navigable_small_world.h#L141-L144

Some Cuda APIs like cudaMallocHost(...) return cudaError_t, but the code ignores their return value. Maybe we can check these error?

yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

I'm sorry, I am unsure why this code does not run correctly on your server. I believe it is most likely due to hardware limitations rather than an issue with the code itself. Unfortunately, I cannot provide further suggestions as I am unable to reproduce the issue you are encountering.

jiangyinzuo commented 1 week ago

Thank you for your suggestions! I will try it on my GPU server again, and when I gather more useful debugging information, I will come back for your assistance.

yuyuanhang commented 1 week ago

Hi, @jiangyinzuo,

Okay.