wilicc / gpu-burn

Multi-GPU CUDA stress test
BSD 2-Clause "Simplified" License

Docker run fails: docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] #93

Open hapasa opened 10 months ago

hapasa commented 10 months ago

Background: I'm trying to verify that my machine is stable with any NVIDIA graphics card before upgrading to a 4060 Ti 16 GB. My old NVIDIA 1050 Ti "kept falling off the bus" according to dmesg, so I'm now testing with an even older card.

Ubuntu 22.04, kernel 5.15.0-89-generic

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
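(The failing command isn't shown here; it is presumably the GPU-enabled run from the README, something like the following, which fails with exactly this error whenever Docker has no NVIDIA runtime registered.)

$ docker run --rm --gpus all gpu_burn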

The build went seemingly fine:

$ docker build -t gpu_burn .
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon    190kB
Step 1/11 : ARG CUDA_VERSION=11.8.0
Step 2/11 : ARG IMAGE_DISTRO=ubi8
Step 3/11 : FROM nvidia/cuda:${CUDA_VERSION}-devel-${IMAGE_DISTRO} AS builder
11.8.0-devel-ubi8: Pulling from nvidia/cuda
94343313ec15: Pull complete 
9fb272588c1d: Pull complete 
b9797304348b: Pull complete 
5e33c7d9d941: Pull complete 
2e545d869d81: Pull complete 
3b6f4fdd4835: Pull complete 
186b2cf099be: Pull complete 
bb9948097bcc: Pull complete 
665cacaea78b: Pull complete 
a8b41fa5efb1: Pull complete 
Digest: sha256:07f78c377ad928da58a9da192a4ca978c4050b53c66f6df9461d20cba80db990
Status: Downloaded newer image for nvidia/cuda:11.8.0-devel-ubi8
 ---> 6d4df348e537
Step 4/11 : WORKDIR /build
 ---> Running in c01d88a45cf5
Removing intermediate container c01d88a45cf5
 ---> b3823e8592df
Step 5/11 : COPY . /build/
 ---> 12ed4440781c
Step 6/11 : RUN make
 ---> Running in 98b35750f935
g++  -O3 -Wno-unused-result -I/usr/local/cuda/include -std=c++11 -c gpu_burn-drv.cpp
PATH="/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin::." /usr/local/cuda/bin/nvcc  -I/usr/local/cuda/include -arch=compute_50 -ptx compare.cu -o compare.ptx
g++ -o gpu_burn gpu_burn-drv.o -O3  -lcuda -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs -L/usr/local/cuda/lib -L/usr/local/cuda/lib/stubs -Wl,-rpath=/usr/local/cuda/lib64 -Wl,-rpath=/usr/local/cuda/lib -lcublas -lcudart
Removing intermediate container 98b35750f935
 ---> 0199671c5b9d
Step 7/11 : FROM nvidia/cuda:${CUDA_VERSION}-runtime-${IMAGE_DISTRO}
11.8.0-runtime-ubi8: Pulling from nvidia/cuda
94343313ec15: Already exists 
9fb272588c1d: Already exists 
b9797304348b: Already exists 
5e33c7d9d941: Already exists 
2e545d869d81: Already exists 
3b6f4fdd4835: Already exists 
186b2cf099be: Already exists 
bb9948097bcc: Already exists 
665cacaea78b: Already exists 
Digest: sha256:b3a3629fd70a0af16e895a832a85d3c54b62d367d2d9a695a0e9b34a74627183
Status: Downloaded newer image for nvidia/cuda:11.8.0-runtime-ubi8
 ---> f2b81eaaed01
Step 8/11 : COPY --from=builder /build/gpu_burn /app/
 ---> 5ded3ba95d1e
Step 9/11 : COPY --from=builder /build/compare.ptx /app/
 ---> d43d5fe967ce
Step 10/11 : WORKDIR /app
 ---> Running in 790d35a87e2b
Removing intermediate container 790d35a87e2b
 ---> 55179c022189
Step 11/11 : CMD ["./gpu_burn", "60"]
 ---> Running in cac5f5969349
Removing intermediate container cac5f5969349
 ---> 0b882c79e890
Successfully built 0b882c79e890
Successfully tagged gpu_burn:latest
$ nvidia-smi
Sat Dec  2 12:50:42 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 750 Ti      On  | 00000000:09:00.0  On |                  N/A |
| 33%   32C    P8               1W /  46W |    451MiB /  2048MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1509      G   /usr/lib/xorg/Xorg                          205MiB |
|    0   N/A  N/A      2564      G   /usr/bin/kwin_x11                            48MiB |
|    0   N/A  N/A      2628      G   /usr/bin/plasmashell                         74MiB |
|    0   N/A  N/A      3017      G   /usr/bin/plasma-discover                     22MiB |
|    0   N/A  N/A     27992      G   ...0424349,12935558332982127916,262144       90MiB |
+---------------------------------------------------------------------------------------+
hapasa commented 10 months ago

Note that I was able to pull the repo and build gpu_burn natively without problems; it is now running fine.
So maybe the problem is just something related to Docker and permissions?
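For reference, the native (non-Docker) path is just the Makefile build, i.e. the same steps the Dockerfile runs above, roughly:

$ git clone https://github.com/wilicc/gpu-burn.git
$ cd gpu-burn
$ make
$ ./gpu_burn 60

That runs the stress test for 60 seconds directly on the host, bypassing Docker entirely.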

yankee14 commented 7 months ago

I have the same issue

tt2468 commented 6 months ago

It looks like the README is missing some prerequisites about what Docker needs in order to run containers with GPU access: https://stackoverflow.com/a/58432877
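The short version: the "could not select device driver" error means Docker has no NVIDIA runtime registered, and the fix is installing the NVIDIA Container Toolkit on the host. A minimal sketch for Ubuntu 22.04 (after adding NVIDIA's apt repository as described in their install guide; package and tool names follow that guide):

$ sudo apt-get update
$ sudo apt-get install -y nvidia-container-toolkit
$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker

Then verify that containers can see the GPU before retrying gpu_burn:

$ docker run --rm --gpus all nvidia/cuda:11.8.0-runtime-ubi8 nvidia-smi
$ docker run --rm --gpus all gpu_burn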