xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0

BUG: Unexpected error from cudaGetDeviceCount() with official docker image v0.11.0 #1477

Closed: infinitr0us closed this issue 6 months ago

infinitr0us commented 6 months ago

Describe the bug

Hi, I was trying to deploy the Docker image v0.11.0 on my machine with a GPU and drivers (CUDA 12.0) installed. However, an error always appears during initialization of the Docker container: "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"

Actually, I was able to see the hardware information correctly loaded in the container Web UI (screenshot attached). But I was never able to execute models on the GPU. So I am pretty sure that my GPU and CUDA environment is working (also tested with no issues on the LocalAI Docker image). I was wondering whether this is a torch library issue or a Xinference-related issue? I would appreciate any help and am willing to provide more logs if needed.
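For anyone hitting the same warning, here is a minimal diagnostic helper (hypothetical, not part of Xinference) that surfaces torch's CUDA state without crashing the app; it takes the imported torch module as a parameter so it stays testable without a GPU:

```python
def cuda_status(torch_mod):
    """Return a short diagnostic string describing torch's CUDA state.

    Hypothetical helper: accepts the imported torch module as an argument
    so the logic can be exercised without a GPU or torch installed.
    """
    try:
        if not torch_mod.cuda.is_available():
            return "cuda-unavailable"
        return f"cuda-ok: {torch_mod.cuda.device_count()} device(s)"
    except RuntimeError as exc:  # e.g. Error 804 raised during lazy CUDA init
        return f"cuda-init-error: {exc}"


if __name__ == "__main__":
    try:
        import torch
        print(cuda_status(torch))
    except ImportError:
        print("torch not installed")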

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Official Docker image: xprobe/xinference:v0.11.0; also tried v0.10.x, v0.9.x, and v0.8.x with no luck
  2. CUDA version: 12.0; Driver version: 525.105.17 (screenshot attached)
  3. I was able to deploy containers using CUDA successfully (e.g., LocalAI official docker image)
  4. Full stack of the error:

```
/opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-05-11 21:55:37,671 xinference.core.supervisor 47 INFO Xinference supervisor 0.0.0.0:15251 started
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
```

Expected behavior

Expected to run the container smoothly with CUDA.

Additional context

Thanks again for your great project. I am willing to provide more info/logs if necessary.

ChengjieLi28 commented 6 months ago

@infinitr0us Could you please help me test this: build a new image based on our official image:

```dockerfile
FROM xprobe/xinference:v0.11.0
RUN pip install torchvision==0.17.1
```

And then test it?

On my own machine with two GPUs, I can use Xinference normally with the above method.
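The pin above points at a torch/torchvision version mismatch: torchvision wheels are built against one specific torch release, and a mismatched pair can leave torchvision's C++ extension with unresolved symbols (the `undefined symbol` warning in the log). A small sketch of such a compatibility check; the version table is an assumption covering only a few recent releases, so consult the official compatibility matrix for anything else:

```python
from typing import Optional

# Assumed pairings for a few recent releases (verify against the
# official torch/torchvision compatibility matrix).
TORCH_TO_TORCHVISION = {
    "2.2.1": "0.17.1",
    "2.2.0": "0.17.0",
    "2.1.2": "0.16.2",
}


def expected_torchvision(torch_version: str) -> Optional[str]:
    """Return the torchvision version paired with a given torch version."""
    # Strip local build suffixes such as "+cu121" before the lookup.
    base = torch_version.split("+", 1)[0]
    return TORCH_TO_TORCHVISION.get(base)
```

For example, `expected_torchvision("2.2.1+cu121")` returns `"0.17.1"`, which matches the pin in the Dockerfile above.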

ChengjieLi28 commented 6 months ago

@infinitr0us Could you use this:

docker pull xprobe/xinference:nightly-bug_torchvision_version

to try again?

infinitr0us commented 6 months ago

@infinitr0us Could you please help me test this: build a new image based on our official image:

```dockerfile
FROM xprobe/xinference:v0.11.0
RUN pip install torchvision==0.17.1
```

And then test it?

On my own machine with two GPUs, I can use Xinference normally with the above method.

Thanks a lot for your response. I tried installing torchvision 0.17.1 in the official Xinference Docker image v0.11.0 via sudo docker exec xinference pip install torchvision==0.17.1, and the installation itself went fine, but the cudaGetDeviceCount() error is still there (screenshot attached). I am going to give the nightly image a try and get back to you later.

infinitr0us commented 6 months ago

@infinitr0us Could you use this:

docker pull xprobe/xinference:nightly-bug_torchvision_version

to try again?

Oops, it seems the cudaGetDeviceCount() error is still there. Okay, I will examine the torch library setup in the LocalAI Docker image and see if there is any difference.

infinitr0us commented 6 months ago

@ChengjieLi28 I tried with LocalAI again, and it seems that their Docker image does not use a local torch environment at all (screenshot attached). Now I see the problem: it must be a torch library issue with my CUDA and driver environment. Thanks a lot for spending your time helping me diagnose this. If this issue is limited to my machine only, it is just an edge case. Thanks again, and feel free to close it.