infinitr0us closed this issue 6 months ago
@infinitr0us Could you please help me test this: build a new image based on our official image:
FROM xprobe/xinference:v0.11.0
RUN pip install torchvision==0.17.1
And then test it?
On my own machine with two GPUs, I can use xinference normally with the above method.
@infinitr0us Could you use this:
docker pull xprobe/xinference:nightly-bug_torchvision_version
to try again?
Thanks a lot for your response. I tried to install torchvision 0.17.1 in the official xinference docker image v0.11.0 via
sudo docker exec xinference pip install torchvision==0.17.1
It seems that at least the installation itself works, but the cudaGetDeviceCount() error is still there.
I am gonna give the nightly image a try and get back to you later.
Oops... it seems that the cudaGetDeviceCount() error is still there. Okay, I will examine the torch library setup in the LocalAI docker image and see if there is any difference.
@ChengjieLi28 I tried with LocalAI again, and it seems that their docker image does not use the local torch environment at all... Now I see the problem: it must be a torch library issue with my CUDA and driver environment. Thanks a lot for spending your time helping me diagnose this. I think if this issue is limited to my machine only, it is just an edge case. Thanks again, and feel free to close it.
Describe the bug
Hi, I was trying to deploy the Docker image v0.11.0 on my machine with a GPU and drivers (CUDA 12.0) installed. However, an error always pops up during the initialization of the docker container:
"/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)"
Actually, I was able to see the hardware information being correctly loaded in the container's Web UI, but I was never able to execute models on the GPU. So I am pretty sure that my GPU and CUDA environment is working (also tested with no issue on the LocalAI docker image). I was wondering if this is a torch library related issue or a Xinference related issue? I would appreciate any help and am willing to provide more logs if needed.
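For context, CUDA error 804 ("forward compatibility was attempted on non supported HW") usually means the CUDA runtime inside the container is newer than what the host's NVIDIA driver supports, and the forward-compatibility package cannot bridge the gap on that GPU. A minimal sketch of the kind of check involved is below; the minimum-driver figures are illustrative samples for Linux, so verify them against NVIDIA's official release notes:

```python
# Rough sketch: does the host driver meet the minimum required by the
# container's CUDA runtime? The minimum-driver values below are illustrative
# samples for Linux -- confirm against NVIDIA's official release notes.
MIN_DRIVER_FOR_CUDA = {
    "11.8": (450, 80, 2),
    "12.0": (525, 60, 13),
    "12.1": (530, 30, 2),
}

def parse_version(v: str) -> tuple:
    """Turn '525.60.13' into (525, 60, 13) for numeric comparison."""
    return tuple(int(part) for part in v.split("."))

def driver_supports(cuda_version: str, driver_version: str) -> bool:
    """True if the installed driver meets the minimum for this CUDA runtime."""
    required = MIN_DRIVER_FOR_CUDA[cuda_version]
    return parse_version(driver_version) >= required

# A 525-series driver is new enough for a CUDA 12.0 runtime ...
print(driver_supports("12.0", "525.60.13"))  # True
# ... but an older 470-series driver is not, which is roughly the
# situation in which error 804 shows up.
print(driver_supports("12.0", "470.199.02"))  # False
```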
To Reproduce
To help us to reproduce this bug, please provide information below:
/opt/conda/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/opt/conda/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev' If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
2024-05-11 21:55:37,671 xinference.core.supervisor 47 INFO Xinference supervisor 0.0.0.0:15251 started
/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
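The undefined-symbol warning from torchvision's image.so is the typical signature of a torch/torchvision version mismatch: each torchvision release is built against exactly one torch release, which is why the maintainer's suggestion pins torchvision 0.17.1. A small sketch of checking such a pairing is below; the table is an illustrative subset of the published compatibility matrix, not the authoritative list:

```python
# Sketch: verify that a torch/torchvision pair matches. The table is an
# illustrative subset of the published compatibility matrix -- check the
# torchvision project's README for the authoritative mapping.
TORCH_TO_TORCHVISION = {
    "2.1.0": "0.16.0",
    "2.2.0": "0.17.0",
    "2.2.1": "0.17.1",
    "2.2.2": "0.17.2",
}

def compatible(torch_version: str, torchvision_version: str) -> bool:
    """True if this torchvision release was built for this torch release."""
    return TORCH_TO_TORCHVISION.get(torch_version) == torchvision_version

# torchvision 0.17.1 is the build that pairs with torch 2.2.1, so pinning
# it against a torch 2.2.1 base avoids the undefined-symbol warning.
print(compatible("2.2.1", "0.17.1"))  # True
# A mismatched pair is what produces the image.so undefined-symbol warning.
print(compatible("2.2.1", "0.16.0"))  # False
```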
Expected behavior
Expected to run the container smoothly with CUDA.
Additional context
Thanks again for your great project. I am willing to provide more info/logs if necessary.