Open · qu0laz opened this issue 2 years ago
Transferring to transformers module repository
Can confirm this issue. Using an RTX 3090, I could fix it inside the container by reinstalling torch with --extra-index-url https://download.pytorch.org/whl/cu116
root@C.5407541:/app$ ENABLE_CUDA=1 NVIDIA_VISIBLE_DEVICES=all uvicorn app:app --host 0.0.0.0 --port 8080
INFO: Started server process [382]
INFO: Waiting for application startup.
INFO: CUDA_CORE set to cuda:0
/usr/local/lib/python3.9/site-packages/torch/cuda/__init__.py:146: UserWarning:
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [382]
root@C.5407541:/app$ pip3 install --upgrade torch==1.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu116
Requirement already satisfied: torch==1.12.0 in /usr/local/lib/python3.9/site-packages (1.12.0)
Collecting torch==1.12.0
Downloading https://download.pytorch.org/whl/cu116/torch-1.12.0%2Bcu116-cp39-cp39-linux_x86_64.whl (1904.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 GB 1.1 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.9/site-packages (from torch==1.12.0) (4.4.0)
Installing collected packages: torch
Attempting uninstall: torch
Found existing installation: torch 1.12.0
Uninstalling torch-1.12.0:
Successfully uninstalled torch-1.12.0
Successfully installed torch-1.12.0+cu116
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: You are using pip version 22.0.4; however, version 22.3 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
root@C.5407541:/app$ ENABLE_CUDA=1 NVIDIA_VISIBLE_DEVICES=all uvicorn app:app --host 0.0.0.0 --port 8080
INFO: Started server process [413]
INFO: Waiting for application startup.
INFO: CUDA_CORE set to cuda:0
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
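If you would rather not patch a running container, the same fix can be baked into a derived image so it survives restarts. A minimal Dockerfile sketch, assuming one of the semitechnologies/transformers-inference tags as the base (the tag below is an example; substitute whichever t2v-transformers image you actually run):

# Derive from the inference image and swap in the CUDA 11.6 PyTorch wheel
# (base tag is an example; use the image from your docker-compose file)
FROM semitechnologies/transformers-inference:sentence-transformers-msmarco-distilbert-base-v3
RUN pip3 install --upgrade torch==1.12.0 --extra-index-url https://download.pytorch.org/whl/cu116

Build it with docker build -t transformers-inference-cu116 . and point the t2v-transformers service at the new tag.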
Is there any progress on this, or an easy way to fix it without forking the repo and pinning individual PyTorch and CUDA versions? I am running into the same issue with my RTX card.
I'm also hitting this issue in 2024, except with an RTX 5000-series card. NVIDIA drivers are set up correctly, and the hardware is seen and used by the Ollama container. Since the topic is still open, is there a recommended solution or workaround?
t2v-transformers | RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
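That particular error usually means the container itself has no GPU access, even when the host drivers are fine. A compose sketch of the GPU reservation, assuming the NVIDIA Container Toolkit is installed on the host (service name and image tag are illustrative):

services:
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-msmarco-distilbert-base-v3
    environment:
      ENABLE_CUDA: "1"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Without a device reservation like this (or --gpus all on a plain docker run), torch inside the container cannot see the driver at all, which matches the RuntimeError above.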
Hello,
I just started working with Weaviate and was able to successfully run docker compose for the CPU version; however, I could not get the GPU version to run (link to introduction article).
When attempting to run it I get errors from the transformers containers, for example:
This is followed by blocks of
"POST /vectors/ HTTP/1.1" 500 Internal Server Error
and at the very end:
gpu-newspublications-1 | {'error': [{'message': 'fail with status 500: CUDA error: no kernel image is available for execution on the device'}]}
I am able to successfully run other GPU docker containers, such as
docker run -it --rm --gpus all ubuntu nvidia-smi
without issue. The important part of the output is:
NVIDIA-SMI 510.85.02 | Driver Version: 510.85.02 | CUDA Version: 11.6
This is running on a fresh install of Ubuntu 20.04, which comes bundled with Python 3.8; this tells me the error lies with the PyTorch build inside the image. It is all running on bare metal, with no virtualization outside of the docker containers from the article.

From what I can tell, the PyTorch in the image only supports compute capabilities up to sm_70, and based on this article that means it is restricted to older GPUs. In a cloud instance where you can select the GPU and use an older one like a P4, this makes sense; in a self-hosted environment it is more difficult.
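To confirm the mismatch, this one-liner can be run inside the transformers container (torch.cuda.get_device_capability and torch.cuda.get_arch_list are standard PyTorch calls; the output should show the GPU's sm_ value missing from the compiled list):

python3 -c "import torch; print('gpu: sm_%d%d' % torch.cuda.get_device_capability(0)); print('compiled for:', torch.cuda.get_arch_list())"

On an RTX 3090 this reports sm_86, while the bundled build (per the warning quoted earlier in this thread) lists only sm_37 through sm_70, which is exactly why the "no kernel image is available" error appears.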
Have you encountered any issues with newer GPUs and the version of PyTorch built into the images in the tutorial? Is there a known workaround for this?
Thanks in advance!