xuebinqin / DIS

This is the repo for our new project Highly Accurate Dichotomous Image Segmentation
Apache License 2.0

Assistance with Running DIS on NVIDIA RTX A5000 GPU #96

Closed mlozo closed 7 months ago

mlozo commented 7 months ago

Hi DIS Project Team, I am seeking assistance with running the DIS model on a GPU. I am currently using an NVIDIA RTX A5000 Laptop GPU with 16 GB of GPU memory. Following the instructions, I set up a conda environment named 'pytorch18'. However, when I try to train the model on my own images using python train_valid_inference_main.py (modified for my data), I get a compatibility warning about my GPU and the installed PyTorch version. The exact message is:

--- build model ---
/home/ml/anaconda3/envs/pytorch18/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning: 
NVIDIA RTX A5000 Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
If you want to use the NVIDIA RTX A5000 Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
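
For readers who hit the same warning, here is a minimal sketch of the comparison it describes: the GPU reports compute capability sm_86, while the installed build only lists targets up to sm_75. The helper names (parse_sm_archs, is_capability_covered) are hypothetical, and the "same major version" rule is a simplifying assumption of this sketch, not necessarily PyTorch's exact check.

```python
import re


def parse_sm_archs(arch_list):
    """Extract (major, minor) pairs from entries such as 'sm_75'.

    Entries that are not plain 'sm_XY' (for example 'compute_37') are skipped.
    """
    caps = []
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)(\d)", arch)
        if match:
            caps.append((int(match.group(1)), int(match.group(2))))
    return caps


def is_capability_covered(capability, arch_list):
    """Assumed rule for this sketch: a device counts as covered when the
    build contains at least one sm_ target with the same major version."""
    major, _minor = capability
    return any(sm_major == major for sm_major, _ in parse_sm_archs(arch_list))


# Values copied from the warning above.
BUILD_ARCHS = "sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37".split()
DEVICE_CAPABILITY = (8, 6)  # sm_86

print(is_capability_covered(DEVICE_CAPABILITY, BUILD_ARCHS))  # False
```

On a machine with PyTorch installed, the live values should be available from torch.cuda.get_arch_list() and, once a GPU is visible, torch.cuda.get_device_capability().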

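Following a suggestion from this PyTorch forum thread (https://discuss.pytorch.org/t/nvidia-nvidia-rtx-a5000-with-cuda-capability-sm-86-is-not-compatible-with-the-current-pytorch-installation/150593), I updated my installation with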

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

Now,

print(torch.__version__)

displays 1.13.1, and

print(torch.cuda.is_available())

returns False.

As a result, the model starts training on the CPU, leaving the GPU idle and unused.
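
One way to narrow down a False result is to look at which CUDA version, if any, the installed PyTorch build was compiled against. The sketch below is only illustrative: diagnose and diagnose_local_install are hypothetical helper names, and the messages are guesses about likely causes rather than a definitive diagnosis. It relies on torch.version.cuda being None for builds without CUDA support.

```python
def diagnose(torch_version, build_cuda, cuda_available):
    """Return a short guess about why the GPU is or is not being used.

    torch_version  -- value of torch.__version__
    build_cuda     -- value of torch.version.cuda (None for builds without CUDA)
    cuda_available -- value of torch.cuda.is_available()
    """
    if cuda_available:
        return f"PyTorch {torch_version} (CUDA {build_cuda}) can see a GPU."
    if build_cuda is None:
        return (f"PyTorch {torch_version} was built without CUDA; "
                "a CUDA-enabled build has to be installed instead.")
    return (f"PyTorch {torch_version} was built for CUDA {build_cuda} but cannot "
            "see a GPU; the driver or environment is worth checking.")


def diagnose_local_install():
    """Run the diagnosis against the PyTorch installed in this environment."""
    import torch  # imported lazily so diagnose() works without PyTorch
    return diagnose(torch.__version__, torch.version.cuda, torch.cuda.is_available())


# Hypothetical inputs: the thread does not say what torch.version.cuda reported.
print(diagnose("1.13.1", None, False))
```

Calling diagnose_local_install() inside the conda environment reports on the actual installation.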

I am relatively new to the field of machine learning and have been unable to find a solution to make the model train using the GPU. If I cannot resolve this, the training process will take an excessively long time.

For additional context, here are the outputs of

nvidia-smi && nvcc -V
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000 Laptop GPU    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8              17W / 115W |     10MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1816      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:16:49_PDT_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

I would appreciate any guidance or suggestions you can provide to resolve this issue and successfully run the model on my GPU. Thank you for your time and assistance.

xuebinqin commented 7 months ago

You can check which CUDA versions each PyTorch release supports on PyTorch's official website: https://pytorch.org/. Your current CUDA version (12.2) does not seem to be compatible with the PyTorch build you have installed.


-- Xuebin Qin PhD Department of Computing Science University of Alberta, Edmonton, AB, Canada Homepage: https://xuebinqin.github.io/

mlozo commented 7 months ago

Hey @xuebinqin,

Thank you for your guidance - I've successfully managed to get everything up and running as it should. Now, it's working like a fast train. I didn't use the provided conda environment from pytorch18.yml, as updating the libraries to the required versions was almost a nightmare. I constantly faced version conflicts, and the resolution process (finding the correct versions) took ages.

I created a clean conda environment and, looking at the library dependency list, installed each one starting with Python and PyTorch. Everything installed smoothly, and now the model is also processing via the GPU.

Thanks again for your help!