zjunlp / KnowLM

An Open-sourced Knowledgeable Large Language Model Framework.
http://knowlm.zjukg.cn/
MIT License

How to resolve out of memory #100

Closed: jimmy-walker closed this issue 10 months ago

jimmy-walker commented 11 months ago

I have 8 NVIDIA GeForce RTX 2080 Ti GPUs. I used the following command to run the project:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/generate_lora.py --base_model /data2/user/LLM/knowlm-13b-zhixi --multi_gpu --allocate [8,8,8,8,8,8,8,8] --run_ie_cases

But I faced the error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total
capacity; 7.83 GiB already allocated; 1.94 GiB free; 8.00 GiB allowed; 8.00 GiB reserved in
total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to
avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It's so weird: 1.94 GiB is reported free, so why does allocating 16.00 MiB fail?

My environment is as follows:

NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7
Python 3.9.13
torch  1.13.1+cu117

Any help is appreciated.
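
(For reference, the error text itself suggests tuning the caching allocator. A minimal sketch of acting on that hint, where max_split_size_mb:128 is an arbitrary example value, not a tuned setting:)

import os

# Must be set before the first CUDA allocation; the caching allocator reads
# PYTORCH_CUDA_ALLOC_CONF once, on first CUDA use. 128 MiB is an arbitrary example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable so the allocator can pick it up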

MikeDean2367 commented 10 months ago

You can try the following two methods:

  1. Replace float16 on line 122 with bfloat16 (a guarded version is sketched after this list).
  2. Try not specifying --allocate.
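
For illustration, a minimal sketch of a guarded dtype choice. torch.cuda.is_bf16_supported() is standard PyTorch; whether generate_lora.py performs such a check is an assumption:

import torch

# bfloat16 generally requires Ampere (sm_80) or newer; the RTX 2080 Ti is
# Turing (sm_75), so this check would fall back to float16 on that card.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16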

Please let me know if there are still issues :)

jimmy-walker commented 10 months ago

Thanks for your reply. @MikeDean2367 I have tried both of your suggestions, but I still get errors.

  1. I tried bfloat16, but it raises: AssertionError: bfloat16 is not supported on your device. Please set dtype to float16 or float32.

  2. I tried not specifying --allocate, but it still runs out of memory:

OutOfMemoryError: CUDA out of memory. Tried to allocate 82.00 MiB (GPU 0; 10.76 GiB total capacity; 9.63 GiB
already allocated; 75.56 MiB free; 9.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF

MikeDean2367 commented 10 months ago

Hello, we have fixed this issue. When setting --allocate, please make sure the first GPU is assigned less memory than the others, as described in the link.
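
For illustration, assuming the script delegates model placement to transformers/accelerate (a common way an --allocate-style option is implemented, not necessarily what generate_lora.py actually does), a per-GPU cap like [2,10,10,10] would correspond to a max_memory map along these lines:

from transformers import AutoModelForCausalLM

# Hypothetical translation of --allocate [2,10,10,10]: GPU 0 is capped lower
# because it also holds activations and the KV cache during generation.
max_memory = {0: "2GiB", 1: "10GiB", 2: "10GiB", 3: "10GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "/data2/user/LLM/knowlm-13b-zhixi",
    device_map="auto",      # let accelerate shard the model across GPUs
    max_memory=max_memory,  # respect the per-device caps above
)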

Please let me know if there are still issues :)

jimmy-walker commented 10 months ago

Thanks for your reply. @MikeDean2367 I read the link you provided and changed the command as follows, so that the first GPU takes only a little memory compared with the others. But it still outputs the error:

CUDA_VISIBLE_DEVICES=0,1,2,3 python examples/generate_lora.py --base_model /data2/user/LLM/knowlm-13b-zhixi --multi_gpu --allocate [2,10,10,10] --run_ie_cases

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 1.84 GiB
already allocated; 7.94 GiB free; 2.00 GiB allowed; 1.99 GiB reserved in total by PyTorch) If reserved memory
is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF

MikeDean2367 commented 10 months ago

Did you pull the latest code from the repository? We have updated the code :)

jimmy-walker commented 10 months ago

Yes, you are right. The error is gone with your updated code. Thank you very much. @MikeDean2367