xxw1995 / chatglm3-finetune

The easiest, zero-barrier-to-entry chatglm3 & agent & langchain project

FineTune CUDA out of memory #3

Open freecow opened 10 months ago

freecow commented 10 months ago

(chatglm3-finetune) root@g101:/data/ChatGLM3/chatglm3-finetune# python finetune.py --dataset_path ./alpaca --lora_rank 4 --per_device_train_batch_size 1 --gradient_accumulation_steps 1 --max_steps 52000 --save_steps 1000 --save_total_limit 20 --learning_rate 1e-4 --remove_unused_columns false --logging_steps 50 --output_dir output

The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|████████████████████████████| 7/7 [00:08<00:00, 1.22s/it]

Traceback (most recent call last):
  File "/data/ChatGLM3/chatglm3-finetune/finetune.py", line 70, in <module>
    main()
  File "/data/ChatGLM3/chatglm3-finetune/finetune.py", line 55, in main
    model = get_peft_model(model, peft_config).to("cuda:1")
  File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  [Previous line repeated 1 more time]
  File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 1; 23.69 GiB total capacity; 22.27 GiB already allocated; 691.69 MiB free; 22.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
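(The allocator hint on the last line of that message can be tried on its own. A minimal sketch of setting it before torch touches the GPU; the 128 MiB split size is an arbitrary starting value, not a recommendation from this thread:

    import os

    # Must be set before the first CUDA allocation. max_split_size_mb only
    # helps the fragmentation case (reserved >> allocated); it cannot free
    # memory the card does not have.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # import only after the variable is set
)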

Jeru2023 commented 10 months ago

Same here, PyTorch reserved too much memory...

Jeru2023 commented 10 months ago

Try modifying finetune.py line 38 to set load_in_8bit to True:

    model = AutoModel.from_pretrained(
        "{your model path}",
        load_in_8bit=True,
        trust_remote_code=True,
        device_map="auto"
    ).cuda()
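(For reference, this is roughly the shape of an 8-bit LoRA setup; a sketch with assumed names, not the repo's exact finetune.py. load_in_8bit needs the bitsandbytes and accelerate packages installed, and prepare_model_for_kbit_training assumes a recent peft version:

    from transformers import AutoModel
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Assumption: "THUDM/chatglm3-6b" stands in for whatever local model path
    # the repo actually uses.
    model = AutoModel.from_pretrained(
        "THUDM/chatglm3-6b",
        load_in_8bit=True,
        trust_remote_code=True,
        device_map="auto",
    )
    model = prepare_model_for_kbit_training(model)  # casts norms, enables input grads

    peft_config = LoraConfig(
        r=4,                                 # matches --lora_rank 4 above
        target_modules=["query_key_value"],  # ChatGLM's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, peft_config)
    # No .to("cuda:1") afterwards: device_map already placed the weights, and
    # moving an 8-bit model with .to() is not supported.
)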

freecow commented 10 months ago

finetune.py line 34: set load_in_8bit to True and delete .half().

Original:

    model = ChatGLMForConditionalGeneration.from_pretrained(
        "model",
        load_in_8bit=False,
        trust_remote_code=False,
        device_map="auto"
    ).half()

Modified:

    model = ChatGLMForConditionalGeneration.from_pretrained(
        "model",
        load_in_8bit=True,
        trust_remote_code=False,
        device_map="auto"
    )

Error message:

    File "/root/miniconda3/envs/chatglm3-finetune/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper__index_select)

xxw1995 commented 10 months ago

Modify the device_map parameter to specify the device: to use a GPU, change device_map="auto" to device_map="cuda"; to use the CPU, change it to device_map="cpu".

Jeru2023 commented 10 months ago

> Modify the device_map parameter to specify the device: to use a GPU, change device_map="auto" to device_map="cuda"; to use the CPU, change it to device_map="cpu".

In my case, device_map needs to be set to "cuda:0" instead of "cuda".
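(If the card with free memory is the second physical GPU, remapping visibility avoids hard-coding cuda:1; a small sketch of that approach, assumed rather than taken from the thread:

    import os

    # Expose only physical GPU 1; inside this process it then appears as
    # cuda:0, so device_map="cuda:0" (or plain "cuda") lands on that card.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    import torch
    print(torch.cuda.device_count())  # -> 1
)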

chenmins commented 10 months ago

Just tested it; it needs 26 GB of GPU memory.

Mon Oct 30 11:36:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01    Driver Version: 515.105.01    CUDA Version: 11.7   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:A1:00.0 Off |                    0 |
| N/A   51C    P0   327W / 400W |  26305MiB / 81920MiB |     95%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       713      C   python                          26303MiB |
+-----------------------------------------------------------------------------+

freecow commented 10 months ago

> Just tested it; it needs 26 GB of GPU memory. [nvidia-smi output quoted above]

Then it looks like this isn't something a 24 GB card like the 3090 can run, especially since multi-GPU fine-tuning doesn't seem to work either.

Jeru2023 commented 10 months ago

24 GB is enough. I'm on a single 4090; one epoch takes 10 seconds, which is pretty fast.

Jeru2023 commented 10 months ago

> Just tested it; it needs 26 GB of GPU memory. [nvidia-smi output quoted above]

> Then it looks like this isn't something a 24 GB card like the 3090 can run, especially since multi-GPU fine-tuning doesn't seem to work either.

I tried it on a 3090 today as well; it worked fine.

sukibean163 commented 10 months ago

> Just tested it; it needs 26 GB of GPU memory. [nvidia-smi output quoted above]

> Then it looks like this isn't something a 24 GB card like the 3090 can run, especially since multi-GPU fine-tuning doesn't seem to work either.

> I tried it on a 3090 today as well; it worked fine.

Why am I hitting the same problem on a 24 GB 4090? [two screenshots attached; not preserved in this export]