zhangfaen / finetune-Qwen2-VL


Error when training VL-7B on multiple GPUs; GPU memory should be sufficient. Does this training setup put a full copy of the model on each card? #8

Open weilanzhikong opened 1 month ago

weilanzhikong commented 1 month ago

rank1: Traceback (most recent call last):
rank1:   File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in <module>
rank1:   File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 182, in train
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/accelerate/optimizer.py", line 172, in step
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
rank1:     out = func(*args, **kwargs)
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
rank1:     ret = func(self, *args, **kwargs)
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 216, in step
rank1:     has_complex = self._init_group(
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 155, in _init_group
rank1:     state["exp_avg"] = torch.zeros_like(
rank1: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 79.33 GiB of which 923.69 MiB is free. Including non-PyTorch memory, this process has 78.41 GiB memory in use. Of the allocated memory 76.26 GiB is allocated by PyTorch, and 951.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587770 closing signal SIGTERM
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587771 closing signal SIGTERM
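For context on the question in the title: with accelerate's plain data-parallel setup, every rank keeps its own full copy of the weights, the gradients, and the AdamW state, and the traceback fails exactly where AdamW allocates its `exp_avg` buffer. Below is a rough back-of-envelope sketch; the parameter count and dtype choices are illustrative assumptions, not values read from this run.

```python
# Rough per-rank memory for plain data-parallel fine-tuning of a ~7B model.
# Illustrative assumptions only: every rank holds the full weights, the full
# gradients, and both AdamW buffers (exp_avg, exp_avg_sq); activations and
# CUDA overhead are not counted.
PARAMS = 7e9
GIB = 2**30

def per_rank_gib(weight_bytes, grad_bytes, optim_bytes):
    weights = PARAMS * weight_bytes / GIB
    grads = PARAMS * grad_bytes / GIB
    optim = PARAMS * optim_bytes * 2 / GIB  # two AdamW buffers per parameter
    return weights, grads, optim

for label, (wb, gb, ob) in {"bf16 everywhere": (2, 2, 2),
                            "fp32 everywhere": (4, 4, 4)}.items():
    w, g, o = per_rank_gib(wb, gb, ob)
    print(f"{label}: {w:.0f} + {g:.0f} + {o:.0f} = {w + g + o:.0f} GiB per GPU")

# bf16 everywhere: ~13 + 13 + 26 =  ~52 GiB (activations come on top)
# fp32 everywhere: ~26 + 26 + 52 = ~104 GiB (already over an 80 GiB card)
```

Either way, replicating the model and optimizer state on every card leaves little or no headroom on an 80 GiB GPU, which is consistent with the 1.02 GiB `exp_avg` allocation failing.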

zhangfaen commented 1 month ago

It uses the Hugging Face accelerate library (https://github.com/huggingface/accelerate) for distributed training. You may read its documentation to see how to revise the code further to get FSDP training. This repo is mainly for education purposes, so it just uses the simplest distributed training setup provided by accelerate.
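For anyone hitting the same OOM, here is a minimal sketch (not this repo's code) of passing an FSDP plugin to the Accelerator so parameters, gradients, and optimizer state are sharded across ranks instead of replicated. The plugin arguments and the `Qwen2VLDecoderLayer` wrap class are assumptions to verify against your accelerate and transformers versions.

```python
# Minimal sketch (not this repo's code) of switching accelerate from plain
# replication to FSDP so weights, gradients, and AdamW state are sharded
# across the GPUs. Argument names and the wrapped layer class are assumptions;
# check the accelerate FSDP docs for the exact options of your version.
import functools

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
# Assumed transformer block class of Qwen2-VL; verify against your
# transformers version.
from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VLDecoderLayer

fsdp_plugin = FullyShardedDataParallelPlugin(
    # FULL_SHARD splits parameters, gradients, and optimizer state per rank.
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    # Wrap each decoder layer so sharding/gathering happens layer by layer.
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={Qwen2VLDecoderLayer},
    ),
)

accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)

# Build model/optimizer/dataloader as in finetune_distributed.py, then let
# prepare() apply the FSDP wrapping:
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```

With FULL_SHARD each GPU keeps only its shard of the model and optimizer state resident, so the per-rank footprint estimated above drops roughly by the number of GPUs. The same setup can also be configured interactively with `accelerate config` and run with `accelerate launch`.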