weilanzhikong opened this issue 1 month ago
It uses the Hugging Face https://github.com/huggingface/accelerate library for distributed training. You can read that library's documentation to see how to revise the code further to get FSDP training. This repo is mainly for educational purposes, so it only uses the simplest distributed training functionality provided by accelerate.
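For reference, here is a minimal sketch of what enabling FSDP through accelerate could look like. `FullyShardedDataParallelPlugin` and the state-dict config classes are existing accelerate/PyTorch APIs, but the specific settings shown are assumptions that would need to be adapted to how finetune_distributed.py builds its `Accelerator` and to your accelerate version:

```python
# Sketch only: swap plain data parallelism for FSDP via accelerate's plugin.
# Adapt to how finetune_distributed.py currently constructs its Accelerator.
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import (
    FullStateDictConfig,
    FullOptimStateDictConfig,
)

fsdp_plugin = FullyShardedDataParallelPlugin(
    # When saving, gather the full (unsharded) weights and optimizer state on
    # rank 0 and offload them to CPU so checkpointing does not itself OOM.
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=True),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
# Model, optimizer, and dataloader are then wrapped as usual, e.g.:
# model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
```

Alternatively, running `accelerate config`, choosing FSDP when prompted, and launching with `accelerate launch finetune_distributed.py` lets accelerate build the FSDP plugin from the saved config file without code changes.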
```
rank1: Traceback (most recent call last):
rank1:   File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in <module>
rank1:   File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 182, in train
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/accelerate/optimizer.py", line 172, in step
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
rank1:     out = func(*args, **kwargs)
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/optimizer.py", line 89, in _use_grad
rank1:     ret = func(self, *args, **kwargs)
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 216, in step
rank1:     has_complex = self._init_group(
rank1:   File "/storage/garlin/.env/qwen_vl/lib/python3.10/site-packages/torch/optim/adamw.py", line 155, in _init_group
rank1:     state["exp_avg"] = torch.zeros_like(
rank1: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.02 GiB. GPU 1 has a total capacity of 79.33 GiB of which 923.69 MiB is free. Including non-PyTorch memory, this process has 78.41 GiB memory in use. Of the allocated memory 76.26 GiB is allocated by PyTorch, and 951.92 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587770 closing signal SIGTERM
W1008 10:43:49.390000 140063144896320 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 587771 closing signal SIGTERM
```