yangjianxin1 / Firefly

Firefly: a large-model training tool that supports training Qwen2.5, Qwen2, Yi1.5, Phi-3, Llama3, Gemma, MiniCPM, Yi, Deepseek, Orion, Xverse, Mixtral-8x7B, Zephyr, Mistral, Baichuan2, Llama2, Llama, Qwen, Baichuan, ChatGLM2, InternLM, Ziya2, Vicuna, Bloom, and other large models
5.89k stars 526 forks

ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. #270

Open WeixuanXiong opened 5 months ago

WeixuanXiong commented 5 months ago

When training qwen2 + qlora + unsloth (use_unsloth=true) with torchrun --nproc_per_node=4 train.py --train_args_file train_args/sft/qlora/qwen2-7b-sft-qlora.json, I get the following error: ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example device_map={'':torch.cuda.current_device()}you're training on. Make sure you loaded the model on the correct device using for exampledevice_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}

The parameters in qwen2-7b-sft-qlora.json are set as follows: image

The full error is as follows:

2024-06-20 01:48:35.195 | INFO | __main__:init_components:388 - Train model with sft task
2024-06-20 01:48:35.196 | INFO | __main__:load_sft_dataset:351 - Loading data with UnifiedSFTDataset
2024-06-20 01:48:35.196 | INFO | component.dataset:__init__:19 - Loading data: ./data/dummy_data.jsonl
2024-06-20 01:48:35.197 | INFO | component.dataset:__init__:22 - Use template "qwen" for training
2024-06-20 01:48:35.197 | INFO | component.dataset:__init__:23 - There are 33 data in dataset
2024-06-20 01:48:35.207 | INFO | __main__:main:426 - starting training
Traceback (most recent call last):
  File "/dfs/data/code/Firefly/train.py", line 439, in <module>
    main()
  File "/dfs/data/code/Firefly/train.py", line 427, in main
    train_result = trainer.train()
  File "/dfs/data/hujh9/miniconda/envs/firefly/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "", line 159, in _fast_inner_training_loop
  File "/dfs/data/hujh9/miniconda/envs/firefly/lib/python3.9/site-packages/accelerate/accelerator.py", line 1202, in prepare
    result = tuple(
  File "/dfs/data/hujh9/miniconda/envs/firefly/lib/python3.9/site-packages/accelerate/accelerator.py", line 1203, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/dfs/data/hujh9/miniconda/envs/firefly/lib/python3.9/site-packages/accelerate/accelerator.py", line 1030, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/dfs/data/hujh9/miniconda/envs/firefly/lib/python3.9/site-packages/accelerate/accelerator.py", line 1281, in prepare_model
    raise ValueError(
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example device_map={'':torch.cuda.current_device()}you're training on.
Make sure you loaded the model on the correct device using for exampledevice_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}

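Without unsloth, single-machine multi-GPU training works normally, and with unsloth, single-machine single-GPU training also works. The error appears only when unsloth is combined with multiple GPUs. What is the cause?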
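For reference, the per-process device placement that the error message points to can be sketched roughly as below. This is only an illustration, assuming the job is launched by torchrun (which sets the LOCAL_RANK environment variable for each process). The helper name per_rank_device_map is hypothetical and not part of Firefly, and nothing in this thread establishes that this placement resolves the error when unsloth is enabled.

```python
import os


def per_rank_device_map(env=None):
    """Hypothetical helper: build a device_map that puts the whole model
    on the GPU belonging to the current process."""
    env = os.environ if env is None else env
    # torchrun sets LOCAL_RANK per process; fall back to 0 for a plain
    # single-process launch.
    local_rank = int(env.get("LOCAL_RANK", "0"))
    # The empty-string key stands for "the entire model"; the value is
    # the index of the GPU to load it on.
    return {"": local_rank}


if __name__ == "__main__":
    print(per_rank_device_map())
```

The returned dict would be passed as the device_map argument when loading the quantized model, for example to from_pretrained in transformers.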

yangjianxin1 commented 5 months ago

unsloth currently supports only single-GPU training.
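Given that limitation, a training script could fail early with a clearer message instead of reaching the accelerate error. The following is a minimal sketch, not Firefly's actual code: the function name check_unsloth_world_size is hypothetical, and it assumes the launcher (torchrun) exports WORLD_SIZE as the total number of processes.

```python
import os


def check_unsloth_world_size(use_unsloth, env=None):
    """Hypothetical guard: reject unsloth when more than one process is used.

    Returns the detected world size so the caller can log it.
    """
    env = os.environ if env is None else env
    # torchrun exports WORLD_SIZE; a plain `python train.py` run has none,
    # which is treated here as a single process.
    world_size = int(env.get("WORLD_SIZE", "1"))
    if use_unsloth and world_size > 1:
        raise ValueError(
            f"use_unsloth=true was set with WORLD_SIZE={world_size}; "
            "unsloth is reported to support single-GPU training only."
        )
    return world_size


if __name__ == "__main__":
    # Example: check the current environment before building the model.
    print(check_unsloth_world_size(use_unsloth=True))
```

To stay within a single process while keeping use_unsloth=true, the same command can be launched with --nproc_per_node=1.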