taofennanhai opened this issue 1 year ago
Did you get this working by directly running python train_chatglm_all.py, or did it need other settings?
The error I get is: ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.
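For context, this ValueError comes straight from PyTorch's DistributedDataParallel constructor: passing device_ids=[0] is only valid when every parameter of the wrapped module sits on a single GPU, whereas a model split across cards with a device_map spans several GPUs. A minimal sketch of the distinction, using an illustrative helper that is not part of this repo:

import torch.nn as nn

def wrap_for_training(model: nn.Module, local_rank: int) -> nn.Module:
    # Decide how to build DDP based on where the parameters actually live.
    param_devices = {p.device for p in model.parameters()}
    if len(param_devices) == 1:
        # Single-GPU module: the usual data-parallel wrapping is fine.
        return nn.parallel.DistributedDataParallel(
            model, device_ids=[local_rank], output_device=local_rank
        )
    # Parameters spread over several GPUs (model parallel): device_ids and
    # output_device must be omitted, otherwise PyTorch raises exactly the
    # ValueError quoted above.
    return nn.parallel.DistributedDataParallel(model)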
The device_map you modified is probably wrong; please read my readme.md carefully.
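One quick way to see which device_map actually got applied is to print model.hf_device_map after loading. A minimal sketch, assuming the weights sit in the local thuglm folder as in this repo and that accelerate is installed:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "thuglm",               # local ChatGLM-6B folder used by this project
    trust_remote_code=True,
    device_map="auto",      # or the hand-written dict from the readme
)
print(model.hf_device_map)  # every transformer block and the GPU it landed on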
Thanks for the reply.
I didn't modify the device_map; the problem turned out to be that the code under your thuglm folder wasn't being used. After copying the latest model into thuglm, a new problem appears.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
Traceback (most recent call last):
File "/home/87oo/data/workspace/zero_nlp/Chatglm6b_ModelParallel/train_model_all.py", line 86, in <module>
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 678, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
return cls._from_pretrained(
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1988, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/87oo/.cache/huggingface/modules/transformers_modules/chatglm-6b-zero-nlp/tokenization_chatglm.py", line 215, in __init__
self.sp_tokenizer = SPTokenizer(vocab_file)
File "/home/87oo/.cache/huggingface/modules/transformers_modules/chatglm-6b-zero-nlp/tokenization_chatglm.py", line 35, in __init__
self.text_tokenizer = self._build_text_tokenizer(encode_special_tokens=False)
File "/home/87oo/.cache/huggingface/modules/transformers_modules/chatglm-6b-zero-nlp/tokenization_chatglm.py", line 68, in _build_text_tokenizer
self._configure_tokenizer(
File "/home/87oo/.cache/huggingface/modules/transformers_modules/chatglm-6b-zero-nlp/tokenization_chatglm.py", line 64, in _configure_tokenizer
text_tokenizer.refresh()
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/icetk/text_tokenizer.py", line 31, in refresh
self.sp.Load(model_proto=self.proto.SerializeToString())
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/sentencepiece/__init__.py", line 366, in Load
return self.LoadFromSerializedProto(model_proto)
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/sentencepiece/__init__.py", line 75, in LoadFromSerializedProto
return _sentencepiece.SentencePieceProcessor_LoadFromSerializedProto(self, serialized)
RuntimeError: Internal: [MASK] is already defined.
What is the reason for this?
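The "[MASK] is already defined" error is raised while sentencepiece rebuilds the vocabulary; it is often reported when the tokenizer files and the installed icetk version do not match, though that is only an assumption here. A small isolation test (illustrative) that loads nothing but the tokenizer can show whether train_model_all.py is involved at all:

from transformers import AutoTokenizer

# If this alone raises "[MASK] is already defined", the problem is in the
# tokenizer files / icetk install, not in the training script.
tokenizer = AutoTokenizer.from_pretrained("thuglm", trust_remote_code=True)
print(tokenizer.encode("hello model parallel"))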
chatglm6b-dddd
I downloaded the chatglm6b-dddd model and put the model files and the json files into the Chatglm6b_ModelParallel/thuglm folder, then ran python train_model_all.py. It still reports the original error. Could you try again on your side with the latest code and the latest model?
The error is as follows:
Traceback (most recent call last):
File "/home/87oo/data/workspace/zero_nlp/Chatglm6b_ModelParallel/train_model_all.py", line 322, in <module>
trainer.train()
File "/home/87oo/data/workspace/zero_nlp/Chatglm6b_ModelParallel/MyTrainer.py", line 1629, in train
return inner_training_loop(
File "/home/87oo/data/workspace/zero_nlp/Chatglm6b_ModelParallel/MyTrainer.py", line 1716, in _inner_training_loop
model = self._wrap_model(self.model_wrapped)
File "/home/87oo/data/workspace/zero_nlp/Chatglm6b_ModelParallel/MyTrainer.py", line 1541, in _wrap_model
model = nn.parallel.DistributedDataParallel(
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 582, in __init__
self._log_and_throw(
File "/home/87oo/anaconda3/envs/zero_nlp/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
raise err_type(err_msg)
ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1), device(type='cuda', index=2), device(type='cuda', index=3)}.
I'm using four cards; the device_map_dict is as follows:
device_map_dict = {
    'transformer.word_embeddings': 0,
    'transformer.layers.0': 0, 'transformer.layers.1': 0, 'transformer.layers.2': 0,
    'transformer.layers.3': 0, 'transformer.layers.4': 0, 'transformer.layers.5': 0,
    'transformer.layers.6': 1, 'transformer.layers.7': 1, 'transformer.layers.8': 1,
    'transformer.layers.9': 1, 'transformer.layers.10': 1, 'transformer.layers.11': 1,
    'transformer.layers.12': 1, 'transformer.layers.13': 1,
    'transformer.layers.14': 2, 'transformer.layers.15': 2, 'transformer.layers.16': 2,
    'transformer.layers.17': 2, 'transformer.layers.18': 2, 'transformer.layers.19': 2,
    'transformer.layers.20': 2, 'transformer.layers.21': 2,
    'transformer.layers.22': 3, 'transformer.layers.23': 3, 'transformer.layers.24': 3,
    'transformer.layers.25': 3, 'transformer.layers.26': 3, 'transformer.layers.27': 3,
    'transformer.final_layernorm': 3, 'lm_head': 3
}
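For reference, a map like this can also be generated instead of typed by hand. The helper below is a hypothetical sketch that spreads the 28 ChatGLM-6B blocks roughly evenly over 4 GPUs; the split boundaries differ slightly from the dict above, and the key names assume the ChatGLM-6B checkpoint layout.

def build_device_map(num_layers: int = 28, num_gpus: int = 4) -> dict:
    # Roughly even split of the transformer blocks across the GPUs.
    device_map = {"transformer.word_embeddings": 0}
    per_gpu = (num_layers + num_gpus - 1) // num_gpus
    for i in range(num_layers):
        device_map[f"transformer.layers.{i}"] = min(i // per_gpu, num_gpus - 1)
    # Tail modules on the last GPU, matching the hand-written dict above.
    device_map["transformer.final_layernorm"] = num_gpus - 1
    device_map["lm_head"] = num_gpus - 1
    return device_map

print(build_device_map())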
- Model parallelism means the model's parameters are placed on different cards;
- During training, the data inside each network layer also needs to move to the different cards automatically;
How is this second requirement achieved? Does the chatglm model automatically distribute the computed data across the different cards?
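For what it's worth, when the weights are dispatched with a device_map it is accelerate, not the ChatGLM code itself, that attaches hooks which move each layer's inputs onto that layer's GPU. A minimal sketch with a toy two-layer model (assumes two visible GPUs and accelerate installed; not code from this repo):

import torch
import torch.nn as nn
from accelerate import dispatch_model

model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 4))
device_map = {"0": 0, "1": 1}        # submodule name -> GPU index
model = dispatch_model(model, device_map=device_map)

out = model(torch.randn(2, 16))      # no manual .to() calls: the hooks move
print(out.shape)                     # the activations between the two cards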
Try moving these two layers, 'transformer.final_layernorm': 3 and 'lm_head': 3, onto card 0.
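Spelled out, that suggestion is a two-line edit to the device_map_dict shown above (illustrative; device_map_dict refers to that same dict):

device_map_dict["transformer.final_layernorm"] = 0   # was 3
device_map_dict["lm_head"] = 0                        # was 3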
I've already gotten train_chatglm_all.py to run; I'd like to ask whether the only condition for parallelization is distributing the weights across different cards?
The conditions for model parallelism:
- Model parallelism means the model's parameters are placed on different cards;
- During training, the data inside each network layer also needs to move to the different cards automatically (see the sketch after this list).
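A minimal sketch of what these two conditions mean in code, using a toy two-GPU model rather than ChatGLM (assumes at least two GPUs; this is the naive, manual form of what the device_map setup in this repo automates):

import torch
import torch.nn as nn

class TwoCardModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Condition 1: the parameters live on different cards.
        self.part1 = nn.Linear(16, 16).to("cuda:0")
        self.part2 = nn.Linear(16, 4).to("cuda:1")

    def forward(self, x):
        # Condition 2: the activations follow the weights from card to card.
        x = self.part1(x.to("cuda:0"))
        x = self.part2(x.to("cuda:1"))
        return x

print(TwoCardModel()(torch.randn(2, 16)).shape)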
It seems that enabling parallelism also requires the 'model_parallel' attribute? I used the same code to enable parallelism for GLM-10b, but it raises: 'GLMForConditionalGeneration' object has no attribute 'model_parallel'.
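For context on that AttributeError: some versions of the Hugging Face Trainer decide whether a model is already model-parallel by checking model.is_parallelizable and model.model_parallel (newer releases also look at hf_device_map), and the GLM remote code apparently does not define model_parallel. A commonly mentioned workaround is sketched below; the attribute names should be checked against the installed transformers version, and THUDM/glm-10b is only assumed as the checkpoint name:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "THUDM/glm-10b", trust_remote_code=True, device_map="auto"
)
# Assumption: the Trainer skips DDP wrapping when these flags are set; the GLM
# implementation may not define them even though its weights span several GPUs.
if not hasattr(model, "model_parallel"):
    model.is_parallelizable = True
    model.model_parallel = True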
Use chatglm-6b-v2 instead; don't bother with the first version any more: https://github.com/yuanzhoulvpi2017/zero_nlp/tree/main/chatglm_v2_6b_lora