zhenhao-huang / CPM-1-Finetune-Text-Generation

Finetune CPM-1 For Text Generation
MIT License

How do I train on a single machine with a single GPU? #4

Open waduhekxx opened 2 years ago

waduhekxx commented 2 years ago

My resources are limited: I have only one machine with a single GPU. After finishing the setup, I cannot get past the following NCCL error:

Traceback (most recent call last):
  File "finetune_text_generation_src.py", line 324, in <module>
    main()
  File "finetune_text_generation_src.py", line 208, in main
    model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
  File "/root/workspace/CPM-1-Finetune-Text-Generation/utils.py", line 493, in setup_model_and_optimizer
    model = get_model(args, model_cls)
  File "/root/workspace/CPM-1-Finetune-Text-Generation/utils.py", line 419, in get_model
    model = DDP(model)
  File "/root/workspace/CPM-1-Finetune-Text-Generation/model/distributed.py", line 35, in __init__
    dist.broadcast(p, src_rank, group=self.data_parallel_group)
  File "/root/anaconda3/envs/cpm/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 846, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1595629427478/work/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8
Traceback (most recent call last):
  File "/root/anaconda3/envs/cpm/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/anaconda3/envs/cpm/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/cpm/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/root/anaconda3/envs/cpm/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/cpm/bin/python3', '-u', 'finetune_text_generation_src.py', '--local_rank=0', '--do_train', '--do_eval', '--data_dir', './data/novel/preprocessed/', '--model-parallel-size', '1', '--num-layers', '5', '--hidden-size', '2560', '--load', 'checkpoints/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--tokenizer-type', 'GPT2BPETokenizer', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--lr', '0.00001', '--warmup', '0.1', '--batch-size', '1', '--deepspeed', '--deepspeed_config', '/root/workspace/CPM-1-Finetune-Text-Generation/scripts/novel/../ds_config/ds_finetune_large_fp32.json', '--log-interval', '10', '--eval-interval', '50', '--seed', '23333', '--results_dir', 'results/', '--model_name', 'finetune-novel', '--epoch', '10', '--checkpoint-activations']' returned non-zero exit status 1.

Does anyone know how to train on a single GPU without going through NCCL?
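The "unhandled system error" message by itself says little about what went wrong. A common first diagnostic step, not mentioned in this thread, is to rerun with the NCCL environment variable NCCL_DEBUG set to INFO so that NCCL prints more detail. The sketch below only prepares such an environment; the helper name nccl_debug_env is hypothetical, and whether the extra logging reveals the cause here is an assumption.

```python
import os


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with verbose NCCL logging enabled.

    Hypothetical helper for illustration; it does not modify os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG=INFO asks NCCL to log initialization and error details.
    env["NCCL_DEBUG"] = "INFO"
    return env


env = nccl_debug_env()
print(env["NCCL_DEBUG"])
```

The resulting dictionary could be passed as the env argument of subprocess.run when relaunching the training script.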

zhenhao-huang commented 2 years ago

For a single GPU, the script parameters nproc_per_node and model-parallel-size both need to be set to 1.
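As a rough illustration of that advice, the sketch below assembles a single-process launch command. The launcher module (torch.distributed.launch) and the script name are taken from the traceback above; the helper build_single_gpu_command is hypothetical, and the repository's own shell scripts may set these values differently.

```python
import shlex
import sys


def build_single_gpu_command(script="finetune_text_generation_src.py",
                             script_args=None):
    """Return an argv list that starts exactly one training process."""
    launcher = [
        sys.executable, "-m", "torch.distributed.launch",
        "--nproc_per_node", "1",  # one process, hence one GPU
    ]
    # Only the parallelism flag is set here; the other flags seen in the
    # traceback (--num-layers, --hidden-size, ...) would be passed through.
    args = ["--model-parallel-size", "1"] + list(script_args or [])
    return launcher + [script] + args


command = build_single_gpu_command(script_args=["--do_train", "--batch-size", "1"])
print(" ".join(shlex.quote(part) for part in command))
```

Building the command as a list keeps each flag and its value visibly paired, which makes it easier to confirm that both settings are really 1 before launching.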

waduhekx commented 2 years ago

I changed those parameters, but it still won't run afterwards. I'm stumped.

zhenhao-huang commented 2 years ago

Try a different PyTorch version.
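Before swapping versions, it can help to record what is currently installed. The traceback above reports NCCL 2.4.8 under Python 3.7. The following is a minimal sketch, with a hypothetical helper name collect_versions, that reads the PyTorch, CUDA, and NCCL versions and falls back to None when something is unavailable.

```python
def collect_versions():
    """Return a dict with the installed PyTorch, CUDA, and NCCL versions."""
    report = {}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    report["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        from torch.cuda import nccl
        report["nccl"] = nccl.version()
    except Exception:
        # NCCL support may be missing from this build.
        report["nccl"] = None
    return report


print(collect_versions())
```

Comparing this output before and after reinstalling shows whether the bundled NCCL version actually changed.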

waduhekx commented 2 years ago

Try a different PyTorch version.

1