Error when running train_parallel.sh

MathamPollard commented 1 year ago

I get an error when running train_parallel.sh. Environment: transformers 4.29.2, CUDA 11.3, Python 3.10.9, PyTorch 1.12.1; torch.cuda.is_available() returns True. I believe my train_parallel.sh itself is fine. The main error message is:

ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cuda', index=0), device(type='cuda', index=1)}.

Could the author please help me figure out why this happens?

yuanzhoulvpi2017 commented 1 year ago

You probably haven't modified device_map. Take a look at it and change it.
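For anyone landing here with the same ValueError: the message says the model's parameters sit on both cuda:0 and cuda:1, while DistributedDataParallel expects each process to hold its whole model on one GPU. Below is a minimal sketch of one way to adjust device_map for that case. It assumes the script is launched with torchrun (which sets LOCAL_RANK) and that the model is loaded through transformers with accelerate installed. The helper names ddp_device_map and load_model_for_ddp are hypothetical and not part of this repo, and AutoModel stands in for whatever class the training script actually uses.

```python
import os


def ddp_device_map(env=None):
    """Build a device_map that puts the whole model on this process's GPU.

    Hypothetical helper: assumes the launcher (e.g. torchrun) exports LOCAL_RANK.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))
    # The empty-string key stands for the entire model; the value is the GPU index.
    return {"": local_rank}


def load_model_for_ddp(model_name_or_path, **kwargs):
    """Load a model onto a single GPU per process (assumed setup, not the repo's code)."""
    # Imported lazily so ddp_device_map can be used without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        model_name_or_path,
        device_map=ddp_device_map(),
        **kwargs,
    )
```

With this mapping every parameter of a given process lives on one device, which is the layout the error message asks for. If the intent is instead to split a single model across several GPUs, the script would need to run as a single process rather than under a distributed launcher.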

AttentionAllUNeed commented 1 year ago

Has this been resolved? I'm running into the same problem and would appreciate advice from both of you. @yuanzhoulvpi2017 @MathamPollard

yuanzhoulvpi2017 / zero_nlp

Error when running train_parallel.sh #115