yangjianxin1 / Firefly

Firefly: a training tool for large language models. It supports training Qwen2.5, Qwen2, Yi1.5, Phi-3, Llama3, Gemma, MiniCPM, Yi, Deepseek, Orion, Xverse, Mixtral-8x7B, Zephyr, Mistral, Baichuan2, Llama2, Llama, Qwen, Baichuan, ChatGLM2, InternLM, Ziya2, Vicuna, Bloom, and other large models.
5.7k stars 518 forks

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: how can I fix this error? #32

Open zxy333666 opened 1 year ago

zxy333666 commented 1 year ago

CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
2023-06-21 07:39:30.573 | INFO | __main__:init_components:100 - Initializing components...
2023-06-21 07:39:30.573 | INFO | __main__:init_components:100 - Initializing components...
Traceback (most recent call last)

/data/firefly/Firefly/train_qlora.py:196 in <module>
    193
    194
    195 if __name__ == "__main__":
  ❱ 196     main()
    197
    198
    199

/data/firefly/Firefly/train_qlora.py:181 in main
    178     # 进行一些配置和检查
    179     args, training_args = setup_everything()
    180     # 加载各种组件
  ❱ 181     trainer = init_components(args, training_args)
    182     # 开始训练
    183     logger.info(" starting training ")
    184     train_result = trainer.train()

/data/firefly/Firefly/train_qlora.py:111 in init_components
    108         local_rank = int(os.environ.get('LOCAL_RANK', '0'))
    109         device_map = {'': local_rank}
    110     # 加载tokenzier
  ❱ 111     tokenizer = AutoTokenizer.from_pretrained(
    112         args.model_name_or_path,
    113         trust_remote_code=True,
    114     )

/opt/conda/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py:702 in from_pretrained
    699                 raise ValueError(
    700                     f"Tokenizer class {tokenizer_class_candidate} does not exist or is n
    701                 )
  ❱ 702             return tokenizer_class.from_pretrained(pretrained_model_name_or_path, input
    703
    704         # Otherwise we have to be creative.
    705         # if model is an encoder decoder, the encoder tokenizer class is used by default

/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1811 in from_pretrained
    1808             else:
    1809                 logger.info(f"loading file {file_path} from cache at {resolved_vocab_fil
    1810
  ❱ 1811         return cls._from_pretrained(
    1812             resolved_vocab_files,
    1813             pretrained_model_name_or_path,
    1814             init_configuration,

/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1965 in _from_pretrained
    1962
    1963         # Instantiate tokenizer.
    1964         try:
  ❱ 1965             tokenizer = cls(*init_inputs, **init_kwargs)
    1966         except OSError:
    1967             raise OSError(
    1968                 "Unable to load vocabulary from file. "

/opt/conda/lib/python3.10/site-packages/transformers/models/bloom/tokenization_bloom_fast.py:121 in __init__
    118         clean_up_tokenization_spaces=False,
    119         **kwargs,
    120     ):
  ❱ 121         super().__init__(
    122             vocab_file,
    123             merges_file,
    124             tokenizer_file=tokenizer_file,

/opt/conda/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py:111 in __init__
    108             fast_tokenizer = copy.deepcopy(tokenizer_object)
    109         elif fast_tokenizer_file is not None and not from_slow:
    110             # We have a serialization from tokenizers which let us directly build the ba
  ❱ 111             fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    112         elif slow_tokenizer is not None:
    113             # We need to convert a slow tokenizer to build the backend
    114             fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)

Exception: expected value at line 1 column 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24024) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_qlora.py FAILED

Failures:
[1]:
  time : 2023-06-21_07:39:33
  host : dbcloud
  rank : 1 (local_rank: 1)
  exitcode : 1 (pid: 24025)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2023-06-21_07:39:33
  host : dbcloud
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 24024)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
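
The innermost frame in the log above is `TokenizerFast.from_file(fast_tokenizer_file)`, and `expected value at line 1 column 1` is the kind of message a JSON parser gives when the very first character of the file is not valid JSON. One thing that may be worth ruling out is that the tokenizer file in the model directory is empty, truncated, or a Git LFS pointer rather than the real file. This is an assumption, not something confirmed in this thread. A minimal sketch of such a check follows; the helper name `check_tokenizer_json` and the file name `tokenizer.json` are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def check_tokenizer_json(path):
    """Return a short diagnosis for a tokenizer file (hypothetical helper).

    Possible results: "missing", "empty", "git-lfs pointer",
    "invalid json", or "ok".
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    text = p.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        return "empty"
    # A Git LFS pointer is a small text file that starts with this line
    # instead of the real file contents.
    if text.startswith("version https://git-lfs.github.com/spec/"):
        return "git-lfs pointer"
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return "invalid json"
    return "ok"
```

Calling `check_tokenizer_json("<model dir>/tokenizer.json")` with the directory passed as `model_name_or_path` would show whether the file parses at all. Anything other than `"ok"` suggests re-downloading the model files before looking elsewhere.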

rDearApple commented 1 year ago

I'm getting the same error. Could you please take a look? Thanks.

rDearApple commented 1 year ago

Has this been resolved?

rDearApple commented 1 year ago

Hi, I resolved this on my end by setting "gradient_checkpointing" to false.
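
The reply does not say which file was edited. Assuming the training arguments live in a JSON file and that `gradient_checkpointing` is a top-level key in it, the change can be sketched as below. The helper name `disable_gradient_checkpointing` is hypothetical, and the sketch only shows how to flip the setting; it does not explain why doing so removed the error for this user.

```python
import json


def disable_gradient_checkpointing(config_text):
    """Return the JSON config text with gradient_checkpointing set to false.

    Assumes the config is a JSON object with top-level training arguments.
    """
    config = json.loads(config_text)
    config["gradient_checkpointing"] = False
    # ensure_ascii=False keeps any non-ASCII values readable in the output.
    return json.dumps(config, indent=4, ensure_ascii=False)
```

To apply it, read the training-args JSON file into a string, pass it through the helper, and write the result back before launching `train_qlora.py` again.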