wenet-e2e / wenet

Production First and Production Ready End-to-End Speech Recognition Toolkit
https://wenet-e2e.github.io/wenet/
Apache License 2.0
4.14k stars · 1.07k forks

Error during incremental training #2569

Closed: LiSongRan closed this issue 3 months ago

LiSongRan commented 3 months ago

bash run.sh --stage 4 --stop_stage 4
CUDA_VISIBLE_DEVICES is 0
run.sh: using torch ddp
run.sh: num_nodes is 1, proc_per_node is 1
[2024-07-09 15:28:17,054] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-07-09 15:28:20,922 INFO use char tokenizer
2024-07-09 15:28:20,922 INFO training on multiple gpus, this gpu 0, rank 0, world_size 1
2024-07-09 15:28:22,882 INFO [Rank 0] Checkpoint: loading from checkpoint /home/SpeechRecognitioin/wenet/examples/aishell/s0/20220506_u2pp_conformer_exp_wenetspeech/final.pt
2024-07-09 15:28:22,991 INFO missing tensor: encoder.embed.pos_enc.pe
2024-07-09 15:28:22,991 INFO missing tensor: decoder.left_decoder.embed.1.pe
2024-07-09 15:28:22,991 INFO missing tensor: decoder.right_decoder.embed.1.pe
{'accum_grad': 2,
 'cmvn': 'global_cmvn',
 'cmvn_conf': {'cmvn_file': '/home/SpeechRecognitioin/wenet/examples/aishell/s0/20220506_u2pp_conformer_exp_wenetspeech//global_cmvn', 'is_json_cmvn': True},
 'ctc': 'ctc', 'ctc_conf': {'ctc_blank_id': 0},
 'dataset': 'asr',
 'dataset_conf': {'batch_conf': {'batch_size': 2, 'batch_type': 'dynamic', 'max_frames_in_batch': 24000},
                  'fbank_conf': {'dither': 1.0, 'frame_length': 25, 'frame_shift': 10, 'num_mel_bins': 80},
                  'filter_conf': {'max_length': 1200, 'min_length': 10, 'token_max_length': 100, 'token_min_length': 1},
                  'resample_conf': {'resample_rate': 16000},
                  'shuffle': True, 'shuffle_conf': {'shuffle_size': 20000},
                  'sort': True, 'sort_conf': {'sort_size': 2000},
                  'spec_aug': True, 'spec_aug_conf': {'max_f': 30, 'max_t': 50, 'num_f_mask': 2, 'num_t_mask': 2},
                  'speed_perturb': True},
 'decoder': 'bitransformer',
 'decoder_conf': {'attention_heads': 8, 'dropout_rate': 0.1, 'linear_units': 2048, 'num_blocks': 3, 'positional_dropout_rate': 0.1, 'r_num_blocks': 3, 'self_attention_dropout_rate': 0.1, 'src_attention_dropout_rate': 0.1},
 'dtype': 'fp32',
 'encoder': 'conformer',
 'encoder_conf': {'activation_type': 'swish', 'attention_dropout_rate': 0.1, 'attention_heads': 8, 'causal': True, 'cnn_module_kernel': 15, 'cnn_module_norm': 'layer_norm', 'dropout_rate': 0.1, 'input_layer': 'conv2d', 'linear_units': 2048, 'normalize_before': True, 'num_blocks': 12, 'output_size': 512, 'pos_enc_layer_type': 'rel_pos', 'positional_dropout_rate': 0.1, 'selfattention_layer_type': 'rel_selfattn', 'use_cnn_module': True, 'use_dynamic_chunk': True, 'use_dynamic_left_chunk': False},
 'grad_clip': 5, 'input_dim': 80, 'log_interval': 500, 'max_epoch': 640,
 'model': 'asr_model',
 'model_conf': {'ctc_weight': 0.3, 'length_normalized_loss': False, 'lsm_weight': 0.1, 'reverse_weight': 0.3},
 'model_dir': '/home/SpeechRecognitioin/wenet/examples/aishell/s0/20220506_u2pp_conformer_exp_wenetspeech/',
 'optim': 'adam', 'optim_conf': {'lr': 0.002},
 'output_dim': 5538,
 'save_states': 'model_only',
 'scheduler': 'warmuplr', 'scheduler_conf': {'warmup_steps': 100000},
 'tokenizer': 'char',
 'tokenizer_conf': {'bpe_path': None, 'is_multilingual': False, 'non_lang_syms_path': None, 'num_languages': 1, 'special_tokens': {'<blank>': 0, '<eos>': 2, '<sos>': 2, '<unk>': 1}, 'split_with_space': False, 'symbol_table_path': '/home/SpeechRecognitioin/wenet/examples/aishell/s0/20220506_u2pp_conformer_exp_wenetspeech//units.txt'},
 'train_engine': 'torch_ddp', 'use_amp': False, 'vocab_size': 5538, 'init_infos': {}}
2024-07-09 15:28:24,030 INFO [Rank 0] Checkpoint: save to checkpoint /home/SpeechRecognitioin/wenet/examples/aishell/s0/20220506_u2pp_conformer_exp_wenetspeech/init.pt
2024-07-09 15:28:28,272 INFO Epoch 0 Step 0 TRAIN info lr 2.0000e-08 rank 0
2024-07-09 15:28:28,288 INFO using accumulate grad, new batch size is 2 times larger than before
Fatal Python error: Segmentation fault
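For context on the three "missing tensor" lines above: the `*.pe` entries are positional-encoding tables that are normally regenerated when the model is built rather than stored in the checkpoint, so a non-strict load reports them as missing without affecting the pretrained weights. Below is a minimal plain-PyTorch sketch of that pattern; `TinyEncoder` is illustrative only and is not a wenet class.

```python
# Illustrative sketch of non-strict checkpoint loading: buffers that are not
# saved in the checkpoint (e.g. a positional-encoding table) show up as
# "missing" keys, but the load of the trainable weights still succeeds.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, dim: int = 8, max_len: int = 16):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        # Recomputable buffer, analogous to encoder.embed.pos_enc.pe in the log.
        self.register_buffer("pe", torch.zeros(max_len, dim))

model = TinyEncoder()
# Pretend the checkpoint only contains the trainable weights, not the buffer.
ckpt = {k: v for k, v in model.state_dict().items() if not k.endswith("pe")}
missing, unexpected = model.load_state_dict(ckpt, strict=False)
print("missing tensors:", missing)        # ['pe'] -- harmless, recomputed anyway
print("unexpected tensors:", unexpected)  # []
```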

Thread 0x00007f3170ffd700 (most recent call first):
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 320 in wait
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/queues.py", line 233 in _feed
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 946 in run
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 966 in _bootstrap

Thread 0x00007f31f26d2700 (most recent call first):
  File "/root/anaconda3/envs/fspeech/lib/python3.10/selectors.py", line 416 in select
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/connection.py", line 936 in wait
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/connection.py", line 429 in _poll
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/connection.py", line 262 in poll
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/queues.py", line 113 in get
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31 in do_one_step
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54 in _pin_memory_loop
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 946 in run
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 966 in _bootstrap

Thread 0x00007f31a1ff3700 (most recent call first):
  File "/root/anaconda3/envs/fspeech/lib/python3.10/selectors.py", line 416 in select
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/connection.py", line 936 in wait
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/connection.py", line 429 in _poll
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/connection.py", line 262 in poll
  File "/root/anaconda3/envs/fspeech/lib/python3.10/multiprocessing/queues.py", line 113 in get
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/tensorboardX/event_file_writer.py", line 202 in run
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 1009 in _bootstrap_inner
  File "/root/anaconda3/envs/fspeech/lib/python3.10/threading.py", line 966 in _bootstrap

Current thread 0x00007f3304fce740 (most recent call first):
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456 in _conv_forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460 in forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217 in forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/home/SpeechRecognitioin/wenet/wenet/transformer/subsampling.py", line 224 in forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/home/SpeechRecognitioin/wenet/wenet/transformer/encoder.py", line 158 in forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/home/SpeechRecognitioin/wenet/wenet/transformer/asr_model.py", line 95 in forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411 in _run_ddp_forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593 in forward
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541 in _call_impl
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
  File "/home/SpeechRecognitioin/wenet/wenet/utils/train_utils.py", line 660 in batch_forward
  File "/home/SpeechRecognitioin/wenet/wenet/utils/executor.py", line 84 in train
  File "/home/SpeechRecognitioin/wenet/examples/aishell/s0/wenet/bin/train.py", line 154 in main
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/home/SpeechRecognitioin/wenet/examples/aishell/s0/wenet/bin/train.py", line 186 in <module>
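The fatal stack above ends inside torch.nn.Conv2d, called from wenet's conv2d subsampling in the encoder, which points at the PyTorch/CUDA/cuDNN convolution path rather than wenet-specific code. A small standalone sketch like the one below can help check whether a plain GPU convolution already crashes in this environment; the layer sizes and dummy input shape are illustrative guesses modeled on an 80-dim fbank front end, not values taken from the issue.

```python
# Standalone sanity check: run a Conv2d stack on the GPU outside of wenet.
# If this also segfaults, the problem is in the torch / CUDA / cuDNN install,
# not in the training recipe.
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Rough stand-in for a conv2d subsampling front end (channels/kernels are illustrative).
subsample = nn.Sequential(
    nn.Conv2d(1, 512, kernel_size=3, stride=2),
    nn.ReLU(),
    nn.Conv2d(512, 512, kernel_size=3, stride=2),
    nn.ReLU(),
).to(device)

x = torch.randn(2, 1, 1200, 80, device=device)  # dummy batch: (B, 1, frames, mel bins)
y = subsample(x)
if device.type == "cuda":
    torch.cuda.synchronize()
print("conv forward ok, output shape:", tuple(y.shape))
```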

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, yaml._yaml, _cffi_backend, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, psutil._psutil_linux, psutil._psutil_posix, cython.cimports.libc.math, sentencepiece._sentencepiece, google._upb._message (total: 40)

E0709 15:29:34.868538 139688909059904 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -11) local_rank: 0 (pid: 15346) of binary: /root/anaconda3/envs/fspeech/bin/python3.10

Traceback (most recent call last):
  File "/root/anaconda3/envs/fspeech/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/fspeech/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

wenet/bin/train.py FAILED

Failures:
  <NO_OTHER_FAILURES>

-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-07-09_15:29:34
  host       : redis.chaboshi.cn
  rank       : 0 (local_rank: 0)
  exitcode   : -11 (pid: 15346)
  error_file : <N/A>
  traceback  : Signal 11 (SIGSEGV) received by PID 15346
=======================================================

torch          2.3.1+cu121
torch-complex  0.4.4
torchaudio     2.3.1+cu121
torchmetrics   1.4.0.post0
torchvision    0.18.1+cu121
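The versions listed above are all cu121 builds, so the installed wheels are at least nominally consistent. A quick way to confirm what the running interpreter actually reports (CUDA build, cuDNN, visible GPU) is a snippet like the following, which only uses packages already named in the issue and is purely diagnostic:

```python
# Print the CUDA/cuDNN details this environment actually sees.
import torch
import torchaudio

print("torch      :", torch.__version__)
print("torchaudio :", torchaudio.__version__)
print("CUDA build :", torch.version.cuda)
print("cuDNN      :", torch.backends.cudnn.version())
print("CUDA avail :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU        :", torch.cuda.get_device_name(0))
    print("capability :", torch.cuda.get_device_capability(0))
```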
T-freedom commented 3 months ago

How did you solve this?