yeyupiaoling / PPASR

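An end-to-end Chinese speech recognition project implemented with PaddlePaddle, from getting started to real-world use: a very simple introductory example and a very practical enterprise project. Supports the currently most popular models: DeepSpeech2, Conformer, and Squeezeformer.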
Apache License 2.0
807 stars 128 forks

Container rank 6 status failed cmd ['/disk1/fxh/anaconda/envs/paddle/bin/python', '-u', 'train.py'] code -6 log log/workerlog.6 #118

Closed Tian14267 closed 1 year ago

Tian14267 commented 1 year ago

Hello, when I train with U2 I get the following error:
`[2022-11-09 09:52:48.575314 INFO ] trainer:__train_epoch:280 - Train epoch: [1/100], batch: [84400/85957], loss: 5.19813, learning rate: 0.00084408, speed: 8.33 data/sec, eta: 378 days, 12:20:13
[2022-11-09 09:59:13.003253 INFO ] trainer:__train_epoch:280 - Train epoch: [1/100], batch: [84500/85957], loss: 4.45458, learning rate: 0.00084508, speed: 8.32 data/sec, eta: 378 days, 16:41:54
[2022-11-09 10:04:47.221242 INFO ] trainer:__train_epoch:280 - Train epoch: [1/100], batch: [84600/85957], loss: 2.02312, learning rate: 0.00084608, speed: 9.57 data/sec, eta: 329 days, 5:31:47
[2022-11-09 10:09:18.908887 INFO ] trainer:__train_epoch:280 - Train epoch: [1/100], batch: [84700/85957], loss: 5.01856, learning rate: 0.00084708, speed: 11.78 data/sec, eta: 267 days, 15:06:41
LAUNCH INFO 2022-11-09 10:11:32,053 Pod failed
LAUNCH ERROR 2022-11-09 10:11:32,058 Container failed !!!
Container rank 6 status failed cmd ['/disk1/fxh/anaconda/envs/paddle/bin/python', '-u', 'train.py'] code -6 log log/workerlog.6
env {'CONDA_SHLVL': '2', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:', 'CONDA_EXE': '/disk1/fxh/anaconda/bin/conda', 'SSH_CONNECTION': '10.10.208.127 64380 192.168.193.20 22', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', '_': '/disk1/fxh/anaconda/envs/paddle/bin/python', 'LANG': 'en_US.UTF-8', 'DISPLAY': 'localhost:11.0', 'OLDPWD': '/disk1/fffan/0_paddlespeech/PPASR_u2/models', 'CONDA_PREFIX': '/disk1/fxh/anaconda/envs/paddle', '_CE_M': '', 'CLASSPATH': '$:CLASSPATH:/usr/local/java/jdk1.8.0_202/lib/', 'XDG_SESSION_ID': '318', 'USER': 'root', 'CONDA_PREFIX_1': '/disk1/fxh/anaconda', 'PWD': '/disk1/fffan/0_paddlespeech/PPASR_u2', 'HOME': '/root', 'CONDA_PYTHON_EXE': '/disk1/fxh/anaconda/bin/python', 'SSH_CLIENT': '10.10.208.127 64380 22', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', '_CE_CONDA': '', 'CONDA_PROMPT_MODIFIER': '(paddle) ', 'SSH_TTY': '/dev/pts/1', 'MAIL': '/var/mail/root', 'SHELL': '/bin/bash', 'TERM': 'xterm', 'SHLVL': '2', 'LOGNAME': 'root', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/0/bus', 'XDG_RUNTIME_DIR': '/run/user/0', 'PATH': '/home/tools/kaldi/tools/openfst-1.7.2/bin:/home/tools/srilm/bin/i686-m64:/disk1/fxh/anaconda/envs/paddle/bin:/disk1/fxh/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/java/jdk1.8.0_202/bin', 'CONDA_DEFAULT_ENV': 'paddle', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'POD_NAME': 'xozsmc', 'PADDLE_MASTER': '192.168.193.20:62565', 'PADDLE_GLOBAL_SIZE': '7', 'PADDLE_LOCAL_SIZE': '7', 'PADDLE_GLOBAL_RANK': '6', 'PADDLE_LOCAL_RANK': '6', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '192.168.193.20:62566,192.168.193.20:62567,192.168.193.20:62568,192.168.193.20:62569,192.168.193.20:62570,192.168.193.20:62571,192.168.193.20:62572', 'PADDLE_CURRENT_ENDPOINT': '192.168.193.20:62572', 'PADDLE_TRAINER_ID': '6', 'PADDLE_TRAINERS_NUM': '7', 'PADDLE_RANK_IN_NODE': '6', 'FLAGS_selected_gpus': '7'}
LAUNCH INFO 2022-11-09 10:11:32,058 ------------------------- ERROR LOG DETAIL -------------------------
s:print_arguments:29 - use_model: conformer_offline
[2022-11-07 01:24:40.613560 INFO ] utils:print_arguments:30 - ------------------------------------------------
I1107 01:24:40.622406 3509 tcp_utils.cc:130] Successfully connected to 192.168.193.20:62565
W1107 01:24:44.470811 3509 gpu_resources.cc:61] Please NOTE: device: 7, GPU Compute Capability: 6.1, Driver API Version: 11.2, Runtime API Version: 10.2
W1107 01:24:44.474025 3509 gpu_resources.cc:91] device: 7, cuDNN Version: 8.1.
2022-11-07 01:24:47,633-INFO: [topology.py:187:__init__] HybridParallelInfo: rank_id: 6, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 7, mp_group: [6], sharding_group: [6], pp_group: [6], dp_group: [0, 1, 2, 3, 4, 5, 6], check/clip group: [6]
[2022-11-07 01:24:47.647099 WARNING] augmentation:_parse_pipeline_from:126 - dataset/manifest.noise不存在,已经忽略噪声增强操作!
[2022-11-07 01:24:47.647398 INFO ] augmentation:_parse_pipeline_from:128 - 数据增强配置:{'type': 'resample', 'aug_type': 'audio', 'params': {'new_sample_rate': [8000, 32000, 44100, 48000]}, 'prob': 0.0}
[2022-11-07 01:24:47.647540 INFO ] augmentation:_parse_pipeline_from:128 - 数据增强配置:{'type': 'speed', 'aug_type': 'audio', 'params': {'min_speed_rate': 0.9, 'max_speed_rate': 1.1, 'num_rates': 3}, 'prob': 1.0}
[2022-11-07 01:24:47.647657 INFO ] augmentation:_parse_pipeline_from:128 - 数据增强配置:{'type': 'shift', 'aug_type': 'audio', 'params': {'min_shift_ms': -5, 'max_shift_ms': 5}, 'prob': 1.0}
[2022-11-07 01:24:47.647768 INFO ] augmentation:_parse_pipeline_from:128 - 数据增强配置:{'type': 'volume', 'aug_type': 'audio', 'params': {'min_gain_dBFS': -15, 'max_gain_dBFS': 15}, 'prob': 1.0}
[2022-11-07 01:24:47.648630 INFO ] augmentation:_parse_pipeline_from:128 - 数据增强配置:{'type': 'specaug', 'aug_type': 'feature', 'params': {'inplace': True, 'max_time_warp': 5, 'max_t_ratio': 0.05, 'n_freq_masks': 2, 'max_f_ratio': 0.15, 'n_time_masks': 2, 'replace_with_zero': False}, 'prob': 1.0}
[2022-11-07 01:27:30.434729 INFO ] trainer:train:404 - 训练数据:19254485
/disk1/fxh/anaconda/envs/paddle/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:277: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int64, but right dtype is paddle.int32, the right dtype will convert to paddle.int64
.format(lhs_dtype, rhs_dtype, lhs_dtype))


C++ Traceback (most recent call last):

0 paddle::pybind::ThrowExceptionToPython(std::__exception_ptr::exception_ptr)


Error Message Summary:

FatalError: Process abort signal is detected by the operating system.
[TimeInfo: Aborted at 1667988686 (unix time) try "date -d @1667988686" if you are using GNU date ]
[SignalInfo: SIGABRT (@0xdb5) received by PID 3509 (TID 0x7f5790dd40c0) from PID 3509 ]`

I don't understand what is causing this. I am training on multiple GPUs (7 cards). Training data: 19254485. The problem only showed up when the first epoch was almost finished. What is going on here?
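
For anyone reading the log above, here is a minimal sketch of how the launcher's `code -6` and the `TimeInfo` value can be decoded with Python's standard library. It assumes the launcher reports the exit code the way Python's `subprocess` does (a negative value `-N` means "terminated by signal N"); the helper name `describe_exit_code` is made up for this illustration and is not part of PPASR or PaddlePaddle.

```python
import signal
from datetime import datetime, timezone


def describe_exit_code(code: int) -> str:
    """Describe a child-process exit code (hypothetical helper).

    Assumes the subprocess convention: a negative value -N means the
    process was terminated by signal number N.
    """
    if code < 0:
        return f"killed by {signal.Signals(-code).name}"
    return f"exited with status {code}"


# Values copied from the log in the question.
print(describe_exit_code(-6))

# The "Aborted at ... (unix time)" value, converted to UTC.
abort_time = datetime.fromtimestamp(1667988686, tz=timezone.utc)
print(abort_time.isoformat())

# The hexadecimal value after '@' in the SignalInfo line.
print(int("0xdb5", 16))
```

On Linux, signal 6 is SIGABRT, which agrees with the `SignalInfo` line, and `0xdb5` is 3509, the PID of the worker. None of this identifies the root cause; it only confirms that the worker aborted itself rather than being killed from outside.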

yeyupiaoling commented 1 year ago

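First, consider whether you are running out of GPU memory. In the first epoch the audio is fed in order from shortest to longest, so memory consumption keeps growing the further training progresses.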
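
To illustrate the short-to-long ordering mentioned above, here is a toy sketch (not PPASR code). It assumes each batch is padded to its longest utterance and that features are produced at 100 frames per second; both are assumptions, and the durations are invented. The function name `padded_frames_per_batch` is hypothetical.

```python
def padded_frames_per_batch(durations, batch_size, frames_per_second=100):
    """Return the padded frame count of each batch (toy model).

    Utterances are sorted from shortest to longest, grouped into
    batches, and every batch is padded to its longest item, so the
    cost of a batch is (items in batch) x (frames of longest item).
    """
    ordered = sorted(durations)
    sizes = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        longest_frames = int(max(batch) * frames_per_second)
        sizes.append(len(batch) * longest_frames)
    return sizes


# Invented durations in seconds, for illustration only.
sizes = padded_frames_per_batch([1.0, 2.0, 3.0, 4.0, 12.0, 15.0], batch_size=2)
print(sizes)
```

With sorted input the per-batch size never decreases, so the largest batches, and therefore the highest memory demand, arrive at the very end of the first epoch. That would be consistent with a failure near batch 84700 of 85957, although it does not prove it.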

Tian14267 commented 1 year ago

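I checked today and removed the long audio files, and I am trying again now. I just did not see any memory-related message, so I am not sure whether that is the cause. Thanks a lot.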
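
Removing long audio, as described above, could look roughly like the sketch below. It assumes the manifest is a text file with one JSON object per line and a `duration` field in seconds; that layout, the 20-second threshold, and the function name `drop_long_entries` are assumptions for illustration, so check them against your own data list before using anything like this.

```python
import json


def drop_long_entries(lines, max_duration=20.0):
    """Keep only manifest lines whose duration is <= max_duration.

    Assumes each non-empty line is a JSON object with a numeric
    "duration" field in seconds (an assumption about the format).
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if float(entry["duration"]) <= max_duration:
            kept.append(line)
    return kept


# Invented sample entries, for illustration only.
sample = [
    '{"audio_filepath": "a.wav", "duration": 3.2, "text": "..."}',
    '{"audio_filepath": "b.wav", "duration": 41.7, "text": "..."}',
]
filtered = drop_long_entries(sample)
print(filtered)
```

Writing the filtered lines to a new file, rather than overwriting the original manifest, keeps it easy to compare runs with and without the long utterances.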

yeyupiaoling commented 1 year ago

I did not see one either, but since training ran normally up to that point, this is the only thing I can suspect.