yeyupiaoling / MASR

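A PyTorch implementation of a streaming and non-streaming automatic speech recognition framework, compatible with both online and offline recognition. It currently supports the Conformer, Squeezeformer, and DeepSpeech2 models, along with multiple data augmentation methods.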
Apache License 2.0
572 stars 100 forks

Multi-GPU training does not work? #11

Closed. wchuan163 closed this issue 2 years ago

wchuan163 commented 3 years ago

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py

Does this project not support multi-GPU training?

yeyupiaoling commented 3 years ago

@wchuan163 Multi-GPU training is not supported.

wchuan163 commented 3 years ago

WARNING:root:NaN or Inf found in input tensor.
[1/200][300/3986] Loss = inf Remain time: 5 days, 23:14:31

What is the likely cause of this?

wchuan163 commented 3 years ago

----------- Configuration Arguments -----------
batch_size: 64
dev_manifest_path: dataset/manifest.dev
epochs: 200
learning_rate: 0.6
restore_model: None
save_model_path: save_model/
train_manifest_path: dataset/manifest.train
vocab_path: dataset/zh_vocab.json

[1/200][0/3986] Loss = 1889.3643 Remain time: 7 days, 1:28:46
[1/200][100/3986] Loss = 118.5203 Remain time: 5 days, 16:50:40
[1/200][200/3986] Loss = 101.8777 Remain time: 6 days, 0:33:07
WARNING:root:NaN or Inf found in input tensor.
[1/200][300/3986] Loss = inf Remain time: 5 days, 23:14:31
WARNING:root:NaN or Inf found in input tensor.
[1/200][400/3986] Loss = inf Remain time: 6 days, 1:10:09
WARNING:root:NaN or Inf found in input tensor.
[1/200][500/3986] Loss = inf Remain time: 5 days, 14:36:45
WARNING:root:NaN or Inf found in input tensor.
[1/200][600/3986] Loss = inf Remain time: 6 days, 0:25:17
WARNING:root:NaN or Inf found in input tensor.
[1/200][700/3986] Loss = inf Remain time: 5 days, 20:21:25
WARNING:root:NaN or Inf found in input tensor.
[1/200][800/3986] Loss = inf Remain time: 5 days, 13:28:04
WARNING:root:NaN or Inf found in input tensor.
[1/200][900/3986] Loss = inf Remain time: 7 days, 10:01:37
WARNING:root:NaN or Inf found in input tensor.
[1/200][1000/3986] Loss = inf Remain time: 6 days, 15:41:54
WARNING:root:NaN or Inf found in input tensor.
[1/200][1100/3986] Loss = inf Remain time: 5 days, 7:41:54
WARNING:root:NaN or Inf found in input tensor.
[1/200][1200/3986] Loss = inf Remain time: 5 days, 10:18:43
WARNING:root:NaN or Inf found in input tensor.
[1/200][1300/3986] Loss = inf Remain time: 6 days, 23:33:22
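To see exactly which steps produced a non-finite loss, a log like the one above can be scanned with a few lines of Python. This is only a sketch: the regular expression assumes the `[epoch/epochs][step/steps] Loss = value` layout shown in this log, and `non_finite_steps` is a hypothetical helper name, not part of MASR.

```python
import math
import re

# Matches entries such as "[1/200][300/3986] Loss = inf" (layout assumed from the log above).
LOG_ENTRY = re.compile(r"\[(\d+)/\d+\]\[(\d+)/\d+\] Loss = (\S+)")


def non_finite_steps(log_text):
    """Return (epoch, step) pairs whose logged loss is inf or NaN."""
    bad = []
    for epoch, step, loss in LOG_ENTRY.findall(log_text):
        # float() accepts "inf" and "nan", so math.isfinite can test them directly.
        if not math.isfinite(float(loss)):
            bad.append((int(epoch), int(step)))
    return bad


sample = (
    "[1/200][200/3986] Loss = 101.8777 Remain time: 6 days, 0:33:07 "
    "WARNING:root:NaN or Inf found in input tensor. "
    "[1/200][300/3986] Loss = inf Remain time: 5 days, 23:14:31 "
    "[1/200][1500/3986] Loss = 94.9272 Remain time: 5 days, 19:39:34"
)
print(non_finite_steps(sample))  # prints [(1, 300)]
```

Because the pattern is applied to the whole text rather than line by line, it works whether the log entries are on separate lines or run together.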

wchuan163 commented 3 years ago

WARNING:root:NaN or Inf found in input tensor.
[1/200][1400/3986] Loss = inf Remain time: 5 days, 6:55:38
[1/200][1500/3986] Loss = 94.9272 Remain time: 5 days, 19:39:34

yeyupiaoling commented 3 years ago

@wchuan163 Keep training and see. As long as NaN does not appear, it is fine.
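One possible reason for a loss that is inf rather than NaN, and that later recovers, is worth checking, assuming the model is trained with a CTC-style loss (the thread does not confirm this). A CTC alignment needs at least one output frame per label plus one extra frame between each pair of identical adjacent labels, so an utterance whose model output is shorter than that has no valid alignment and its loss is infinite. The sketch below shows the check; `min_ctc_frames`, `ctc_feasible`, and the sample data are hypothetical illustrations, not MASR code.

```python
def min_ctc_frames(labels):
    """Fewest output frames a CTC alignment needs for this label sequence."""
    # Identical adjacent labels must be separated by a blank frame.
    repeats = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return len(labels) + repeats


def ctc_feasible(num_frames, labels):
    """True if num_frames output frames can be aligned to labels under CTC."""
    return num_frames >= min_ctc_frames(labels)


# Hypothetical (output_frames, label_ids) pairs standing in for manifest entries.
utterances = [
    (50, [7, 8, 9]),   # plenty of frames
    (3, [3, 3, 5]),    # needs 4 frames because of the repeated label 3
    (4, [3, 3, 5]),    # just enough
]
keep = [u for u in utterances if ctc_feasible(*u)]
print(len(keep))  # prints 2
```

Here `num_frames` must be the length of the model output after any downsampling, not the raw number of audio frames. If PyTorch's `torch.nn.CTCLoss` is the loss in use, its `zero_infinity=True` option is another way to keep such samples from producing an infinite loss.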

wode123 commented 3 years ago

model,
epochs=1000,
batch_size=64,
train_index_path="/home/lukhy/data_aishell/train-sort.manifest",
dev_index_path="/home/lukhy/data_aishell/dev.manifest",
labels_path="/home/lukhy/data_aishell/labels.json",
learning_rate=0.6,
momentum=0.8,
max_grad_norm=0.2,
weight_decay=0,

Hello, training with these hyperparameters gives loss = nan.
Could the author share the parameters you used for training?
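For context on the `max_grad_norm=0.2` entry in the listing above: it is presumably a gradient-clipping threshold (an assumption, since the thread does not show how it is used). Clipping by global norm rescales all gradients so that their combined L2 norm does not exceed the threshold. It limits the step size when gradients are large, but it cannot repair gradients that are already inf or NaN. The pure-Python sketch below illustrates both points; `clip_by_global_norm` is a hypothetical helper operating on a flat list of floats, not the framework's implementation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm.

    Returns None when the norm is inf or NaN, signalling that the
    caller should skip this update instead of applying it.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total):
        return None  # clipping cannot fix non-finite gradients
    if total <= max_norm:
        return list(grads)  # already within the limit
    scale = max_norm / total
    return [g * scale for g in grads]


print(clip_by_global_norm([3.0, 4.0], 0.2))           # norm 5.0 scaled down to 0.2
print(clip_by_global_norm([0.1, 0.0], 0.2))           # unchanged
print(clip_by_global_norm([float("inf"), 1.0], 0.2))  # None: skip the step
```

In a real training loop the same idea applies per batch: check that the loss and gradient norm are finite before calling the optimizer, and lower the learning rate if non-finite values keep appearing.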