training problem

jialesmu commented 3 years ago

Running inside Docker: CUDA_VISIBLE_DEVICES=0,1 python train.py

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
-----------  Configuration Arguments -----------
augment_conf_path: ./conf/augmentation.config
batch_size: 4
dev_manifest: ./dataset/manifest.test
init_from_pretrained_model: None
learning_rate: 5e-05
max_duration: 15.0
mean_std_path: ./dataset/mean_std.npz
min_duration: 1.0
num_conv_layers: 2
num_epoch: 200
num_rnn_layers: 3
output_model_dir: ./models
rnn_layer_size: 2048
share_rnn_weights: False
shuffle_method: batch_shuffle_clipped
test_off: False
train_manifest: ./dataset/manifest.train
use_gpu: True
use_gru: True
vocab_path: ./dataset/zh_vocab.txt
------------------------------------------------
dataset/manifest.noise does not exist; noise augmentation has been skipped!
[2021-08-04 07:00:17.697473] Number of training samples: 102394

W0804 07:00:19.211035   201 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.1, Runtime API Version: 10.2
W0804 07:00:22.748769   201 device_context.cc:422] device: 0, cuDNN Version: 7.6.
W0804 07:42:36.315891   201 operator.cc:242] gaussian_random raises an exception thrust::system::system_error, parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
Traceback (most recent call last):
  File "train.py", line 94, in <module>
    main()
  File "train.py", line 90, in main
    train()
  File "train.py", line 85, in train
    test_off=args.test_off)
  File "/home/DeepSpeech/model_utils/model.py", line 285, in train
    exe.run(startup_prog)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1110, in run
    six.reraise(*sys.exc_info())
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1108, in run
    return_merged=return_merged)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1239, in _run_impl
    use_program_cache=use_program_cache)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1329, in _run_program
    [fetch_var_name])
RuntimeError: parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device

Could this be caused by the GPU driver, or is it something else? NVIDIA-SMI 455.28 Driver Version: 455.28 CUDA Version: 11.1 GeForce RTX 3090

yeyupiaoling commented 3 years ago

@jialesmu Are you using nvidia-docker? Does training work on a single GPU? If it does, check whether NCCL is installed.
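
One way to run the single-GPU check suggested here is to restrict the visible devices before launching the script. The following is a minimal sketch, assuming the entry point is the `train.py` from the original command; the helper names `single_gpu_env` and `run_single_gpu` are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys


def single_gpu_env(gpu_index=0, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES limits which devices the CUDA runtime can see.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_single_gpu(script="train.py", gpu_index=0):
    """Launch the script with the current interpreter on a single GPU.

    Returns the exit code of the child process.
    """
    result = subprocess.run([sys.executable, script], env=single_gpu_env(gpu_index))
    return result.returncode
```

Calling `run_single_gpu()` from the directory that contains `train.py` should be roughly equivalent to running `CUDA_VISIBLE_DEVICES=0 python train.py` in a shell.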

jialesmu commented 3 years ago

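> @jialesmu Are you using nvidia-docker? Does training work on a single GPU? If it does, check whether NCCL is installed.

Thanks for the reply. Yes, I pulled the nvidia-docker image. During training the log shows Driver API Version: 11.1, Runtime API Version: 10.2. Could that be the cause?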
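
The version question raised here can be checked mechanically against the `device_context.cc` log line from the first post. The sketch below parses that line and compares the reported runtime version with the first CUDA toolkit release that can compile native code for the reported compute capability. The table is a partial one taken from NVIDIA's CUDA release notes (for example, compute capability 8.6 first appears in CUDA 11.1); the function names are hypothetical. Whether a particular PaddlePaddle build embeds PTX that the driver could JIT-compile for a newer GPU is not known from this thread, so treat the result as a hint rather than a diagnosis.

```python
import re

# First CUDA toolkit release with native support for each compute
# capability (partial table, based on NVIDIA's CUDA release notes).
MIN_CUDA_FOR_CC = {
    (6, 0): (8, 0),
    (6, 1): (8, 0),
    (7, 0): (9, 0),
    (7, 5): (10, 0),
    (8, 0): (11, 0),
    (8, 6): (11, 1),
}

LOG_PATTERN = re.compile(
    r"GPU Compute Capability: (\d+)\.(\d+), "
    r"Driver API Version: (\d+)\.(\d+), "
    r"Runtime API Version: (\d+)\.(\d+)"
)


def parse_device_log(line):
    """Extract (major, minor) version tuples from a device_context log line."""
    match = LOG_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a device_context log entry")
    nums = [int(group) for group in match.groups()]
    return {
        "cc": (nums[0], nums[1]),
        "driver": (nums[2], nums[3]),
        "runtime": (nums[4], nums[5]),
    }


def runtime_has_native_kernels(cc, runtime):
    """Return True or False, or None if the capability is not in the table."""
    needed = MIN_CUDA_FOR_CC.get(cc)
    if needed is None:
        return None
    return runtime >= needed


if __name__ == "__main__":
    line = (
        "W0804 07:00:19.211035   201 device_context.cc:404] Please NOTE: "
        "device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.1, "
        "Runtime API Version: 10.2"
    )
    info = parse_device_log(line)
    print(info, runtime_has_native_kernels(info["cc"], info["runtime"]))
```

For the values in this log (compute capability 8.6 with runtime 10.2) the check returns `False`, which would be consistent with the `cudaErrorNoKernelImageForDevice` error: a framework binary built against CUDA 10.2 may simply contain no kernels for the RTX 3090. If so, a PaddlePaddle build for CUDA 11.1 or later would be the thing to try, though the thread itself does not confirm this.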

yeyupiaoling commented 3 years ago

That shouldn't matter. Does single-GPU training work?

jialesmu commented 3 years ago

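> That shouldn't matter. Does single-GPU training work?

One moment, let me run it. I do have two GPUs though, so I wouldn't expect that to be the problem.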

jialesmu commented 3 years ago

> That shouldn't matter. Does single-GPU training work?

Single-GPU training fails with the same error.

yeyupiaoling commented 3 years ago

@jialesmu Can you set up the environment locally with Anaconda?

jialesmu commented 3 years ago

> @jialesmu Can you set up the environment locally with Anaconda?

I set up Docker on a server. My local machine is a Mac, so it can't use the GPUs.

yeyupiaoling commented 3 years ago

On the server, without Docker.

jialesmu commented 3 years ago

OK, I'll give it a try.

yeyupiaoling / PaddlePaddle-DeepSpeech

training problem #57