Running create_data.py on a new, large training set fails with ValueError: negative dimensions are not allowed

a00147600 commented 2 years ago

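Most of the answers I found online say this is caused by running out of memory. Could someone give me some pointers? Here is my terminal output: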

`(ppasr) F:\PPASR-master>python create_data.py
F:\Anaconda3\envs\ppasr\lib\site-packages\librosa\core\constantq.py:1058: DeprecationWarning: `np.complex` is a deprecated alias for the builtin `complex`. To silence this warning, use `complex` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.complex128` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  dtype=np.complex,
F:\PPASR-master\ppasr\data_utils\augmentor\spec_augment.py:5: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  from PIL.Image import BICUBIC
-----------  Configuration Arguments -----------
annotation_path: dataset/annotation/
count_threshold: 2
dataset_vocab: dataset/vocabulary.txt
feature_method: linear
is_change_frame_rate: False
max_test_manifest: 10000
mean_std_path: dataset/mean_std.npz
noise_manifest_path: dataset/manifest.noise
noise_path: dataset/audio/noise
num_samples: 350000
num_workers: 8
test_manifest: dataset/manifest.test
train_manifest: dataset/manifest.train
------------------------------------------------
Generating data lists...
100%|█████████████████████████████████████████████████████████████████████████| 367619/367619 [13:44<00:00, 445.91it/s]
Finished generating data lists; total dataset duration is 353.06 hours!
======================================================================
Generating noise data list...
Creating noise data list, path: dataset/audio/noise, please wait ...
100%|██████████████████████████████████████████████████████████████████████████████████| 21262/21262 [01:42<00:00, 207.14it/s]
======================================================================
Generating vocabulary...
100%|██████████████████████████████████████████████████████████████████████████████| 366883/366883 [00:07<00:00, 49622.18it/s]
100%|████████████████████████████████████████████████████████████████████████████████████| 736/736 [00:00<00:00, 49108.47it/s]
Vocabulary generated!
======================================================================
Sampling 350000 items to compute the mean and standard deviation...
F:\Anaconda3\envs\ppasr\lib\site-packages\paddle\fluid\reader.py:481: UserWarning: DataLoader with multi-process mode is not supported on MacOs and Windows currently. Please use signle-process mode with num_workers = 0 instead
  "DataLoader with multi-process mode is not supported on MacOs and Windows currently." \
  0%|                                                                                     | 1/5469 [00:17<26:13:06, 17.26s/it]W0623 14:40:42.116008 106504 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0623 14:40:42.279247 106504 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
 33%|███████████████████████████▌                                                       | 1818/5469 [51:49<1:35:08,  1.56s/it]F:\Anaconda3\envs\ppasr\lib\site-packages\numpy\core\fromnumeric.py:3441: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
F:\Anaconda3\envs\ppasr\lib\site-packages\numpy\core\_methods.py:189: RuntimeWarning: invalid value encountered in true_divide
  ret = ret.dtype.type(ret / rcount)
Exception in thread Thread-4:
Traceback (most recent call last):
  File "F:\Anaconda3\envs\ppasr\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "F:\Anaconda3\envs\ppasr\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "F:\Anaconda3\envs\ppasr\lib\site-packages\paddle\fluid\dataloader\dataloader_iter.py", line 216, in _thread_loop
    self._thread_done_event)
  File "F:\Anaconda3\envs\ppasr\lib\site-packages\paddle\fluid\dataloader\fetcher.py", line 121, in fetch
    data.append(self.dataset[idx])
  File "F:\PPASR-master\ppasr\data_utils\normalizer.py", line 114, in __getitem__
    feature = self.audio_featurizer.featurize(audio)
  File "F:\PPASR-master\ppasr\data_utils\featurizer\audio_featurizer.py", line 71, in featurize
    stride_ms=self._stride_ms, window_ms=self._window_ms)
  File "F:\PPASR-master\ppasr\data_utils\featurizer\audio_featurizer.py", line 92, in _compute_linear
    windows = np.lib.stride_tricks.as_strided(samples, shape=nshape, strides=nstrides)
  File "F:\Anaconda3\envs\ppasr\lib\site-packages\numpy\lib\stride_tricks.py", line 104, in as_strided
    array = np.asarray(DummyArray(interface, base=x))
ValueError: negative dimensions are not allowed`

yeyupiaoling commented 2 years ago

If it is running out of memory, try changing this: set batch_size to 32. How much RAM do you have? https://github.com/yeyupiaoling/PPASR/blob/5920316281e5ba194e2ceb822c49f5eb7d3af253/ppasr/data_utils/normalizer.py#L79

a00147600 commented 2 years ago

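> If it is running out of memory, try changing this: set batch_size to 32. How much RAM do you have?
>
> https://github.com/yeyupiaoling/PPASR/blob/5920316281e5ba194e2ceb822c49f5eb7d3af253/ppasr/data_utils/normalizer.py#L79

I have 16 GB of RAM. The problem was the num_samples parameter: the error occurs when I set it to -1, and the script ran through normally after I changed it to 100000.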
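A possible explanation, not confirmed in this thread: the traceback ends in `as_strided(samples, shape=nshape, ...)` inside `_compute_linear`, right after NumPy warns about "Mean of empty slice". That pattern is consistent with an audio clip that is empty or shorter than one analysis window, which makes the computed frame count negative. The sketch below assumes the usual DeepSpeech2-style framing formula and assumed defaults (16 kHz, 20 ms window, 10 ms stride); `num_frames` and `is_long_enough` are hypothetical helpers, not PPASR functions.

```python
import numpy as np


def num_frames(num_samples, sample_rate=16000, window_ms=20, stride_ms=10):
    """Frame count for a strided spectrogram (assumed DeepSpeech2-style formula)."""
    window_size = sample_rate * window_ms // 1000
    stride_size = sample_rate * stride_ms // 1000
    # Drop trailing samples that do not fill a whole stride.
    truncate_size = (num_samples - window_size) % stride_size
    usable = num_samples - truncate_size
    # This goes negative when the clip is much shorter than one window.
    return (usable - window_size) // stride_size + 1


def is_long_enough(samples, sample_rate=16000, window_ms=20):
    """True if the clip holds at least one full analysis window."""
    return len(samples) >= sample_rate * window_ms // 1000


if __name__ == "__main__":
    print(num_frames(16000))                 # one second of audio
    print(num_frames(0))                     # empty clip: negative frame count
    print(is_long_enough(np.zeros(0)))       # an empty clip should be filtered out
```

If this is the cause, a smaller num_samples would only avoid the error by chance (the bad clip is not sampled); scanning the manifest for empty or very short files would be the more direct check.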

yeyupiaoling commented 2 years ago

OK.

a00147600 commented 2 years ago

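> OK.

I have another question. The loss is no longer dropping noticeably in the later epochs, and I can see that the learning rate is already very low after a few dozen epochs. Would changing the learning rate help?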

yeyupiaoling commented 2 years ago

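No need to change it; the model has probably more or less converged. Just check what the character error rate (CER) looks like.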
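For reference, character error rate is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The following is a minimal sketch of that definition; it is not PPASR's own metric code, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong character out of six.
    print(cer("今天天气很好", "今天天汽很好"))
```

Because the denominator is the reference length, a CER of 0.32 means roughly one edit for every three reference characters.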

a00147600 commented 2 years ago

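> No need to change it; the model has probably more or less converged. Just check what the character error rate (CER) looks like.

The training output is as follows: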
======================================================================
[2022-06-28 08:39:59.667478] Test batch: [0/23], loss: 10.72764, cer: 0.70104
[2022-06-28 08:40:06.449721] Test batch: [10/23], loss: 11.15109, cer: 0.28308
[2022-06-28 08:40:12.003567] Test batch: [20/23], loss: 33.25569, cer: 0.25520
[2022-06-28 08:40:14.156604] Test epoch: 37, time/epoch: 2:47:49.769075, loss: 18.04504, cer: 0.32284
======================================================================
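The loss has stabilized at around 18. I ran tools/tune.py on Linux, and the final result it reported was: lowest cer of 0.424453, at alpha = 1.000000 and beta = 0.100000.

The transcription quality is actually fairly good, but I feel the loss still has room to come down.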

yeyupiaoling commented 2 years ago

With a dataset this large, the character error rate should not be this bad. Have you changed any other parameters?

a00147600 commented 2 years ago

> With a dataset this large, the character error rate should not be this bad. Have you changed any other parameters?

These are the arguments in my create_data.py:

    add_arg = functools.partial(add_arguments, argparser=parser)
    add_arg('annotation_path', str, 'dataset/annotation/', 'Path to the annotation files')
    add_arg('train_manifest', str, 'dataset/manifest.train', 'Path to the training data list')
    add_arg('test_manifest', str, 'dataset/manifest.test', 'Path to the test data list')
    add_arg('is_change_frame_rate', bool, False, 'Whether to resample all audio to 16000 Hz; this takes a long time')
    add_arg('max_test_manifest', int, 10000, 'Maximum size of the generated test list; if annotation_path contains test.txt, all of test.txt is used')
    add_arg('count_threshold', int, 2, 'Cutoff threshold for character counts; 0 means no limit')
    add_arg('dataset_vocab', str, 'dataset/vocabulary.txt', 'Generated vocabulary file')
    add_arg('num_workers', int, 8, 'Number of data-loading threads')
    add_arg('num_samples', int, 100000, 'Number of audio clips used to compute the mean and standard deviation; -1 uses all data')
    add_arg('mean_std_path', str, 'dataset/mean_std.npz', 'Path of the numpy file storing the mean and standard deviation, suffix (.npz).')
    add_arg('noise_path', str, 'dataset/audio/noise', 'Folder containing the noise audio')
    add_arg('noise_manifest_path', str, 'dataset/manifest.noise', 'Path to the noise data list')
    add_arg('feature_method', str, 'linear', 'Audio preprocessing method', choices=['linear', 'mfcc', 'fbank'])

These are my train.py arguments:

    add_arg('batch_size', int, 32, 'Training batch size')
    add_arg('num_workers', int, 8, 'Number of data-loading threads')
    add_arg('num_epoch', int, 65, 'Number of training epochs')
    add_arg('learning_rate', int, 5e-5, 'Initial learning rate')  # default 5e-5
    add_arg('min_duration', int, 0.5, 'Minimum audio length to keep')
    add_arg('max_duration', int, 20, 'Maximum audio length to keep; -1 means no limit')
    add_arg('use_model', str, 'deepspeech2', 'Model to use', choices=['deepspeech2', 'deepspeech2_big'])
    add_arg('train_manifest', str, 'dataset/manifest.train', 'Path to the training data list')
    add_arg('test_manifest', str, 'dataset/manifest.test', 'Path to the test data list')
    add_arg('dataset_vocab', str, 'dataset/vocabulary.txt', 'Path to the vocabulary')
    add_arg('mean_std_path', str, 'dataset/mean_std.npz', 'Path of the npy file with the dataset mean and standard deviation')
    add_arg('augment_conf_path',str, 'conf/augmentation.json', 'Data augmentation config file, in json format')
    add_arg('save_model_path', str, 'models/', 'Path where models are saved')
    add_arg('feature_method', str, 'linear', 'Audio preprocessing method', choices=['linear', 'mfcc', 'fbank'])
    add_arg('metrics_type', str, 'cer', 'Error rate metric', choices=['cer', 'wer'])
    add_arg('resume_model', str, None, 'Resume training; None means no model is resumed')
    add_arg('pretrained_model', str, "./models/deepspeech2/yuxunlian_model_no0621", 'Path to the pretrained model; None means no pretrained model is used')

Nothing else was changed. The transcription quality fluctuates quite a lot.

yeyupiaoling commented 2 years ago

Is the pretrained model the only thing you changed?

a00147600 commented 2 years ago

> Is the pretrained model the only thing you changed?

I believe so.

yeyupiaoling commented 2 years ago

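After you have trained for 65 epochs, run the evaluation script eval.py and see. It decodes with beam search, so the accuracy should be somewhat higher.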

a00147600 commented 2 years ago

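> After you have trained for 65 epochs, run the evaluation script eval.py and see. It decodes with beam search, so the accuracy should be somewhat higher.

This is the result of running python eval.py --resume_model=models/deepspeech2/best_model: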
==================================================================
The paddlespeech-ctcdecoders library is missing; please install it. On Windows, only ctc_greedy can be used.
[Note] Automatically switched to the ctc_greedy decoder.
==================================================================

100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:16<00:00,  1.41it/s]
Evaluation time: 21s, cer: 0.31460

(ppasr) F:\PPASR-master>

yeyupiaoling commented 2 years ago

Use beam search decoding on Ubuntu.
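For context, the ctc_greedy fallback mentioned in the eval.py output corresponds to the standard greedy CTC rule: take the most likely symbol at each frame, merge consecutive repeats, then drop blanks. Beam search instead keeps several candidate prefixes and can rescore them with a language model. The sketch below shows only the greedy rule; the function name, the blank index of 0, and the toy vocabulary are assumptions, not PPASR's actual decoder interface.

```python
import numpy as np


def ctc_greedy_decode(probs, vocabulary, blank_index=0):
    """Greedy CTC decoding over a (time, vocab) probability matrix."""
    best_path = np.argmax(probs, axis=1)
    chars = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank symbol.
        if index != previous and index != blank_index:
            chars.append(vocabulary[index])
        previous = index
    return "".join(chars)


if __name__ == "__main__":
    vocab = ["<blank>", "你", "好"]           # toy vocabulary, blank at index 0
    path = [1, 1, 0, 2, 2, 0, 2]              # frame-wise best symbols
    probs = np.eye(len(vocab))[path]          # one-hot rows stand in for model output
    print(ctc_greedy_decode(probs, vocab))
```

The blank between the two runs of index 2 is what allows the same character to be emitted twice in a row.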

a00147600 commented 2 years ago

> Use beam search decoding on Ubuntu.

Beam search decoding is the ctc_beam_search option, right? It is currently stuck here and not moving...
======================================================================
Initializing decoder...
language model: is_character_based = 1, max_order = 5, dict_size = 0
Decoder initialized!
======================================================================
96%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋      | 22/23 [01:29<00:06,  6.84s/it]

yeyupiaoling commented 2 years ago

Has it been stuck there for long? How about trying shorter audio: change this to 20. https://github.com/yeyupiaoling/PPASR/blob/5920316281e5ba194e2ceb822c49f5eb7d3af253/eval.py#L12

a00147600 commented 2 years ago

> Has it been stuck there for long? How about trying shorter audio: change this to 20.
>
> https://github.com/yeyupiaoling/PPASR/blob/5920316281e5ba194e2ceb822c49f5eb7d3af253/eval.py#L12

After changing it to 20, the result came out.

======================================================================
Initializing decoder...
language model: is_character_based = 1, max_order = 5, dict_size = 0
Decoder initialized!
======================================================================
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [01:43<00:00,  4.49s/it]
Evaluation time: 107s, cer: 0.37483

yeyupiaoling commented 2 years ago

Why did the character error rate go up? What is your dataset like?

a00147600 commented 2 years ago

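> Why did the character error rate go up? What is your dataset like?

The dataset is very large, so I used the recognition results that Tencent Cloud produced earlier as the labels; checking by ear would be too slow. There are about 360,000 audio clips in total, roughly 350 hours. The audio quality is mediocre, though: background noise and accents are very common.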

yeyupiaoling commented 2 years ago

Is it purely Chinese?

a00147600 commented 2 years ago

> Is it purely Chinese?

I can guarantee that it is purely Chinese. I made sure of that when I created my_audio.txt.

yeyupiaoling commented 2 years ago

How about trying the deepspeech2_big model?

a00147600 commented 2 years ago

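> How about trying the deepspeech2_big model?

Is there a specific link for the deepspeech2_big model? In fact, after I used ctc_beam_search decoding on Linux, the accuracy improved further and the character error rate went down as well. As I see it, the errors come mainly from poor results on numbers and phone numbers. Admittedly, my training set has inaccurate conversions from digits to Chinese text in this area. For example: 137577 in Chinese is 幺三七五七七, and 2000 in Chinese is 两千.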
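As an illustration of the digit-to-text issue described above, the sketch below converts digit strings to a digit-by-digit Chinese reading, using 幺 for 1 as in the 137577 example. The mapping and the function name are assumptions for illustration only. It deliberately does not handle cardinal readings such as 2000 as 两千, which need separate rules and a way to tell identifiers (phone numbers, codes) apart from quantities.

```python
# Digit-by-digit readings, using 幺 for 1 as is common for phone numbers.
DIGIT_READINGS = {
    "0": "零", "1": "幺", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}


def read_digits(text):
    """Replace each ASCII digit with its spoken form; leave other characters alone."""
    return "".join(DIGIT_READINGS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(read_digits("137577"))
    # A quantity like 2000 would come out digit by digit here, not as 两千.
    print(read_digits("2000"))
```

Applying one consistent normalization rule to all transcripts, whichever rule is chosen, should make the labels easier for the model to learn than mixed conventions.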

yeyupiaoling commented 2 years ago

To use deepspeech2_big, see this: https://github.com/yeyupiaoling/PPASR/blob/5920316281e5ba194e2ceb822c49f5eb7d3af253/train.py#L15

If your dataset itself has problems, that will seriously affect training.

a00147600 commented 2 years ago

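> To use deepspeech2_big, see this:
>
> https://github.com/yeyupiaoling/PPASR/blob/5920316281e5ba194e2ceb822c49f5eb7d3af253/train.py#L15
>
> If your dataset itself has problems, that will seriously affect training.

Thanks. If I start using deepspeech2_big, do I need to train again from the first epoch? Honestly, on a 100-point scale I would rate the current model's transcription quality above 80.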

yeyupiaoling commented 2 years ago

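Yes, you do; the model architecture is different. With ctc_beam_search (beam search) decoding, a lower score is better, while with greedy decoding it is the opposite.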

yeyupiaoling / PPASR

Running create_data.py on a new, large training set fails with ValueError: negative dimensions are not allowed #84