yeyupiaoling / MASR

A PyTorch implementation of a streaming and non-streaming automatic speech recognition framework, supporting both online and offline recognition. It currently supports the Conformer, Squeezeformer, and DeepSpeech2 models, along with a variety of data augmentation methods.
Apache License 2.0

If I train online and offline models myself under the same conditions, will the offline one perform better? #55

Closed 2651084156 closed 1 year ago

yeyupiaoling commented 1 year ago

If you don't need streaming, the offline model does perform better.

2651084156 commented 1 year ago

[screenshot] This data preprocessing is really slow. Is there any way to speed it up?

2651084156 commented 1 year ago

[screenshot] It looks like a CPU bottleneck, although merging the audio shouldn't be that heavy, should it?

yeyupiaoling commented 1 year ago

Merging is very time-consuming. If your dataset isn't large, there's no need to merge at all. I see you only have about 170K samples, so merging isn't necessary.

2651084156 commented 1 year ago

Roughly how much data does it take before merging is worthwhile? I'm only running a small test set for now.

yeyupiaoling commented 1 year ago

You don't need to consider it unless you have well over ten million samples, and if your disk reads data fast enough you don't need it either.

2651084156 commented 1 year ago

By the way, does it delete the source files while merging, or only after all processing is done?

2651084156 commented 1 year ago

With an SSD there's basically no need to merge, right?

2651084156 commented 1 year ago

Traceback (most recent call last):
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\create_data.py", line 33, in <module>
    trainer.create_data(annotation_path=args.annotation_path,
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\trainer.py", line 391, in create_data
    create_manifest(annotation_path=annotation_path,
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\utils\utils.py", line 129, in create_manifest
    change_rate(audio_path, target_sr=target_sr)
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\utils\utils.py", line 249, in change_rate
    wav = resampy.resample(wav, sr_orig=samplerate, sr_new=target_sr)
  File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\resampy\core.py", line 117, in resample
    raise ValueError(
ValueError: Input signal length=0 is too small to resample from 32000->16000

Process finished with exit code 1. It just threw this error at me, hmm.
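The "Input signal length=0" error above means at least one audio file in the annotation list has zero samples. A minimal pre-check sketch (not part of MASR; the annotation path and the "path<TAB>text" line format are assumptions) for finding such files before running create_data.py:

```python
# Hypothetical pre-check: list audio files that are empty or unreadable,
# so resampling in create_data.py does not crash on them.
import soundfile as sf

def find_bad_audio(annotation_path):
    bad = []
    with open(annotation_path, encoding='utf-8') as f:
        for line in f:
            audio_path = line.strip().split('\t')[0]
            try:
                if sf.info(audio_path).frames == 0:
                    bad.append(audio_path)   # zero-length file
            except RuntimeError:
                bad.append(audio_path)       # unreadable file
    return bad

# Example (path is illustrative):
print(find_bad_audio('dataset/annotation/train.txt'))
```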

2651084156 commented 1 year ago

Wait, so when merging it supports non-WAV files, but without merging only WAV is supported?

2651084156 commented 1 year ago

[screenshot] But it's still very slow. Have you considered multiprocessing?

yeyupiaoling commented 1 year ago

It deletes while it merges.

yeyupiaoling commented 1 year ago

Multithreading isn't supported; it hasn't been implemented yet.

yeyupiaoling commented 1 year ago

With such a small dataset, merging actually increases the processing time.

2651084156 commented 1 year ago

Is the time spent generating the manifest mostly in resampling? [screenshot] I implemented a simple multiprocessing resampler and then set that parameter to False; should that be quite a bit faster? By the way, for the conformer_online [WenetSpeech (10,000 hours)] model on the PaddlePaddle side, is there a PyTorch version?
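A minimal sketch of the kind of multiprocessing resampler described above, assuming the files are readable by soundfile and are overwritten in place at 16 kHz; the annotation path and worker count are illustrative, not MASR defaults:

```python
# Resample each audio file listed in an annotation file using a process pool.
from concurrent.futures import ProcessPoolExecutor
import soundfile as sf
import resampy

def resample_one(path, target_sr=16000):
    wav, sr = sf.read(path, dtype='float32')
    if sr != target_sr and len(wav) > 0:
        wav = resampy.resample(wav, sr_orig=sr, sr_new=target_sr)
        sf.write(path, wav, target_sr)   # overwrite with the resampled audio

if __name__ == '__main__':
    with open('dataset/annotation/train.txt', encoding='utf-8') as f:
        paths = [line.strip().split('\t')[0] for line in f]
    with ProcessPoolExecutor(max_workers=8) as pool:
        list(pool.map(resample_one, paths))
```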

yeyupiaoling commented 1 year ago

Yes. MASR doesn't have it yet.

2651084156 commented 1 year ago

In create_wenetspeech_data.py, does using the threading module really speed things up (given Python's GIL)? This kind of conversion seems compute-bound, so shouldn't it use the multiprocessing module instead? Also, when it deletes files, does it delete the original WenetSpeech files?

2651084156 commented 1 year ago

0%| | 11/2762 [00:01<06:04, 7.55it/s]
C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\data_utils\audio.py:536: RuntimeWarning: divide by zero encountered in log10
  return 10 * np.log10(mean_square)
0%| | 11/2762 [00:01<06:24, 7.15it/s]
Traceback (most recent call last):
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\create_data.py", line 33, in <module>
    trainer.create_data(annotation_path=args.annotation_path,
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\trainer.py", line 423, in create_data
    normalizer.compute_mean_istd(manifest_path=self.configs.dataset_conf.train_manifest,
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\data_utils\normalizer.py", line 67, in compute_mean_istd
    for std1, means1, number1 in tqdm(test_loader):
  File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\tqdm\std.py", line 1195, in __iter__
    for obj in iterable:
  File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\torch\utils\data\dataloader.py", line 681, in __next__
    data = self._next_data()
  File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\torch\utils\data\dataloader.py", line 721, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\data_utils\normalizer.py", line 113, in __getitem__
    feature = self.audio_featurizer.featurize(audio)
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\data_utils\featurizer\audio_featurizer.py", line 50, in featurize
    audio_segment.normalize(target_db=self._target_dB)
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\data_utils\audio.py", line 302, in normalize
    raise ValueError(
ValueError: Cannot normalize segment to -20.000000 dB because the required gain exceeds max_gain_db (300.000000 dB)

2651084156 commented 1 year ago

It threw this one out. Is the audio too quiet?

yeyupiaoling commented 1 year ago

It's probably that the audio volume is either too loud or too quiet.
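The error above comes from the normalization step: a clip whose RMS level sits more than max_gain_db (300 dB) below the -20 dB target cannot be normalized, which in practice means it is essentially silent. A rough check (not MASR's own code; the example path is made up) that mirrors the 10 * log10(mean_square) computation from audio.py:

```python
# Flag clips that are too quiet to normalize to the -20 dB target.
import numpy as np
import soundfile as sf

TARGET_DB = -20.0
MAX_GAIN_DB = 300.0

def rms_db(path):
    wav, _ = sf.read(path, dtype='float32')
    mean_square = np.mean(wav ** 2)
    if mean_square == 0:
        return float('-inf')           # pure digital silence
    return 10 * np.log10(mean_square)

level = rms_db('dataset/audio/example.wav')
if TARGET_DB - level > MAX_GAIN_DB:
    print(f'clip is nearly silent ({level:.1f} dB); consider filtering it out')
```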

2651084156 commented 1 year ago

[screenshot] My GPU isn't fully utilized; should I consider increasing the batch size?

2651084156 commented 1 year ago

[screenshot] The loss suddenly shoots up here; is that a data problem?

yeyupiaoling commented 1 year ago

My GPU isn't fully utilized; should I consider increasing the batch size?

If you have enough GPU memory, you can increase the batch size.

yeyupiaoling commented 1 year ago

The loss suddenly shoots up here; is that a data problem?

It could also be a learning-rate issue.

2651084156 commented 1 year ago

Hmm, after a few more epochs it came back down; it's probably just normal behavior during the warmup phase.

2651084156 commented 1 year ago

[screenshot] These test results look really wrong, hmm.

2651084156 commented 1 year ago

After 35k steps it should be producing at least something, right?

yeyupiaoling commented 1 year ago

How many epochs have you trained for? What character error rate is reported during training?

2651084156 commented 1 year ago

[screenshot] 38 epochs. Is the reported character error rate the same thing as CER?

2651084156 commented 1 year ago

It seems odd: the character error rate reported during training's evaluation is 0.1, but when I test the model separately it's up around 0.9.

2651084156 commented 1 year ago

Another question: why does the vocabulary assign the same number to different characters? [screenshot] They don't even share the same pronunciation.

yeyupiaoling commented 1 year ago

The number after each character is its occurrence count.

yeyupiaoling commented 1 year ago

It may be that you are not using the same config file, the same vocabulary, or the same mean/standard-deviation file.

2651084156 commented 1 year ago

Hmm, that really was it: eval.py defaults to the online model, and I trained the offline one. But no, switching to the offline config file immediately throws an error:

File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\eval.py", line 28, in <module>
  loss, error_result = trainer.evaluate(resume_model=args.resume_model.format(configs['use_model'],
File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\trainer.py", line 532, in evaluate
  self.model.load_state_dict(model_state_dict)
File "C:\ProgramData\Anaconda3\envs\py10\lib\site-packages\torch\nn\modules\module.py", line 1604, in load_state_dict
  raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ConformerModel:
  size mismatch for decoder.left_decoder.embed.0.weight: copying a param with shape torch.Size([4448, 256]) from checkpoint, the shape in current model is torch.Size([5097, 256]).
  size mismatch for decoder.left_decoder.output_layer.weight: copying a param with shape torch.Size([4448, 256]) from checkpoint, the shape in current model is torch.Size([5097, 256]).
  size mismatch for decoder.left_decoder.output_layer.bias: copying a param with shape torch.Size([4448]) from checkpoint, the shape in current model is torch.Size([5097]).
  size mismatch for decoder.right_decoder.embed.0.weight: copying a param with shape torch.Size([4448, 256]) from checkpoint, the shape in current model is torch.Size([5097, 256]).
  size mismatch for decoder.right_decoder.output_layer.weight: copying a param with shape torch.Size([4448, 256]) from checkpoint, the shape in current model is torch.Size([5097, 256]).
  size mismatch for decoder.right_decoder.output_layer.bias: copying a param with shape torch.Size([4448]) from checkpoint, the shape in current model is torch.Size([5097]).
  size mismatch for ctc.ctc_lo.weight: copying a param with shape torch.Size([4448, 256]) from checkpoint, the shape in current model is torch.Size([5097, 256]).
  size mismatch for ctc.ctc_lo.bias: copying a param with shape torch.Size([4448]) from checkpoint, the shape in current model is torch.Size([5097]).
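The size mismatches above mean the checkpoint was trained with a 4,448-token vocabulary while the current config builds a 5,097-token model, i.e. a different vocabulary file is in use. A quick way (an assumed snippet, not part of MASR; the checkpoint path is illustrative and it assumes the file holds a plain state dict) to read the vocabulary size straight from the checkpoint before calling evaluate():

```python
# Inspect the output-layer shapes stored in a checkpoint to see which
# vocabulary size it was trained with.
import torch

state = torch.load('models/conformer/best_model/model.pt', map_location='cpu')
# The CTC output bias has one entry per vocabulary token.
print(state['ctc.ctc_lo.bias'].shape)   # e.g. torch.Size([4448])
```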

2651084156 commented 1 year ago

[screenshot] By the way, can this evaluation script also print the path of each audio file? I'd like to see which files are producing so many errors.

2651084156 commented 1 year ago

[screenshot] Also, roughly how should this parameter relate to the amount of data?

2651084156 commented 1 year ago

Loaded model: models/conformer_offline_fbank/inference.pt
C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\infer_utils\inference_predictor.py:61: UserWarning: FALLBACK path has been taken inside: torch::jit::fuser::cuda::runCudaFusionGroup. This is an indication that codegen Failed for some reason. To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback (Triggered internally at ..\torch\csrc\jit\codegen\cuda\manager.cpp:334.)
  output_data = self.predictor.get_encoder_out(speech=audio_data, speech_lengths=audio_len)
Traceback (most recent call last):
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\infer_path.py", line 81, in <module>
    predict_audio()
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\infer_path.py", line 40, in predict_audio
    result = predictor.predict(audio_data=args.wav_path, use_pun=args.use_pun, is_itn=args.is_itn)
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\predict.py", line 147, in predict
    output_data = self.predictor.predict(input_data, audio_len)[0]
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\infer_utils\inference_predictor.py", line 61, in predict
    output_data = self.predictor.get_encoder_out(speech=audio_data, speech_lengths=audio_len)
torch.jit.Error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/torch/masr/model_utils/conformer/embedding.py", line 23, in get_encoder_out
    max_len0 = self.max_len
    _3 = torch.format(_0, offset, _2, max_len0)
    ops.prim.RaiseException(torch.add("AssertionError: ", _3))

    xscale = self.xscale
    x0 = torch.mul(x, xscale)

Traceback of TorchScript, original code (most recent call last):
  File "C:\Users\autumn\Desktop\poject_all\whisper_pipline\MASR\masr\model_utils\conformer\embedding.py", line 95, in get_encoder_out
            torch.Tensor: Positional embedding tensor (1, time, `*`).
        """
        assert offset + x.shape[
        1] < self.max_len, "offset: {} + x.shape[1]: {} is larger than the max_len: {}".format(
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        offset, x.shape[1], self.max_len)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    x = x * self.xscale
    self.pe = self.pe.to(x.device)

RuntimeError: AssertionError: offset: 0 + x.shape[1]: 5199 is larger than the max_len: 5000

What's going on here? Is the audio too long?
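The assertion says the encoder input produced 5199 positional-embedding frames while the exported model's table only holds max_len = 5000, so the recording is indeed too long for the model as exported. A minimal workaround sketch (not MASR code; the chunk length is an illustrative guess) is to split long recordings before inference and transcribe the pieces separately:

```python
# Split a long recording into fixed-length chunks so each one stays under the
# positional-embedding limit of the exported model.
import soundfile as sf

def split_audio(path, chunk_seconds=30):
    wav, sr = sf.read(path, dtype='float32')   # assumes a mono clip
    step = int(chunk_seconds * sr)
    return [wav[i:i + step] for i in range(0, len(wav), step)], sr

# Each chunk can then be passed to the predictor and the partial texts joined.
chunks, sr = split_audio('long_recording.wav')
print(len(chunks), 'chunks of up to 30 s each')
```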

2651084156 commented 1 year ago

It may be that you are not using the same config file, the same vocabulary, or the same mean/standard-deviation file.

If I change the default sample rate, what other parameters need to change besides the sample rate itself? I find that 16 kHz loses quite a lot of the signal's detail.

2651084156 commented 1 year ago

Also, which preprocessing method do you recommend?

2651084156 commented 1 year ago

For the 3,000-hour dataset, do I need to change the attention dimension?

yeyupiaoling commented 1 year ago

You can change it.

2651084156 commented 1 year ago

You can change it.

If I change the sample rate, say to 32 kHz, do other parts of preprocessing need to change too, for example increasing the number of mel bins?

2651084156 commented 1 year ago

You can change it.

In the 3,000-hour dataset, 1665514239466.opus in the 11th zip and 1665488748116.opus in the 9th zip are corrupted.

yeyupiaoling commented 1 year ago

In the 3,000-hour dataset, 1665514239466.opus in the 11th zip and 1665488748116.opus in the 9th zip are corrupted.

Is that so? Write a filter to skip them.

yeyupiaoling commented 1 year ago

If I change the sample rate, say to 32 kHz, do other parts of preprocessing need to change too, for example increasing the number of mel bins?

I meant changing the model size, not the sample rate.

yeyupiaoling commented 1 year ago

Also, which preprocessing method do you recommend?

Just use the default preprocessing.

2651084156 commented 1 year ago

If I change the sample rate, say to 32 kHz, do other parts of preprocessing need to change too, for example increasing the number of mel bins?

I meant changing the model size, not the sample rate.

The main problem is that at the default 16 kHz some audio comes out hopelessly muddy; 24 kHz sounds better. Also, how should the noise data be used?

yeyupiaoling commented 1 year ago

Hopelessly muddy

What do you mean by that? Most current speech-recognition papers use 16 kHz, so it's best not to change this.

yeyupiaoling commented 1 year ago

Also, how should the noise data be used?

Put the noise WAV files in dataset/audio/noise; when you run create_data, the corresponding list file will be generated.

2651084156 commented 1 year ago

Hopelessly muddy

What do you mean by that? Most current speech-recognition papers use 16 kHz, so it's best not to change this.

I mean that after resampling it sounds muffled, as if the speaker had the microphone in their mouth, to the point that even a human can't tell what's being said. It might also be a resampling issue; resampling to 24 kHz does sound a bit better.
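If the muddiness only appears after resampling, it may be worth checking whether the resampler, rather than the 16 kHz rate itself, is at fault. A small listening test (not MASR code; the filenames are made up, and it assumes a mono clip and that librosa's soxr backend is available) is to resample the same original clip with a different resampler and compare it by ear against the file MASR's change_rate() produced with resampy:

```python
# Resample the original clip with soxr (via librosa) for an A/B listening test
# against the resampy output that MASR generated.
import soundfile as sf
import librosa

wav, sr = sf.read('original_clip.wav', dtype='float32')
out = librosa.resample(wav, orig_sr=sr, target_sr=16000, res_type='soxr_hq')
sf.write('original_clip_soxr_16k.wav', out, 16000)
```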