[transformer] Add moe_noisy_gate

llleohk commented 2 months ago

Added a noisy gate. Experimental results (AISHELL-1, 20 epochs):

decoding mode	Normal Gate	Noisy Gate
ctc_prefix_beam_search	9.60%	8.88%
att_rescoring	8.97%	8.23%

how to use:

xingchensong commented 2 months ago

Please merge main first.

xingchensong commented 2 months ago

If there is a paper link, please post it.

llleohk commented 2 months ago

Sure. The reference is the Google paper: https://arxiv.org/pdf/1701.06538.pdf
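For readers who have not opened the paper, the noisy top-k gating it describes can be sketched as follows. This is a minimal pure-Python illustration, not the code in this PR: the function and argument names (`noisy_top_k_gate`, `w_gate`, `w_noise`, `train`) are made up for the example. Each expert's logit gets Gaussian noise scaled by a learned, softplus-activated term, and the softmax is taken over the top-k logits only.

```python
import math
import random


def softplus(v: float) -> float:
    # Numerically stable log(1 + exp(v)).
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def matvec(x, w):
    # x: list of d floats; w: d rows of n_experts floats. Returns x @ w.
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def noisy_top_k_gate(x, w_gate, w_noise, k, train=True, rng=random):
    """Return one gate weight per expert; only the top-k entries are non-zero."""
    clean = matvec(x, w_gate)
    if train:
        # Per-expert noise scale is softplus(x @ w_noise), as in the paper.
        noise_std = [softplus(v) for v in matvec(x, w_noise)]
        logits = [c + rng.gauss(0.0, 1.0) * s for c, s in zip(clean, noise_std)]
    else:
        # Hypothetical switch: skip the noise entirely when not training.
        logits = clean
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exp = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exp.values())
    # Softmax over the kept logits; every other expert gets weight 0.
    return [exp[i] / z if i in exp else 0.0 for i in range(len(logits))]
```

Calling it with `train=False` gives the deterministic routing, which is the difference between adding noise at decoding time and adding it during training only.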

Mddct commented 2 months ago

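Put the link under the class. I'm curious how it turns out once the full set of epochs has been run.

Does this speed up convergence, or does the final result improve as well?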

llleohk commented 2 months ago

I'll run the full number of epochs and see. In my earlier tests the final result also improved, but at that time the MoE was not applied to the encoder.

Mddct commented 2 months ago

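Model parallelism will be supported later, and MoE will need special handling there. I came across this reference and am posting a screenshot here. ref: https://zhuanlan.zhihu.com/p/681154742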

llleohk commented 2 months ago

Will do, Zhou. I'll look into it.

xingchensong commented 2 months ago

How is it going? Any final results yet?

llleohk commented 2 months ago

The model is still training; the GPUs are a bit slow. Results should be ready tomorrow.

rookie0607 commented 2 months ago

Waiting for the results.

llleohk commented 2 months ago

The results are in. This is AISHELL-1, trained for 100 epochs, with MoE in the encoder:

decoding mode	Normal Gate	Noisy Gate (train and decode)	Noisy Gate (only train)
ctc_prefix_beam_search	5.60%	5.62%	5.62%
att_rescoring	5.23%	5.27%	5.27%

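Judging from these results, adding noise does not seem to help, though I can't rule out that the dataset is simply too small. Adding noise at inference also gives the same result as leaving it out: I did a quick check of gate-output consistency between the two, and it is about 96%. From the loss in the logs, the noisy gate converges faster than the normal gate, but both end up at about the same place. Here is a plot: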

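I also measured the mean standard deviation of the gate outputs over 1000 utterances, and the noise does not seem to provide any load balancing:

Normal_std: 38.99761676367856
Noisy_only_train_std: 41.180588894741426
Noisy_train_decode_std: 41.19181262290304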
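The comment does not say exactly how the consistency or the standard deviation was computed, so the following is only one plausible reading, sketched in pure Python with hypothetical helper names. It assumes the selected expert indices are recorded per frame, that consistency is the fraction of frames routed to the same expert set in two runs, and that the standard deviation is taken over per-expert frame counts within an utterance and then averaged over utterances.

```python
import statistics


def routing_consistency(top_a, top_b):
    """Fraction of frames whose selected expert sets match in both runs."""
    same = sum(1 for a, b in zip(top_a, top_b) if set(a) == set(b))
    return same / len(top_a)


def expert_load_std(top_experts, n_experts):
    """Std of per-expert frame counts for one utterance (0.0 means balanced)."""
    counts = [0] * n_experts
    for chosen in top_experts:
        for e in chosen:
            counts[e] += 1
    return statistics.pstdev(counts)


def mean_load_std(utterances, n_experts):
    """Average the per-utterance load std over a list of utterances."""
    return statistics.fmean([expert_load_std(u, n_experts) for u in utterances])
```

Under this definition a lower value means a more even spread of frames across experts, so the noisy-gate numbers above being slightly higher than the normal-gate number is what suggests no balancing effect.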

llleohk commented 2 months ago

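I tested noisy MoE in the decoder. As with large language models, it seems to perform better when applied to the decoder.

I don't have enough GPU memory to run MoE in both the encoder and the decoder, so I'll have to ask others to verify that configuration.

model	ctc_prefix_beam_search	att_rescoring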
U2++-baseline	5.80%	5.06%
Normal Gate-Encoder	5.60%	5.23%
Noisy Gate(decode)-Encoder	5.62%	5.27%
Noisy Gate(only train)-Encoder	5.62%	5.27%
Normal Gate-Decoder	5.83%	5.07%
Noisy Gate(decode)-Decoder	5.77%	4.99%
Noisy Gate(only train)-Decoder	5.77%	4.99%

fclearner commented 1 month ago

Why does it work better in the decoder?

llleohk commented 1 month ago

My personal feeling is that, with a small dataset, adding noise to the encoder MoE may make training more balanced across experts, but the experts are then hard to train sufficiently, so the result is worse.

I'm also trying MoE in only the last few layers to see how that performs.

llleohk commented 1 month ago

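An update with results for the method Zhou posted. The number of encoder experts needs to be chosen according to the amount of data; too much sparsity hurts performance.

model	ctc_prefix_beam_search	att_rescoring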
U2++-baseline	5.80%	5.06%
Normal Gate-Encoder	5.60%	5.23%
Noisy Gate-Encoder	5.62%	5.27%
Normal Gate-Decoder	5.83%	5.07%
Noisy Gate-Decoder	5.77%	4.99%
mask Noisy Gate(4experts)-Encoder	5.46%	5.06%
mask Noisy Gate(8experts)-Encoder	5.82%	5.40%
mask Noisy Gate(4experts)-Decoder	5.85%	5.09%
mask Noisy Gate(8experts)-Decoder	5.76%	5.04%

MXuer commented 1 month ago

Two questions:

Is the CER decoding above streaming or non-streaming?
In the latest comment, the U2++ baseline: attention rescoring reaches 4.63% in the AISHELL README. Is your number higher because you only trained for 100 epochs?

Thanks.

llleohk commented 1 month ago

The CER was decoded in non-streaming mode. I can test streaming as well if you need it.
My U2++ baseline is not fully aligned with the training settings in the AISHELL README: I used 4 GPUs with a batch size of 8 and trained for 100 epochs, and I set average_num to 5 for decoding.
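As a reminder of what `average_num` controls, here is a minimal sketch of checkpoint averaging. It assumes that `average_num` is the number of saved checkpoints whose parameters are averaged element-wise before decoding; the function name and the dict-of-lists checkpoint layout are hypothetical and only stand in for real model state dicts.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of floats.
    """
    n = len(checkpoints)
    names = checkpoints[0].keys()
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }
```

With `average_num` set to 5, five checkpoints would be passed in and a single averaged parameter set would be used for decoding.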

MXuer commented 1 month ago

No need to test streaming; I just wanted to know which decoding strategy was used. Thanks for the answer and for sharing.

wenet-e2e / wenet

[transformer] Add moe_noisy_gate #2495