Mddct commented 7 months ago

Current status

deepspeed https://github.com/wenet-e2e/wenet/pull/2055
Chinese Paraformer and multilingual Whisper: https://github.com/wenet-e2e/wenet/pull/2139 @xingchensong https://github.com/wenet-e2e/wenet/pull/2141
The code is concise and easy to customize (DIY).
With a decoder-only LLM, the code is almost identical.
There is a LoRA PR: https://github.com/wenet-e2e/wenet/pull/2049

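Large speech models are one direction (https://github.com/wenet-e2e/wenet/issues/2097). Another is combining speech with LLMs. Papers on the latter keep increasing, but there is no simple repo that unifies speech and LLMs and is easy to customize and do research with.

So here is an idea: integrate an LLM, such as Llama, into wenet.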

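Principles

All data, models, and code will be open source. Contributions are welcome: contribute data if you have data, opinions if you have opinions, machines if you have machines. Let's build this together.
Build and analyze as we go.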

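Goals

Based on the current/future large speech model (https://github.com/wenet-e2e/wenet/issues/2097/) plus an LLM (xxx), build an Audio+LLM speech-language large model that unifies all speech tasks and has multi-round speech chat ability.
Accumulate data for all speech tasks and construct audio instruct/prompt data.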

Action

Data

[ ] https://github.com/wenet-e2e/wenet/issues/2097#issue-1971787771
[ ] Construct speech instruct/prompt data

Training

[x] FSDP https://github.com/wenet-e2e/wenet/pull/2412
[ ] convert xxxllm to wenet format
[ ] Solve speech hotword/ITN issues in the LLM-based setup
[ ] generate/chat
[x] Tokenizer refactoring: https://github.com/wenet-e2e/wenet/issues/2142#issuecomment-1813736407 @Mddct
[x] New IO, supporting various flexible inputs: https://github.com/wenet-e2e/wenet/pull/2316
[x] Flash attention: https://github.com/wenet-e2e/wenet/pull/2351
[ ] adapter/lora
[ ] Fusion approach: extended vocabulary + embedding
[ ] multi task https://github.com/QwenLM/Qwen-Audio (p0)
[ ] https://github.com/wenet-e2e/wenet/issues/2097#issue-1971787771
[x] Add Paraformer support to wenet (currently the best Chinese model; can serve as the Chinese speech base model) https://github.com/wenet-e2e/wenet/pull/2067
[ ] generate tokens and can be used by speech generation (translation/tts)
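
The "extended vocabulary + embedding" item in the list above can be pictured roughly as follows. This is a minimal sketch in plain Python, not wenet or Hugging Face code: the function `extend_vocab`, the `<audio_N>` token names, and the list-of-lists embedding table are all assumptions made up for illustration.

```python
import random


def extend_vocab(vocab, embeddings, new_tokens, seed=0):
    """Append new tokens to a vocab and grow the embedding table to match.

    vocab:      dict mapping token -> id (ids are 0..len(vocab)-1)
    embeddings: list of rows, one row of floats per token id
    new_tokens: tokens to add (e.g. discrete speech units); hypothetical names
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    vocab = dict(vocab)                          # do not mutate the caller's vocab
    embeddings = [row[:] for row in embeddings]  # copy the pretrained rows
    for tok in new_tokens:
        if tok in vocab:                         # already known: keep its old row
            continue
        vocab[tok] = len(vocab)                  # next free id
        # New rows get small random values; the pretrained rows are untouched.
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


text_vocab = {"<pad>": 0, "hello": 1, "world": 2}
text_emb = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
audio_tokens = ["<audio_0>", "<audio_1>", "hello"]   # "hello" is a duplicate

vocab, emb = extend_vocab(text_vocab, text_emb, audio_tokens)
print(len(vocab), len(emb))   # vocabulary and table grow together
print(vocab["<audio_1>"])
```

In a real model the output projection would have to grow in the same way, and how the new rows are initialized is a design choice this sketch does not settle.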

Deployment

wenet.cpp (speech.cpp+xxxllm.cpp)
[ ] int4 quantization to reduce bandwidth requirements
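
To make the bandwidth argument for int4 concrete, here is a minimal sketch of symmetric per-tensor int4 quantization with two values packed per byte. It is an illustration only, in plain Python; the helper names `quantize_int4` and `dequantize_int4` are hypothetical and do not describe how wenet.cpp would implement it.

```python
def quantize_int4(weights):
    """Symmetric per-tensor int4 quantization: returns (packed bytes, scale, n)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0                       # map [-max_abs, max_abs] to [-7, 7]
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    if len(q) % 2:                              # pad to an even count for packing
        q.append(0)
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Store each value as a 4-bit two's-complement nibble, two per byte.
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed), scale, len(weights)


def dequantize_int4(packed, scale, n):
    """Unpack nibbles, sign-extend them, and rescale to floats."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            value = nibble - 16 if nibble >= 8 else nibble
            out.append(value * scale)
    return out[:n]


weights = [0.7, -0.35, 0.1, -0.7, 0.0]
packed, scale, n = quantize_int4(weights)
restored = dequantize_int4(packed, scale, n)
print(len(packed), "bytes instead of", 4 * len(weights), "for float32")
print([round(x, 2) for x in restored])
```

Each weight costs half a byte plus a shared scale, at the price of a rounding error of at most half a quantization step per weight.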

Currently feasible approaches:

https://github.com/salesforce/BLIP
https://github.com/QwenLM/Qwen-Audio
https://google-research.github.io/seanet/audiopalm/examples/
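etc. Characteristics: these mainly rely on fine-tuning, which does not need much data, and the methods are similar. A base is needed: Llama + Whisper + tuning.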

robin1001 commented 7 months ago

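Sounds good. Our earlier thinking was:

1. Scale up the model: simple and brute-force, bigger is stronger.
2. LLM-based: stand on the shoulders of giants.
   Bring LLM methods directly into speech tasks, so that the speech model directly gains understanding ability.

What we are working on now is 1, building the ecosystem and infrastructure for it. For 2, many papers have indeed appeared and it is a new research hotspot. Our earlier thinking was that resources are limited, so we would do 1 first. 1 and 2 do not fundamentally conflict, so if the community has the resources, we can pursue both.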

xingchensong commented 7 months ago

Brother Zhou, could you draft a plan for this route? We'll work for you.

Mddct commented 7 months ago

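You're too modest. I'm the one working for you.

Let me first survey the latest work in this area and see whether anything common can be extracted, then write a TODO (I don't have enough expertise yet).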

xingchensong commented 7 months ago

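One suggestion: the tokenizer may need refactoring. There are currently two modes, a pure-vocabulary mode and a BPE mode. In the future there will certainly be an LLM-compatible mode as well, which makes three modes, so the code needs to be restructured.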
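
One way to picture such a refactoring is a single abstract interface with one subclass per mode. The sketch below is only an assumption about what that could look like: the names `BaseTokenizer` and `CharTokenizer` and the method signatures are hypothetical, and only the pure-vocabulary mode is filled in.

```python
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Common interface so training code does not care which mode is used."""

    @abstractmethod
    def tokenize(self, text):
        """Return (pieces, ids) for a line of text."""

    @abstractmethod
    def detokenize(self, ids):
        """Return (text, pieces) for a list of ids."""

    @abstractmethod
    def vocab_size(self):
        """Number of entries in the vocabulary."""


class CharTokenizer(BaseTokenizer):
    """Pure-vocabulary mode: one symbol per character, looked up in a table."""

    def __init__(self, symbol_table, unk="<unk>"):
        self.sym2id = dict(symbol_table)
        self.id2sym = {i: s for s, i in self.sym2id.items()}
        self.unk = unk

    def tokenize(self, text):
        pieces = [ch if ch in self.sym2id else self.unk for ch in text]
        return pieces, [self.sym2id[p] for p in pieces]

    def detokenize(self, ids):
        pieces = [self.id2sym[i] for i in ids]
        return "".join(pieces), pieces

    def vocab_size(self):
        return len(self.sym2id)


# A BPE mode or an LLM mode would be further subclasses that wrap, for example,
# a subword model or an LLM tokenizer behind the same three methods.
tokenizer = CharTokenizer({"<unk>": 0, "你": 1, "好": 2})
pieces, ids = tokenizer.tokenize("你好吗")
print(pieces, ids)
```

With this shape, the data pipeline only ever calls `tokenize` and `detokenize`, so adding a third mode would not touch the training code.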

Mddct commented 7 months ago

Step 1: support llama2 in wenet, adhering to the principle of maximizing reuse of wenet code.

Features

[ ] [Parameter conversion]
- [ ] [Hugging Face to wenet]
- [ ] [wenet to Hugging Face]
[x] [New IO and data loading]
[ ] [Model architecture]
- [ ] [Dropout]
- [x] [RMS Norm]
- [x] [Embedding]
- [x] [Rotary embedding]
- [x] [Attention]
- [x] [Decoder block]
- [x] [Decoder]
- [ ] [Llama Model]
- [ ] [Llama]
[ ] [Cross entropy loss]
[ ] Training
- [x] Data parallelism
- [ ] [Model parallelism]
- [ ] Other parallelisation schemes
[ ] Generation/Chat
- [x] [KV cache]
- [ ] Left padding
- [ ] [Presence penalty]
- [ ] Frequency penalty
- [ ] Beam search
- [ ] Beam sampling
- [ ] Top-k sampling
- [ ] Top-p sampling
[ ] [Documentation]
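
Two of the unchecked generation items in the list above, top-k and top-p sampling, are small enough to sketch. The following is a plain-Python illustration over a list of logits, not the planned wenet implementation. The function names are hypothetical, and a real version would operate on batched tensors.

```python
import math
import random


def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Zero out tokens outside the top-k / nucleus set and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]            # keep only the k most likely tokens
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:          # smallest prefix whose mass reaches top_p
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample(logits, top_k=0, top_p=1.0, rng=random):
    probs = top_k_top_p_filter(softmax(logits), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
print(top_k_top_p_filter(softmax(logits), top_k=2))
print(sample(logits, top_k=3, top_p=0.9, rng=random.Random(0)))
```

Setting `top_k=1` reduces this to greedy decoding, which is a convenient sanity check.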

xingchensong commented 7 months ago

I'm wondering whether we could just import transformers directly, versus reimplementing everything ourselves. What are the pros and cons of each?

Mddct commented 7 months ago

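As a first step, let's simply import transformers and see later what problems come up. The list above is just parked there for now.

The downsides: it is hard to hack on. For example, Alibaba's Tongyi audio model (Qwen-Audio) has model parallelism. Hugging Face's wrapping is too thick: when fine-tuning an audio LLM, any change to the LLM has to be made inside the Hugging Face code. Also, inputs and outputs have to conform to the Hugging Face interfaces.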

robin1001 commented 7 months ago

+1. For text LLM support, I think we should just import transformers directly. No need to reinvent the wheel.

xingchensong commented 7 months ago

https://github.com/espnet/espnet/pull/4099 This might be a reference for integrating Hugging Face.

Mddct commented 7 months ago

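Hugging Face LLM models almost all follow the pseudocode pattern below:

from transformers import AutoModelForCausalLM, AutoTokenizer

# model_path is intentionally left unspecified
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.add_special_tokens(...)

# the dataset also takes care of computing the masks
dataset = ...

model = AutoModelForCausalLM.from_pretrained(model_path)

# build the input from speech ids or embeddings plus text, including the
# attention mask, and feed it to the model
output = model(...)

# compute the loss
loss = ...

# base models: model.generate(...)
# chat models: model.chat(...), where the model's own code provides it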
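
The "dataset computes the masks" and "speech ids + text" steps can be sketched without any framework. The example below is an assumption-laden illustration in plain Python: `build_example` and the token ids are made up, the speech part is represented as discrete ids rather than embeddings, and the label value -100 is used as the position to ignore, on the assumption that the loss skips that index and shifts labels internally, as Hugging Face causal LM heads do.

```python
IGNORE_INDEX = -100  # label value that the loss is assumed to skip


def build_example(speech_ids, text_ids, max_len, pad_id=0):
    """Concatenate speech and text tokens into one causal-LM training example.

    Returns input_ids, attention_mask (1 = real token, 0 = padding) and
    labels in which only the text positions contribute to the loss.
    """
    input_ids = list(speech_ids) + list(text_ids)
    labels = [IGNORE_INDEX] * len(speech_ids) + list(text_ids)
    input_ids, labels = input_ids[:max_len], labels[:max_len]
    attention_mask = [1] * len(input_ids)

    pad = max_len - len(input_ids)           # right-pad up to max_len
    input_ids += [pad_id] * pad
    attention_mask += [0] * pad
    labels += [IGNORE_INDEX] * pad
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels}


example = build_example(speech_ids=[901, 902, 903], text_ids=[11, 12], max_len=7)
for name, values in example.items():
    print(name, values)
```

Masking the speech positions in `labels` means the model is trained only to predict the text that follows the audio, not to reproduce the audio tokens.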

xingchensong commented 7 months ago

This might be a reference for refactoring tokenizer https://github.com/espnet/espnet/tree/master/espnet2/text

Mddct commented 7 months ago

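There is a work from computer vision: https://arxiv.org/pdf/2311.03079.pdf

Overall, this approach is similar to Tongyi audio. The difference is that it adds cross attention to the LLM, which means modifying the LLM itself.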

xingchensong commented 7 months ago

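For cases that require modifications, could we do it the following way:

import torch.nn as nn
from transformers import XXXModelForCausalLM  # placeholder for a concrete Hugging Face causal LM class
from wenet.transformer.asr_model import ASRModel

# nn.Module is not listed as a base: both parents already derive from it, and
# listing it first would make the method resolution order inconsistent.
class NewModel(ASRModel, XXXModelForCausalLM):
    def __init__(self, *args, **kwargs):
        # init the parent classes (how arguments are split between them is left open)
        super().__init__(*args, **kwargs)
        # add new members if needed, e.g.,
        self.new_member = nn.Identity()

    def forward(self, *args, **kwargs):
        # override the parent implementation
        pass

    # override other functions if needed, e.g., one from XXXModelForCausalLM
    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        pass

    # override other functions if needed, e.g., one from ASRModel
    def _calc_att_loss(self, *args, **kwargs):
        pass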
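
How Python resolves methods under this kind of multiple inheritance can be checked with stub classes. In the sketch below, `Module`, `SpeechModel`, and `CausalLM` are stand-ins invented for illustration (for nn.Module, the wenet ASR model, and a Hugging Face causal LM class respectively), on the assumption that both parents derive from the same module base class. Nothing here touches the real libraries.

```python
class Module:
    """Stand-in for nn.Module."""
    def __init__(self):
        self.initialized = ["Module"]


class SpeechModel(Module):
    """Stand-in for the wenet ASR model."""
    def __init__(self):
        super().__init__()
        self.initialized.append("SpeechModel")

    def forward(self, x):
        return "speech:" + x


class CausalLM(Module):
    """Stand-in for a Hugging Face causal LM class."""
    def __init__(self):
        super().__init__()
        self.initialized.append("CausalLM")

    def forward(self, x):
        return "llm:" + x

    def generate(self, x):
        return self.forward(x) + ":generated"


class NewModel(SpeechModel, CausalLM):
    def forward(self, x):
        # Override the parents, delegating explicitly to the one we want.
        return CausalLM.forward(self, SpeechModel.forward(self, x))


model = NewModel()
print([c.__name__ for c in NewModel.__mro__])
print(model.initialized)
print(model.generate("wav"))

try:
    # Listing a base class before its own subclass is rejected by Python.
    type("Broken", (Module, SpeechModel, CausalLM), {})
except TypeError as err:
    print("TypeError:", err)
```

The inherited `generate` picks up the overridden `forward` automatically, which is the behavior the proposal relies on. The real classes take different constructor arguments, so a cooperative `super().__init__()` chain would need more care there than in these stubs.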

Mddct commented 7 months ago

+1, I also lean toward this approach.

Mddct commented 7 months ago

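Gemini is the multimodal model Google released recently, and it supports speech input.

The report mentions that speech input still goes through "USM" before being fed into the LLM, which is pretrained from scratch.

(NOTE: this differs from image patches, which are fed in directly as patches without a pretrained encoder such as ViT.)

This form is similar to implementations such as Tongyi (the difference being Tongyi's multi-task setup). Personally, I think we could build a code framework for this kind of implementation.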

@robin1001 @xingchensong

TODO:

[ ] add_special tokens https://github.com/wenet-e2e/wenet/pull/2186
[ ] load huggingface model
[ ] adapter
[ ] load encoder from Whisper https://github.com/wenet-e2e/wenet/pull/2141
[x] IO supporting multiple input forms https://github.com/wenet-e2e/wenet/issues/2152

Once the above is implemented, we can load the open-source Tongyi audio model even without training.

Mddct commented 5 months ago

https://arxiv.org/abs/2402.01831

wenet-e2e / wenet

[feats/llm] LLM integration in the context of large speech models #2142
