AISHELL reproduction results do not match the README

brightLLer commented 3 weeks ago

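Hi everyone. We reproduced the Whisper large-v3 + Qwen2 7B experiment on AISHELL-1, but the model output shows obvious "repetition" (the last few characters are repeated many times), and it also contains punctuation and special symbols. Raising the LLM's repetition_penalty at inference time reduced the repetition, but even after removing all punctuation the character error rate is still above 11%, far from the 5.55% reported in README.md. The Whisper feature extraction in the code uses 80-dimensional features, so we added an n_mels=128 argument to support large-v3. Our training command is: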

torchrun --standalone --nnodes=1 --nproc_per_node=8 train.py \
        --llm_model_name_or_path Qwen2-7B-Instruct \
        --whisper_model_name_or_path whisper/large-v3.pt \
        --data_path aishell/train/train.jsonl \
        --eval_data_path aishell/dev/eval.jsonl \
        --bf16 True \
        --output_dir Qwen-7B-Instruct-whisper-large-v3-aishell \
        --num_train_epochs 10 \
        --per_device_train_batch_size 16 \
        --per_device_eval_batch_size 8 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "steps" \
        --save_steps 100 \
        --save_total_limit 10 \
        --learning_rate 3e-4 \
        --weight_decay 0.01 \
        --adam_beta2 0.95 \
        --warmup_ratio 0.01 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --report_to "none" \
        --model_max_length 512 \
        --n_mels 128 \
        --gradient_checkpointing \
        --dataloader_num_workers 4 \
        --dataloader_prefetch_factor 10 \
        --deepspeed ds_config_zero3.json
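
The scoring step described above (strip all punctuation, then compute the character error rate) can be sketched as follows. This is a minimal illustration, not the scoring script used by the repository; the helper names strip_punctuation, edit_distance, and cer are made up here, and treating every Unicode punctuation, symbol, and separator character as removable is an assumption.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop punctuation, symbols, and spaces; keep letters and digits."""
    kept = []
    for ch in text:
        # Unicode major classes: P = punctuation, S = symbol, Z = separator.
        if unicodedata.category(ch)[0] in ("P", "S", "Z"):
            continue
        kept.append(ch)
    return "".join(kept)


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER of one utterance after punctuation removal on both sides."""
    ref_n, hyp_n = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref_n:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref_n, hyp_n) / len(ref_n)


# Punctuation alone should not count as an error.
print(cer("今天天气很好", "今天，天气很好！"))
# A repeated tail shows up as insertions and inflates the CER.
print(cer("今天天气很好", "今天天气很好很好很好"))
```

For a corpus-level figure, the usual convention is to sum the edit distances over all utterances and divide by the total number of reference characters, rather than averaging per-utterance rates.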

robin1001 commented 3 weeks ago

Try running with the default configuration given in the README first, and evaluate several intermediate checkpoints.
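
When choosing which intermediate checkpoints to evaluate, it may help to estimate how many optimizer steps the command in the first post produces. The sketch below is rough arithmetic only. It assumes the AISHELL-1 training set has 120,098 utterances (not stated in this thread) and that train.py follows the usual Hugging Face Trainer conventions for global batch size and warmup; the function name training_schedule is made up for illustration.

```python
import math


def training_schedule(num_utts, num_gpus, per_device_batch, grad_accum,
                      epochs, save_steps, warmup_ratio):
    """Estimate optimizer steps, warmup steps, and checkpoint steps."""
    global_batch = num_gpus * per_device_batch * grad_accum
    # Approximation: the trailing partial batch of each epoch is ignored.
    steps_per_epoch = num_utts // global_batch
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    checkpoints = list(range(save_steps, total_steps + 1, save_steps))
    return {
        "global_batch": global_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
        "checkpoint_steps": checkpoints,
    }


schedule = training_schedule(
    num_utts=120098,      # assumed AISHELL-1 training-set size
    num_gpus=8,           # --nproc_per_node=8
    per_device_batch=16,  # --per_device_train_batch_size 16
    grad_accum=8,         # --gradient_accumulation_steps 8
    epochs=10,            # --num_train_epochs 10
    save_steps=100,       # --save_steps 100
    warmup_ratio=0.01,    # --warmup_ratio 0.01
)
print(schedule)
```

Under these assumptions the run is only about 1,170 optimizer steps with a global batch of 1,024 utterances, so checkpoints land roughly every 0.85 epochs, and --save_total_limit 10 would discard the earliest one.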

brightLLer commented 3 weeks ago

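> Try running with the default configuration given in the README first, and evaluate several intermediate checkpoints.

The configuration in the README is for the 1.5B model. Does the same configuration apply to 7B? We reran the experiment with the README configuration, but the results are worse than with the configuration in my original post: the model's output is now complete nonsense... o(╥﹏╥)o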

robin1001 commented 3 weeks ago

You can use the largest of both, the 7B LLM and Whisper large, to see what the upper bound is.

brightLLer commented 3 weeks ago

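> You can use the largest of both, the 7B LLM and Whisper large, to see what the upper bound is.

That is exactly what we ran: the two largest models, trained for 10 epochs. The loss dropped very low and basically matches the curve in the README, but the WER only gets down to about 11%. Most of the errors are insertions and substitutions, which looks like hallucination from the LLM itself...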
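
To check whether insertions and substitutions really dominate, the errors can be split by type with a standard edit-distance alignment. This is a minimal sketch; the function name error_breakdown is hypothetical, and when several optimal alignments exist the split between error types depends on the tie-breaking order used in the backtrace.

```python
def error_breakdown(ref: str, hyp: str) -> dict:
    """Count correct, substituted, inserted, and deleted characters."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion

    # Walk back from the end, preferring diagonal moves on ties.
    counts = {"cor": 0, "sub": 0, "ins": 0, "del": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["cor" if cost == 0 else "sub"] += 1
                i -= 1
                j -= 1
                continue
        if j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            counts["ins"] += 1
            j -= 1
        else:
            counts["del"] += 1
            i -= 1
    return counts


# A repeated tail is counted entirely as insertions.
print(error_breakdown("今天天气很好", "今天天气很好很好很好"))
```

Summing these counts over the test set, after the same punctuation removal used for scoring, would show how much of the 11% comes from repeated tails (insertions) versus genuinely wrong characters (substitutions).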

KIP1024 commented 3 weeks ago

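> > You can use the largest of both, the 7B LLM and Whisper large, to see what the upper bound is.
>
> That is exactly what we ran: the two largest models, trained for 10 epochs. The loss dropped very low and basically matches the curve in the README, but the WER only gets down to about 11%. Most of the errors are insertions and substitutions, which looks like hallucination from the LLM itself...

Hello, may I ask what environment you used for training? I am not sure what GPUs and training procedure are needed. My QQ number is 1147893880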

robin1001 commented 3 weeks ago

Eight 3090 GPUs.

KIP1024 commented 3 weeks ago

> Eight 3090 GPUs.

OK, got it. Thanks a lot!

KIP1024 commented 1 week ago

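> Eight 3090 GPUs.

Brother Bin, our lab has a single Tesla A100-40G card. Is that enough to work with Qwen2 7B? We plan to fine-tune it as a language model for our own scenario and then combine it with an acoustic model for ASR.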

wenet-e2e / west

AISHELL reproduction results do not match the README #4