Describe the bug
Following the PR that adds whisper feature extraction https://github.com/wenet-e2e/wenet/pull/2320. The transcript produced in that PR contains lots of <|notimestampes|>.
This is because i exported whisper tokenizers with 99 languages, but for whisper large v3, tokenizer it's 100 langs. And the token corresponding to the extra output in the transcript is actually <|nospeech|>, which is used as the blank token in CTC. Fixing the token list does produce reasonable output, but still significantly worse than the one produced by transcribe.py that simulates a streaming inference.
Describe the bug Following the PR that adds whisper feature extraction https://github.com/wenet-e2e/wenet/pull/2320. The transcript produced in that PR contains lots of <|notimestampes|>. This is because i exported whisper tokenizers with 99 languages, but for whisper large v3, tokenizer it's 100 langs. And the token corresponding to the extra output in the transcript is actually <|nospeech|>, which is used as the blank token in CTC. Fixing the token list does produce reasonable output, but still significantly worse than the one produced by transcribe.py that simulates a streaming inference.
I think there is still some mismatch in the decoder part, at least It's not possible to pass in blank token id through CLI. For ctc prefix beam search, there is a field in the config https://github.com/wenet-e2e/wenet/blob/4c81459f9bf12bc0d9b2c2381b3d003c764bdb1a/runtime/core/decoder/ctc_prefix_beam_search.h#L30, but not passed in through CLI https://github.com/wenet-e2e/wenet/blob/4c81459f9bf12bc0d9b2c2381b3d003c764bdb1a/runtime/core/decoder/params.h#L75
and for wfst decoder, it assume the first token is blank. https://github.com/wenet-e2e/wenet/blob/4c81459f9bf12bc0d9b2c2381b3d003c764bdb1a/runtime/core/decoder/ctc_wfst_beam_search.cc#L80
To Reproduce As described above
Expected behavior CLI can handle blank tokens for different models, not just the case where the first token is blank.
[Bonus] decoder main produces the same result as transcribe.py for streaming finetuned whisper trained with hybrid ctc+attention loss
Screenshots If applicable, add screenshots to help explain your problem.
Desktop (please complete the following information):
Smartphone (please complete the following information):
Additional context Add any other context about the problem here.