Blank token id hard coded in C++ decoders

zhr1201 commented 10 months ago

Describe the bug Following the PR that adds whisper feature extraction https://github.com/wenet-e2e/wenet/pull/2320. The transcript produced in that PR contains lots of <|notimestampes|>. This is because i exported whisper tokenizers with 99 languages, but for whisper large v3, tokenizer it's 100 langs. And the token corresponding to the extra output in the transcript is actually <|nospeech|>, which is used as the blank token in CTC. Fixing the token list does produce reasonable output, but still significantly worse than the one produced by transcribe.py that simulates a streaming inference.

I think there is still some mismatch in the decoder part, at least It's not possible to pass in blank token id through CLI. For ctc prefix beam search, there is a field in the config https://github.com/wenet-e2e/wenet/blob/4c81459f9bf12bc0d9b2c2381b3d003c764bdb1a/runtime/core/decoder/ctc_prefix_beam_search.h#L30, but not passed in through CLI https://github.com/wenet-e2e/wenet/blob/4c81459f9bf12bc0d9b2c2381b3d003c764bdb1a/runtime/core/decoder/params.h#L75

and for wfst decoder, it assume the first token is blank. https://github.com/wenet-e2e/wenet/blob/4c81459f9bf12bc0d9b2c2381b3d003c764bdb1a/runtime/core/decoder/ctc_wfst_beam_search.cc#L80

To Reproduce As described above

Expected behavior CLI can handle blank tokens for different models, not just the case where the first token is blank.

[Bonus] decoder main produces the same result as transcribe.py for streaming finetuned whisper trained with hybrid ctc+attention loss

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

OS: [e.g. iOS]
Browser [e.g. chrome, safari]
Version [e.g. 22]

Smartphone (please complete the following information):

Device: [e.g. iPhone6]
OS: [e.g. iOS8.1]
Browser [e.g. stock browser, safari]
Version [e.g. 22]

Additional context Add any other context about the problem here.

robin1001 commented 10 months ago

Yes, it's by design and it works fine before. It is a problem for whisper pretrained model since there is no blank in whisper tokenizer.

robin1001 commented 10 months ago

@xingchensong any comment?

xingchensong commented 10 months ago

we should make it configurable like this :

wenet-e2e / wenet

Blank token id hard coded in C++ decoders #2329