Decoding Tokens added by the user for Whisper models

aravindMahadevan commented 2 weeks ago

Feature request

Support decoding user-defined tokens that are appended to the end of the tokenizer's vocabulary for Whisper-based models. This requires changing the condition of the if statement in _decode_asr.

Motivation

The goal is feature parity with tokenizers.decode, which already decodes user-added tokens correctly.

Your contribution

To support this feature, we only need to change the condition of the if statement in _decode_asr from token >= timestamp_begin to token >= timestamp_begin && token <= timestamp_end, where timestamp_end = this.model.convert_tokens_to_ids(["<|30.00|>"])[0].

Why this should work:

When a user adds a new token to the tokenizer, it is placed at the end of the tokenizer's vocabulary. In whisper-tiny, whisper-tiny.en, whisper-small.en, whisper-small, whisper-base, whisper-base.en, whisper-large, and whisper-large-v2, the last 1501 vocab tokens are the timestamp tokens from "<|0.00|>" to "<|30.00|>" in 0.02-second steps. With only the lower bound token >= timestamp_begin, any token added after them also passes the check and is treated as a timestamp. Bounding the condition to token >= timestamp_begin && token <= timestamp_end makes it evaluate to false for user-added tokens, so they fall through to the else block and are decoded as regular tokens.
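
A minimal sketch of the bounded check, to make the reasoning above concrete. This is not the actual _decode_asr code: isTimestampToken and splitTokens are hypothetical helpers, and the two IDs are assumed values for a multilingual Whisper vocabulary rather than values read from a real tokenizer.

```javascript
// Assumed IDs for illustration only; in practice they would come from the tokenizer.
const TIMESTAMP_BEGIN = 50364; // assumed ID of "<|0.00|>"
const TIMESTAMP_END = 51864;   // assumed ID of "<|30.00|>"

// Hypothetical helper: true only for IDs inside the timestamp range.
// The upper bound is the proposed change; without it, every ID at or above
// timestampBegin (including user-added tokens) would count as a timestamp.
function isTimestampToken(token, timestampBegin, timestampEnd) {
    return token >= timestampBegin && token <= timestampEnd;
}

// Hypothetical helper: route each token ID to the timestamp or regular branch.
function splitTokens(tokens, timestampBegin, timestampEnd) {
    const timestamps = [];
    const regular = [];
    for (const token of tokens) {
        if (isTimestampToken(token, timestampBegin, timestampEnd)) {
            timestamps.push(token);
        } else {
            regular.push(token);
        }
    }
    return { timestamps, regular };
}

// A user-added token lands just past the last timestamp token.
const userAddedToken = TIMESTAMP_END + 1;
const result = splitTokens(
    [TIMESTAMP_BEGIN, 1000, userAddedToken, TIMESTAMP_END],
    TIMESTAMP_BEGIN,
    TIMESTAMP_END
);
console.log(result);
```

With the upper bound in place, the user-added ID ends up in the regular list alongside the ordinary text token, while both boundary timestamp IDs stay in the timestamp list.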

xenova commented 2 weeks ago

Good spot! Feel free to submit a PR for this. Thanks! 🤗

aravindMahadevan commented 2 weeks ago

@xenova submitted the PR with a fix!

xenova / transformers.js