xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0

Fixes Issue #803 (#804)

Open · aravindMahadevan opened 2 weeks ago

aravindMahadevan commented 2 weeks ago

Modify the `_decode_asr` method to support decoding user-defined tokens in Whisper-based models. Addresses the issue in #803.
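For readers following along, a minimal sketch of the idea behind the change (variable and function names here are illustrative, not the actual internals of `_decode_asr`): a token is only treated as a timestamp when it falls inside the timestamp id range, so user-defined tokens added beyond that range decode as plain text instead.

```js
// Illustrative sketch only; not the actual _decode_asr implementation.
// Assumes timestamp tokens occupy a contiguous id range [timestamp_begin, timestamp_end].
const time_precision = 0.02;                                    // seconds per timestamp step
const timestamp_begin = 50364;                                  // id of <|0.00|> (multilingual models)
const timestamp_end = timestamp_begin + 30.0 / time_precision;  // id of <|30.00|>

function isTimestampToken(token_id) {
  // User-defined tokens added after <|30.00|> have ids greater than timestamp_end,
  // so they fall through and are decoded as ordinary text rather than timestamps.
  return token_id >= timestamp_begin && token_id <= timestamp_end;
}
```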

aravindMahadevan commented 2 weeks ago

@xenova I've committed the fix, let me know if there is anything else needed!

xenova commented 1 week ago

Thanks! Will merge after tests pass. By the way, do you have an example of a Whisper model with such tokens? Might be good to add a test.

HuggingFaceDocBuilderDev commented 1 week ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

aravindMahadevan commented 1 week ago

Hi @xenova, there are two tests failing because a few test models do not have the `<|30.00|>` token. Here's one such example: https://huggingface.co/Xenova/whisper-small/resolve/output_attentions/tokenizer.json.

Do you have any suggestions on how to overcome this issue? The token does exist in all of the base Whisper models.

xenova commented 1 week ago

We could probably use the time precision (0.02) to calculate the offset: 30/0.02 + 1 = 1501 tokens (50364 -> 51864). Another fix is to simply update the tokenizer.json.
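A rough sketch of that calculation (variable names are illustrative; the multilingual `<|0.00|>` id of 50364 is taken from the numbers above):

```js
// Sketch: derive the last timestamp token id from the time precision
// instead of requiring an explicit <|30.00|> entry in tokenizer.json.
const time_precision = 0.02;    // seconds per timestamp token
const max_timestamp = 30.0;     // seconds covered by one Whisper chunk
const timestamp_begin = 50364;  // id of <|0.00|> in multilingual models

// 30 / 0.02 = 1500 steps, i.e. 1501 timestamp tokens including <|0.00|> itself.
const timestamp_end = timestamp_begin + max_timestamp / time_precision; // 50364 + 1500 = 51864
```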

Also, can you provide a model which does have added tokens after the final timestamp token?

aravindMahadevan commented 1 week ago

Hi @xenova, that was a good suggestion and I have updated the logic to the following: `const total_timestamp_tokens = (30.00 - 0.00) / 0.02` and `const timestamp_end = timestamp_begin + total_timestamp_tokens`.

This logic should work for both the English-only and multilingual Whisper variants. The beginning timestamp of 0.00 corresponds to token id 50363 in the English-only variants and 50364 in the multilingual variants. Similarly, the final timestamp of 30.00 corresponds to token ids 51863 and 51864, respectively. In both cases, the final timestamp token id sits at an offset of 1500, which is exactly what `total_timestamp_tokens` evaluates to.
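A quick sanity check of those offsets (token ids are the ones quoted above, assuming the 0.02 s time precision):

```js
// Sanity check of the offsets described above (token ids as quoted in this thread).
const time_precision = 0.02;
const total_timestamp_tokens = (30.00 - 0.00) / time_precision; // 1500

// English-only Whisper: <|0.00|> = 50363  ->  <|30.00|> should be 51863
console.log(50363 + total_timestamp_tokens); // 51863

// Multilingual Whisper: <|0.00|> = 50364  ->  <|30.00|> should be 51864
console.log(50364 + total_timestamp_tokens); // 51864
```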

I do not have a model with added tokens that I can share publicly. I did find https://huggingface.co/oza75/whisper-bambara-asr-001, which has an added token after the final timestamp, but it's a special token.