xenova / transformers.js

State-of-the-art Machine Learning for the web. Run 🤗 Transformers directly in your browser, with no need for a server!
https://huggingface.co/docs/transformers.js
Apache License 2.0
10.86k stars 658 forks

[Feature request] Return offset mapping using tokenizer #425

Open arttorres0 opened 9 months ago

arttorres0 commented 9 months ago

Return offset mapping using tokenizer

Hi guys, awesome work on this project, it is helping us a lot! Would it be possible to add the "return_offsets_mapping" parameter to the Tokenizer class, just like in the Transformers Python library?

Reason for request

The returned offsets are crucial for our Named Entity Recognition project, because we need to map each predicted tag back to a character span in the original sentence (before tokenization). Currently we can only associate the predicted tags with the list of subtokens, since tags and subtokens correspond one-to-one. The subtoken list, however, drops some elements that are present in the original sentence, such as whitespace, so character positions cannot be recovered from it directly.
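Until something like this exists in the library, a rough workaround is to align the subtokens back to the original string by hand. The sketch below is only an illustration of that idea: alignTokensToText is a hypothetical helper, not part of transformers.js, and it assumes WordPiece-style tokens with a "##" continuation prefix, BERT-style special tokens, and a tokenizer that does nothing more than lowercase the text. It will not align tokens such as [UNK] or text changed by accent stripping or Unicode normalization.

```javascript
// Hypothetical helper (NOT a transformers.js API): greedily aligns
// WordPiece-style subtokens to [start, end) character spans in `text`.
// Assumptions: "##" marks a continuation piece, special tokens are
// BERT-style, and normalization is at most lowercasing.
function alignTokensToText(text, tokens, options = {}) {
  const {
    continuationPrefix = "##",
    specialTokens = ["[CLS]", "[SEP]", "[PAD]"],
  } = options;

  // Lowercase both sides so uncased vocabularies still match.
  // Caveat: for a few Unicode characters lowercasing changes the string
  // length, which would shift the offsets.
  const haystack = text.toLowerCase();
  const offsets = [];
  let cursor = 0; // never search before the end of the previous match

  for (const token of tokens) {
    if (specialTokens.includes(token)) {
      offsets.push([0, 0]); // same convention as the Python fast tokenizers
      continue;
    }
    const piece = (token.startsWith(continuationPrefix)
      ? token.slice(continuationPrefix.length)
      : token
    ).toLowerCase();

    const start = haystack.indexOf(piece, cursor);
    if (start === -1) {
      offsets.push(null); // could not align this token
      continue;
    }
    const end = start + piece.length;
    offsets.push([start, end]);
    cursor = end;
  }
  return offsets;
}

// Example with a hand-written token list (not real tokenizer output).
const text = "Hugging Face is in New York";
const tokens = ["[CLS]", "Hu", "##gging", "Face", "is", "in", "New", "York", "[SEP]"];
const offsets = alignTokensToText(text, tokens);

// Recover the original substring for each aligned token.
const spans = offsets.map((o) => (o ? text.slice(o[0], o[1]) : null));
console.log(offsets, spans);
```

Because tags and subtokens are one-to-one, offsets[i] gives the character span for the i-th predicted tag, and consecutive spans of one entity can be merged by taking the first start and the last end. A greedy search like this can misalign when an earlier token fails to match, so a built-in offset mapping would still be the more reliable option.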

Additional context

Reference for "return_offsets_mapping" in the Transformers Python library: https://huggingface.co/docs/transformers/v4.35.2/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__.return_offsets_mapping

xenova commented 9 months ago

Hi there 👋 Thanks for the suggestion! I do think this will be a useful addition to the library.

Is this something you'd be interested in contributing, by any chance? Otherwise, I'll mark this as a "good first issue" so someone else in the community can pick it up :)

RingoTC commented 7 months ago

I can do that.