winkjs / wink-eng-lite-web-model

English lite language model for Web Browsers
MIT License

word-joiner (U+2060) breaks words #14

Closed retorquere closed 7 months ago

retorquere commented 8 months ago

Would it be possible for text "separated" by the word-joiner character (U+2060) to be treated as one word? For example, 'main\u2060tain' would be one word.
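To make the request concrete, here is a minimal sketch of the difference between a tokenizer that breaks on U+2060 and one that keeps the joined text together. Both functions (`naiveTokenize`, `joinerAwareTokenize`) are hypothetical stand-ins written for illustration; they are not winkNLP's tokenizer and only handle ASCII letters and digits.

```javascript
// Hypothetical stand-in tokenizers for illustration only; NOT winkNLP's implementation.

// Naive: only ASCII letters and digits form a word, so U+2060 acts as a break.
function naiveTokenize(text) {
  return text.match(/[A-Za-z0-9]+/g) || [];
}

// Joiner-aware: U+2060 is allowed between word characters and stays inside the token.
function joinerAwareTokenize(text) {
  return text.match(/[A-Za-z0-9]+(?:\u2060[A-Za-z0-9]+)*/g) || [];
}

const sample = 'we main\u2060tain it';

console.log(naiveTokenize(sample).length);       // 4 tokens: 'main' and 'tain' are split
console.log(joinerAwareTokenize(sample).length); // 3 tokens: 'main\u2060tain' stays whole
```

The second behaviour is what is being asked for: the word joiner should not count as a token boundary.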

rachnachakraborty commented 8 months ago

Hi @retorquere

Thanks for highlighting the word-joiner issue to us.

Currently, winkNLP does not handle this case.

We will take up this issue shortly after our upcoming major release, which adds word embeddings support to winkNLP.

We shall keep you posted.

Best, Rachna
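In the meantime, one possible workaround (an assumption on my part, not a winkNLP feature) is to strip U+2060 from the text before handing it to the tokenizer. The helper name `stripWordJoiners` is hypothetical, and the approach has a cost: the joiner is removed from the token text, so character offsets no longer match the original input.

```javascript
// Hypothetical pre-processing helper; not part of winkNLP or the model.
// Removes every WORD JOINER (U+2060) so the surrounding letters form one word.
function stripWordJoiners(text) {
  return text.replace(/\u2060/g, '');
}

console.log(stripWordJoiners('main\u2060tain') === 'maintain'); // true
```

The cleaned string can then be passed to the tokenizer as usual.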

sanjayaksaxena commented 7 months ago

@retorquere we have released version 1.7.0 of the model, which now supports the word joiner and accented characters; winkNLP itself remains unchanged.