winkjs / wink-eng-lite-web-model

English lite language model for Web Browsers
MIT License

accented characters treated differently from non-accented #13

Closed retorquere closed 4 months ago

retorquere commented 5 months ago

'Poincar\u00e9' gets the shape 'Xxxxxé' instead of 'Xxxxxx', and 'Poincare\u0301' (which is just the NFD form of the former) is tokenized as two tokens ('Poincare' and '\u0301').
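For readers who want to see the two forms side by side, here is a minimal sketch using only built-in JavaScript string methods. It does not call wink-nlp at all. The `shape` helper is hypothetical and deliberately simplified: it maps each character to `X`, `x`, or `d` after NFC normalization and makes no attempt to reproduce wink-nlp's actual shape rules.

```javascript
// Two spellings of the same word: precomposed (NFC) and decomposed (NFD).
const nfc = 'Poincar\u00e9';   // "é" as a single code point, U+00E9
const nfd = 'Poincare\u0301';  // "e" followed by the combining acute accent, U+0301

// Hypothetical helper (an assumption for illustration, not wink-nlp's API):
// normalize to NFC first, then classify each character by Unicode category.
function shape(word) {
  return Array.from(word.normalize('NFC'), (ch) => {
    if (/\p{Lu}/u.test(ch)) return 'X'; // uppercase letter, accented or not
    if (/\p{Ll}/u.test(ch)) return 'x'; // lowercase letter, accented or not
    if (/\p{Nd}/u.test(ch)) return 'd'; // decimal digit
    return ch;                          // anything else is kept as-is
  }).join('');
}

console.log(nfc.length, nfd.length);         // 8 9
console.log(nfc === nfd);                    // false
console.log(nfd.normalize('NFC') === nfc);   // true
console.log(shape(nfc) === shape(nfd));      // true
```

Normalizing to NFC before any further processing makes both spellings identical, and classifying by Unicode category (`\p{Lu}`, `\p{Ll}`) rather than by ASCII range treats 'é' like any other lowercase letter.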

rachnachakraborty commented 5 months ago

Hi @retorquere

Thanks for writing to us.

We have noted both of the special cases you highlighted and will take them up in the next release of fixes.

Best, Rachna

retorquere commented 5 months ago

The release made for winkjs/wink-nlp#135 did not yet include this, correct?

sanjayaksaxena commented 5 months ago

Yes, that is correct. We will start work on the remaining two soon.