winkjs / wink-eng-lite-web-model

English lite language model for Web Browsers
MIT License

Thin spaces (U+2009) are removed from sentences #15

Closed (dchest closed 3 months ago)

dchest commented 3 months ago

Example (run in an Observable notebook):

{
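  // The character between the two words is U+2009 (thin space), written as an escape so it stays visible.
  const doc = nlp.readDoc("hello\u2009world");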
  doc.sentences()
     .each((e) => e.markup('<tr><td>', '</td></tr>'));
  return html`${'<table><tr><th>Sentences</th>'+doc.out(its.markedUpText)+'</table>'}`;
}

This returns "helloworld": the thin space is dropped and the two words are joined.

Note that the character between hello and world is a thin space (U+2009), not the regular space (U+0020).
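
Because a thin space looks almost identical to a regular space on screen, it can help to check the code points directly. The following is a minimal sketch in plain JavaScript; `findUnusualSpaces` is a hypothetical helper written for illustration and is not part of wink-nlp. It relies on the fact that the regular expression class `\s` matches Unicode whitespace, including U+2009.

```javascript
// Hypothetical helper (not part of wink-nlp): report every whitespace
// character in `text` other than the ASCII space, tab, LF, and CR.
function findUnusualSpaces(text) {
  const found = [];
  for (let i = 0; i < text.length; i += 1) {
    const ch = text[i];
    // \s covers Unicode whitespace such as U+2009; the second test
    // filters out the ordinary ASCII space, tab, and line breaks.
    if (/\s/.test(ch) && !/[ \t\n\r]/.test(ch)) {
      const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
      found.push({ index: i, codePoint: `U+${hex}` });
    }
  }
  return found;
}

console.log(findUnusualSpaces('hello\u2009world'));
// prints [ { index: 5, codePoint: 'U+2009' } ]
```

Running this on the input string before passing it to `nlp.readDoc` shows whether a missing space in the output came from a non-ASCII space in the source text.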

rachnachakraborty commented 3 months ago

Hi @dchest

Thank you for pointing out this miss.

We will keep you posted here once we fix it.

Best, Rachna

sanjayaksaxena commented 3 months ago

We have released the following:

  1. wink-eng-lite-web-model Version 1.8.0
  2. wink-nlp Version 2.3.0

Together, these releases now handle em/en, third/quarter, thin/hair, and medium mathematical spaces, as well as regular and narrow no-break spaces (nbsp).
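
For anyone who cannot upgrade yet, one possible workaround is to replace these characters with a regular space before the text reaches `nlp.readDoc`. The sketch below is plain JavaScript and is not how wink-nlp handles the spaces internally. The mapping from the names above to code points is an assumption (for instance, "third/quarter" is read here as the three-per-em and four-per-em spaces), and `SPECIAL_SPACES` and `normalizeSpaces` are hypothetical names.

```javascript
// Assumed mapping of the space names in the release note to Unicode
// characters. This table is illustrative, not taken from wink-nlp.
const SPECIAL_SPACES = {
  '\u2002': 'en space',
  '\u2003': 'em space',
  '\u2004': 'three-per-em space',
  '\u2005': 'four-per-em space',
  '\u2009': 'thin space',
  '\u200A': 'hair space',
  '\u205F': 'medium mathematical space',
  '\u00A0': 'no-break space',
  '\u202F': 'narrow no-break space',
};

// Character class built from the keys above; none of these characters
// has a special meaning inside a regular expression character class.
const SPECIAL_SPACE_PATTERN = new RegExp(`[${Object.keys(SPECIAL_SPACES).join('')}]`, 'g');

// Hypothetical helper: replace each listed space with U+0020.
function normalizeSpaces(text) {
  return text.replace(SPECIAL_SPACE_PATTERN, ' ');
}

console.log(normalizeSpaces('hello\u2009world')); // prints "hello world"
```

The trade-off is that the original characters are lost, so any text produced from the document afterwards contains regular spaces; upgrading to the versions listed above avoids the need for this step.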

dchest commented 3 months ago

Awesome, thanks a lot! I verified that the new versions work with thin spaces in my use case.