winkjs / wink-nlp

Developer friendly Natural Language Processing ✨
https://winkjs.org/wink-nlp/
MIT License

The word AI is classified as the word be during POS tagging. #141

Open moskaliukua opened 2 months ago

moskaliukua commented 2 months ago

Hi, I have run into a problem with POS tagging in sentences like "It is an AI." It seems to be consistent in other sentences as well:

"It made a lot of waves in the AI field." I would expect the word "AI" to be classified as PROPN, but instead I get AUX, and the lemma is "be".

import winkNLP from 'wink-nlp';
import model from 'wink-eng-lite-web-model';
const nlp = winkNLP(model);
const its = nlp.its; // item helpers such as its.lemma and its.pos
const doc = nlp.readDoc('It is an AI.');
console.log(doc.tokens().out(its.lemma));
// [ 'it', 'be', 'an', 'be', '.' ]
doc.printTokens();

token      p-spaces   prefix  suffix  shape   case    nerHint type     normal/pos
———————————————————————————————————————————————————————————————————————————————————————
It                0   It      It      Xx      3       0       word     it / PRON
is                1   is      is      xx      1       0       word     is / AUX
an                1   an      an      xx      1       0       word     an / DET
AI                1   AI      AI      XX      2       0       word     ai / **AUX**
.                 0   .       .       .       0       0       punctuat . / PUNCT

total number of tokens: 5

Package versions: "wink-eng-lite-web-model": "^1.8.0", "wink-nlp": "^2.3.0"

rachnachakraborty commented 2 months ago

Hi @moskaliukua,

Thanks for highlighting this issue.

The lexicon was trained on a corpus containing archaic words like "Ain't". This gets tokenised into two tokens, 'Ai' and 'not', where 'Ai' is an auxiliary verb. As a result, the token "AI" picks up the same AUX tag and the lemma "be".

We plan to rebuild it soon with the corrections incorporated.
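
In the meantime, one possible stopgap is to post-process the tagger output on the caller's side. The sketch below is only an illustration under stated assumptions: `patchAcronymTags` and `ACRONYM_OVERRIDES` are hypothetical names, not part of wink-nlp, and the choice of `PROPN` with lemma `AI` simply mirrors the expectation in the report. It takes the parallel arrays that `doc.tokens().out(its.value)`, `doc.tokens().out(its.pos)` and `doc.tokens().out(its.lemma)` would return, and rewrites only tokens whose exact surface form is in the table and that were tagged `AUX`, so the 'Ai' coming from a contraction is left alone.

```javascript
// Hypothetical helper, not part of wink-nlp: overrides the POS tag and lemma
// of known acronyms that the tagger labelled as AUX.
const ACRONYM_OVERRIDES = new Map([
  ['AI', { pos: 'PROPN', lemma: 'AI' }], // assumed target values
]);

function patchAcronymTags(values, posTags, lemmas, overrides = ACRONYM_OVERRIDES) {
  return values.map((value, i) => {
    const fix = overrides.get(value); // exact, case-sensitive match on the surface form
    if (fix && posTags[i] === 'AUX') {
      return { value, pos: fix.pos, lemma: fix.lemma };
    }
    // Everything else passes through unchanged.
    return { value, pos: posTags[i], lemma: lemmas[i] };
  });
}

// Tagger output copied from the report for 'It is an AI.'
const patched = patchAcronymTags(
  ['It', 'is', 'an', 'AI', '.'],
  ['PRON', 'AUX', 'DET', 'AUX', 'PUNCT'],
  ['it', 'be', 'an', 'be', '.']
);
console.log(patched.map((t) => `${t.value}/${t.pos}/${t.lemma}`).join(' '));
```

Because the match is case-sensitive and limited to tokens tagged `AUX`, the genuine auxiliary in "is" keeps its tag and lemma. Other affected acronyms, if any turn up, could be added to the table in the same way.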

Shall keep you posted.

Best, Rachna