Open mircealungu opened 3 months ago
I can look into this. Could you share the text ids for those examples so I can take a look?
For the second example, I am not sure what the expected behaviour would be. For the other two, I think we just need to improve the pattern matching in the algorithm. We could also consider using a tokenizer from the API, which would make tokenization more consistent throughout.
I have something like this now:
Essentially, I have added a step: after we split the text into word tokens on whitespace, a second pass checks for joined words and these special tokens, and that is where we can handle these cases.
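To make the two-pass idea concrete, here is a minimal sketch in Python. It is not the code on the branch: the function names `split_joined` and `tokenize` are made up for illustration, and the rule it assumes (lowercase letter, then `.`, `!` or `?`, then an uppercase letter, with no space in between) is only one possible way to detect joined words.

```python
def split_joined(token):
    """Second pass: split a token such as 'slut.Den' into ['slut.', 'Den'].

    Assumed rule (illustrative only): sentence punctuation that sits between
    a lowercase letter and an uppercase letter marks a missing space.
    """
    for i in range(1, len(token) - 1):
        if token[i] in ".!?" and token[i - 1].islower() and token[i + 1].isupper():
            # Keep the punctuation with the left part, then recurse on the rest
            return [token[: i + 1]] + split_joined(token[i + 1 :])
    return [token]


def tokenize(text):
    """First pass splits on whitespace; second pass repairs joined words."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_joined(raw))
    return tokens


if __name__ == "__main__":
    # Made-up example sentence with a missing space after the first period
    print(tokenize("Det var slut.Den kom igen."))
```

Tokens like `f.eks.` or `U.S.A.` are left alone under this rule because the character after the period is not an uppercase letter following a lowercase one, but something like `example.Com` would still be split, so the rule is a heuristic rather than a full solution.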
I have added a check for abbreviations: it looks at whether the next word starts with an uppercase letter to decide if this is the end of the sentence. The results look like this:
This should at least handle cases like ift., as long as it isn't followed by a proper noun. I didn't find a list of abbreviations, so without a pipeline that does more complex parsing this might be the best we can do. If we could check for proper nouns, I think we could have something a little more robust.
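The uppercase check can be sketched roughly as follows, again in Python and again with hypothetical names (`ends_sentence`, `split_sentences`) rather than the actual code on the branch. The assumption is that a token ending in a period closes the sentence only when the next token starts with an uppercase letter; otherwise it is treated as an abbreviation.

```python
def ends_sentence(tokens, i):
    """Heuristic: does tokens[i] close a sentence?"""
    token = tokens[i]
    if token.endswith(("!", "?")):
        return True
    if not token.endswith("."):
        return False
    if i + 1 == len(tokens):
        return True  # last token of the text
    # A period followed by a lowercase word is assumed to be an abbreviation
    return tokens[i + 1][:1].isupper()


def split_sentences(tokens):
    """Group tokens into sentences using ends_sentence."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if ends_sentence(tokens, i):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    # Made-up example: 'ift.' is followed by a lowercase word, so no split
    print(split_sentences("Prisen er lav ift. kvaliteten. Det er godt.".split()))
    # Known weakness: a proper noun after the abbreviation triggers a wrong split
    print(split_sentences("Lav ift. Danmark.".split()))
```

The second call shows the limitation mentioned above: because `Danmark` is capitalised, `ift.` is wrongly taken as the end of a sentence. A list of known abbreviations or a proper-noun check would be needed to avoid that.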
Added branch https://github.com/zeeguu/web/tree/update-tokenization with suggestions to obtain the result above.
When the original text is missing a space after the end of a sentence, the last word of that sentence and the first word of the next one are treated as a single word.
Equally wrong is when two words are not connected because one of them is an abbreviation:
And another imperfect situation: