Better sentence tokenization needed in the InteractiveText

zeeguu / web

Frontend for the zeeguu web application.

https://www.zeeguu.org


Better sentence tokenization needed in the InteractiveText #334

Open mircealungu opened 3 months ago

mircealungu commented 3 months ago

When the original text is missing a space after the end of a sentence, the last word of the previous sentence and the first word of the next one are treated as a single word.

It is equally wrong when two words are not connected because one of them is an abbreviation:

And another imperfect situation:

tfnribeiro commented 3 months ago

I can look into this. Could you share the text ids for those examples so I can take a look?

For the second example, I am not sure what the expected behaviour would be. For the other two, I think we just need to improve the pattern matching in the algorithm. We could also consider using a tokenizer from the API for this, to keep tokenization consistent throughout.

tfnribeiro commented 3 months ago

I have something like this now:

Essentially, after we split the text into word tokens on whitespace, there is now a second pass that checks for joined words and these special tokens. That second pass is where we could handle these cases.
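A minimal sketch of such a two-pass approach, for illustration only. It is written in Python to keep it short and is not the code in the frontend; the names `split_joint_token` and `tokenize` are hypothetical, and the rule used (sentence punctuation between a lowercase and an uppercase letter) is an assumption about what "joined words" means here.

```python
SENTENCE_END = ".!?"


def split_joint_token(token):
    """Second pass: split one whitespace-delimited token wherever sentence
    punctuation sits between a lowercase and an uppercase letter,
    e.g. 'dag.Han' -> ['dag.', 'Han']."""
    parts = []
    start = 0
    # Only inner positions can have both a previous and a next character.
    for i in range(1, len(token) - 1):
        prev_ch, ch, next_ch = token[i - 1], token[i], token[i + 1]
        if ch in SENTENCE_END and prev_ch.islower() and next_ch.isupper():
            parts.append(token[start:i + 1])  # keep the punctuation with the first word
            start = i + 1
    parts.append(token[start:])
    return parts


def tokenize(text):
    """First pass splits on whitespace, second pass repairs joined words."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_joint_token(raw))
    return tokens


print(tokenize("Det var en god dag.Han gik hjem."))
```

Tokens such as "f.eks." or "U.S.A" are left intact because the punctuation is not followed by an uppercase letter after a lowercase one. A token like "www.Example" would still be split, so this heuristic is not exhaustive.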

tfnribeiro commented 3 months ago

I have added a check for abbreviations: when a token ends with a period, it first checks whether the next word starts with an uppercase letter to decide if this is the end of the sentence. The results look like this:

This should at least handle cases like "ift.", as long as the word that follows is not a proper noun. I didn't find a list of abbreviations, so without a pipeline that does more complex parsing this might be the best we can do. If we could check for proper nouns, I think we could have something a little more robust.
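The uppercase check can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the code in the frontend: `ends_sentence` and `group_sentences` are hypothetical names, tokens are assumed to come from a plain whitespace split, and "!" and "?" are assumed to always end a sentence.

```python
def ends_sentence(token, next_token):
    """Decide whether `token` closes a sentence.

    A trailing period only counts as a sentence end when the next token
    starts with an uppercase letter (or there is no next token); otherwise
    the token is assumed to be an abbreviation such as 'ift.'.
    """
    if token.endswith(("!", "?")):
        return True
    if not token.endswith("."):
        return False
    if next_token is None:
        return True
    return next_token[:1].isupper()


def group_sentences(tokens):
    """Group a flat token list into sentences using `ends_sentence`."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if ends_sentence(token, next_token):
            sentences.append(current)
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(current)
    return sentences


# The abbreviation is followed by a lowercase word, so the sentence continues.
print(group_sentences("Vi bruger det ift. prisen. Det er fint.".split()))

# Known limitation: a proper noun after the abbreviation triggers a false split.
print(group_sentences("Se det ift. Danmark.".split()))
```

The second call shows the proper-noun limitation described above: "ift." followed by "Danmark" is wrongly treated as a sentence end.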

tfnribeiro commented 3 months ago

I added the branch https://github.com/zeeguu/web/tree/update-tokenization with the suggested changes that produce the result above.