zencephalon / Tactful_Tokenizer

Accurate Bayesian sentence tokenizer in Ruby.
80 stars 13 forks source link

Right brackets are lost during tokenization #10

Closed louismullie closed 10 years ago

louismullie commented 12 years ago

Right brackets are lost from text after the following Regexp is applied: (?-mix:(.)(?=[?!\)";}\]*:@\.'])|(?=[\)}\]])(.)|(.)(?=[({\[])|((^|\s)-)(?=[^-])).

Input sentence:

These have been interpreted as reports of unidentified flying objects (UFOs), but may just as well describe meteors, and, since Obsequens, probably, writes in the 4th century, that is, some 400 years after the events he describes, they hardly qualify as eye-witness accounts.

Before Regexp:

These have been interpreted as reports of unidentified flying objects ( UFOs), but may just as well describe meteors, and, since Obsequens, probably, writes in the 4th century, that is, some 400 years after the events he describes, they hardly qualify as eye-witness accounts.

After Regexp:

These have been interpreted as reports of unidentified flying objects ( UFOs , but may just as well describe meteors, and, since Obsequens, probably, writes in the 4th century, that is, some 400 years after the events he describes, they hardly qualify as eye-witness accounts .

zencephalon commented 11 years ago

@louismullie, thanks for reporting this, I'll look into it.

zencephalon commented 10 years ago

@louismullie I added a test case based on the input you provided. It seems like this isn't an issue anymore.