yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

NLTK error when parsing sentences with unescaped parentheses. #115

Closed. norpadon closed this issue 1 year ago

norpadon commented 1 year ago
from supar import Parser

sentence = 'This is a t(est)'
parser = Parser.load('crf-con-roberta-en')
parser.predict([sentence], lang='en', verbose=False)

crashes with the following error:

File /opt/homebrew/lib/python3.10/site-packages/nltk/tree/tree.py:731, in Tree._parse_error(cls, s, match, expecting)
    730 msg += '\n{}"{}"\n{}^'.format(" " * 16, s, " " * (17 + offset))
--> 731 raise ValueError(msg)

ValueError: Tree.read(): expected ')' but got 'end-of-string'
            at index 71.
                "..._ -RRB-)))"
                              ^

The same is true for sentence = '(713)853-7041'.

If I add whitespace before ( and ), everything works fine.
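
For anyone on an affected version, the whitespace workaround can be applied automatically before calling predict. This is only a minimal sketch: pad_parentheses is a hypothetical helper, not part of supar, and it pads both sides of every parenthesis, which is slightly more than what was tested by hand above.

```python
import re


def pad_parentheses(sentence: str) -> str:
    """Surround every '(' and ')' with spaces, then collapse repeated whitespace.

    Hypothetical helper for illustration only; not part of supar.
    """
    padded = re.sub(r"([()])", r" \1 ", sentence)
    return " ".join(padded.split())


print(pad_parentheses("This is a t(est)"))
print(pad_parentheses("(713)853-7041"))
```

The padded string would then be passed to parser.predict in place of the raw sentence. Note that this changes the token boundaries the parser sees.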

yzhangcs commented 1 year ago

@norpadon Hi, the problem occurs because brackets are not fully handled after tokenization.

>>> from supar.utils.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> t('This is a t(est)')
['This', 'is', 'a', 't(', 'est', ')']
>>> t('This is a t (est)')
['This', 'is', 'a', 't', '(', 'est', ')']

This bug has been fixed in the latest commits; see https://github.com/yzhangcs/parser/blob/ce34fc254e5a0757605c5be7db6a2cd089adc2f7/supar/utils/transform.py#L420-L458
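
To illustrate why a token such as 't(' breaks the bracketed tree string, here is a minimal standalone sketch. It is not the code from the linked commit: escape_brackets, to_bracketed and is_balanced are hypothetical helpers, the flat (TOP (_ word) ...) layout is a simplification, and the trailing ')' is assumed to be already escaped as -RRB-, as the error message above suggests.

```python
# Hypothetical helpers for illustration; not part of supar or NLTK.
BRACKETS = {"(": "-LRB-", ")": "-RRB-"}


def escape_brackets(token: str) -> str:
    """Replace every literal parenthesis inside a token with its PTB-style escape."""
    return "".join(BRACKETS.get(ch, ch) for ch in token)


def to_bracketed(tokens):
    """Build a flat bracketed tree string with a placeholder label on each leaf."""
    leaves = " ".join(f"(_ {tok})" for tok in tokens)
    return f"(TOP {leaves})"


def is_balanced(s: str) -> bool:
    """Return True if every '(' in s is closed by a matching ')'."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Assumed token list: the tokenizer output shown above, with the lone ')' escaped.
tokens = ["This", "is", "a", "t(", "est", "-RRB-"]
raw = to_bracketed(tokens)
safe = to_bracketed([escape_brackets(t) for t in tokens])

print(is_balanced(raw))   # the stray '(' inside 't(' is never closed
print(is_balanced(safe))  # all leaf parentheses are escaped
```

Escaping per character rather than per whole token is the design choice that matters in this sketch: a check that only matches tokens equal to '(' or ')' leaves 't(' untouched.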