yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

Crash when there are parentheses among tokens #65

Closed hristost closed 3 years ago

hristost commented 3 years ago

The code below:

import nltk
import supar

sent = 'Supar (in particular, the tree binarization) crashes when parentheses are present in input.'

parser = supar.Parser.load('crf-con-en')

tokens = [nltk.word_tokenize(sent)]
parsed = parser.predict(tokens).sentences

print(parsed)

crashes when the input contains parentheses. The backtrace is as follows:

2021-04-12 23:08:45 INFO Loading the data
Traceback (most recent call last):                           
  File "paren.py", line 9, in <module>
    parsed = parser.predict(tokens).sentences
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/crf_constituency.py", line 129, in predict
    return super().predict(**Config().update(locals()))
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/parser.py", line 129, in predict
    dataset = Dataset(self.transform, data)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/data.py", line 38, in __init__
    self.sentences = transform.load(data, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 656, in load
    sentences.append(TreeSentence(self, tree))
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 681, in __init__
    Tree.factorize(Tree.binarize(tree)[0])]
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 520, in binarize
    if not isinstance(child[0], nltk.Tree):
  File "/usr/local/lib/python3.8/site-packages/nltk-3.5-py3.8.egg/nltk/tree.py", line 162, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range

This seems related to #64 and #59. I tried editing the code so that it does not index into empty lists, but I ended up either discarding tokens or triggering an error inside nltk.Tree.collapse_unary.

When I replaced the parentheses with square brackets (Supar [the parser] crashes when parentheses are present in input.), parsing worked but produced an odd result, with "the parser" moved inside the verb phrase:

(TOP
  (S
    (NP (_ Supar))
    (VP
      (_ [)
      (SBAR
        (S
          (NP
            (NP
              (_ the)
              (_ parser))
            (_ ]))
          (VP
            (_ crashes)
            (SBAR
              (WHADVP (_ when))
              (S
                (NP (_ parentheses))
                (VP
                  (_ are)
                  (ADJP
                    (_ present)
                    (PP
                      (_ in)
                      (NP (_ input)))))))))))
    (_ .)))

Has the model been trained on text that uses parentheses? Or are we expected to strip out text inside parentheses and parse it separately?

yzhangcs commented 3 years ago

@hristost Hi, in PTB, parentheses are normalized as -LRB-/-RRB-, so replacing ( and ) with -LRB- and -RRB- may lead to better results.

>>> print(parser.predict(['Supar', '-LRB-', 'in', 'particular', ',', 'the', 'tree', 'binarization', '-RRB-', 'crashes', 'when', 'parentheses', 'are', 'present', 'in', 'input', '.'], verbose=False).sentences[0].trees.pformat(30))
(TOP
  (S
    (NP
      (NP (_ Supar))
      (PRN
        (_ -LRB-)
        (PP
          (_ in)
          (NP (_ particular)))
        (_ ,)
        (NP
          (_ the)
          (_ tree)
          (_ binarization))
        (_ -RRB-)))
    (VP
      (_ crashes)
      (SBAR
        (WHADVP (_ when))
        (S
          (NP
            (_ parentheses))
          (VP
            (_ are)
            (ADJP
              (_ present)
              (PP
                (_ in)
                (NP
                  (_ input))))))))
    (_ .)))
>>> print(parser.predict(['Supar', '[', 'in', 'particular', ',', 'the', 'tree', 'binarization', ']', 'crashes', 'when', 'parentheses', 'are', 'present', 'in', 'input', '.'], verbose=False).sentences[0].trees.pformat(30))
(TOP
  (S
    (S
      (NP
        (NP (_ Supar))
        (_ [))
      (PP
        (_ in)
        (NP (_ particular))))
    (_ ,)
    (NP
      (NP
        (_ the)
        (_ tree)
        (_ binarization))
      (_ ]))
    (VP
      (_ crashes)
      (SBAR
        (WHADVP (_ when))
        (S
          (NP
            (_ parentheses))
          (VP
            (_ are)
            (ADJP
              (_ present)
              (PP
                (_ in)
                (NP
                  (_ input))))))))
    (_ .)))
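
For versions without the fix, the normalization described above can be applied as a preprocessing step before calling the parser. The following is only a minimal sketch: `PTB_BRACKETS` and `normalize_brackets` are hypothetical names, not part of SuPar, and it handles round parentheses only, since those are the only brackets mentioned here.

```python
# Hypothetical helper (not part of SuPar): replace round brackets with the
# PTB-style tokens -LRB- / -RRB- before the tokens reach the parser.
PTB_BRACKETS = {"(": "-LRB-", ")": "-RRB-"}


def normalize_brackets(tokens):
    """Return a copy of `tokens` with '(' and ')' mapped to -LRB-/-RRB-."""
    return [PTB_BRACKETS.get(token, token) for token in tokens]


tokens = ['Supar', '(', 'in', 'particular', ',', 'the', 'tree', 'binarization', ')',
          'crashes', 'when', 'parentheses', 'are', 'present', 'in', 'input', '.']
normalized = normalize_brackets(tokens)
print(normalized)
```

The resulting list can then be passed to `parser.predict` in the same way as the hand-written token list in the session above.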

I will fix this issue in the next release. Thank you for reporting the problem.

yzhangcs commented 3 years ago

@hristost Fixed. Refer to the code below (available in SuPar v1.0.1). https://github.com/yzhangcs/parser/blob/2bb44dc9dafa212e90086cd2da580165c2609fa4/supar/utils/transform.py#L499-L532