yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License

How to use with spaCy nlp pipeline? #72

Closed emadg closed 3 years ago

emadg commented 3 years ago

Thanks for providing this great work on parsing.

Is there a way to plug the constituency parser in as a component of a spaCy NLP pipeline?

I found an example of this capability in the Benepar library:

import spacy
from benepar.spacy_plugin import BeneparComponent

nlp = spacy.load('en_core_web_trf')
if spacy.__version__.startswith('2'):
    nlp.add_pipe(BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})

doc = nlp('The time for action is now. It is never too late to do something.')
sent = list(doc.sents)[0]
print(sent._.parse_string)
# (S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))

It would be nice to have your library available as part of a spaCy pipeline, so that spaCy's tokenizer and POS tagger could supply the inputs to your constituency parser (although, judging from your paper, I think the CRF constituency parser does not use POS tags?).

yzhangcs commented 3 years ago

@emadg Thank you for the suggestion. I'll take some time to understand the APIs. I'd be glad to integrate the models into spaCy if it doesn't take too much effort.

yzhangcs commented 3 years ago

@emadg Hi, I just spent some time understanding the APIs. It's not difficult to integrate SuPar into a spaCy pipeline, though some hacking is required.

>>> from supar import Parser
>>> import spacy
>>> from spacy.tokens import Span
>>> # Load the pretrained English CRF constituency parser.
>>> parser = Parser.load('crf-con-en')
>>> # Expose the tree as a Span attribute; the getter re-runs the parser on every access.
>>> Span.set_extension("con_tree", getter=lambda x: parser.predict([i.text for i in x], verbose=False)[0])
>>> nlp = spacy.load('en_core_web_md')
>>> doc = nlp("The time for action is now. It's never too late to do something.")
>>> for sent in doc.sents:
...     sent._.con_tree.pretty_print()
... 
                          TOP                   
                           |                     
                           S                    
               ____________|__________________   
              NP                     |        | 
      ________|_______               |        |  
     |                PP             VP       | 
     |             ___|____       ___|___     |  
     NP           |        NP    |      ADVP  | 
  ___|___         |        |     |       |    |  
 _       _        _        _     _       _    _ 
 |       |        |        |     |       |    |  
The     time     for     action  is     now   . 

                       TOP                           
                        |                             
                        S                            
  ______________________|__________________________   
 |                      VP                         | 
 |    __________________|________                  |  
 |   |    |        |             S                 | 
 |   |    |        |             |                 |  
 |   |    |        |             VP                | 
 |   |    |        |          ___|___              |  
 NP  |    |        |         |       VP            | 
 |   |    |        |         |    ___|______       |  
 NP  |   ADVP     ADJP       |   |          NP     | 
 |   |    |     ___|____     |   |          |      |  
 _   _    _    _        _    _   _          _      _ 
 |   |    |    |        |    |   |          |      |  
 It  's never too      late  to  do     something  . 
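
Because the getter in the snippet above calls parser.predict every time sent._.con_tree is read, repeated access re-parses the same sentence. A minimal sketch of one way to avoid that is to wrap the call in a small cache keyed on the token texts. The names make_cached_getter, predict_one and fake_predict are hypothetical and not part of SuPar or spaCy; the stand-in parser only exists so the sketch runs without any models installed.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, List, Tuple


def make_cached_getter(predict_one: Callable[[List[str]], Any]) -> Callable[[Iterable[Any]], Any]:
    """Build a getter that parses each distinct token sequence only once.

    `predict_one` is a hypothetical callable that maps a list of token
    strings to a parse result (assumption: something like
    `lambda words: parser.predict(words, verbose=False)[0]`).
    """
    cache: Dict[Tuple[str, ...], Any] = {}

    def getter(span: Iterable[Any]) -> Any:
        # Key on the token texts, so identical sentences share one parse.
        words = tuple(token.text for token in span)
        if words not in cache:
            cache[words] = predict_one(list(words))
        return cache[words]

    return getter


# Stand-in parser for demonstration only: records calls, returns a flat bracketing.
calls: List[List[str]] = []


def fake_predict(words: List[str]) -> str:
    calls.append(words)
    return "(TOP " + " ".join(words) + ")"


getter = make_cached_getter(fake_predict)
span = [SimpleNamespace(text=w) for w in ["The", "time", "is", "now", "."]]
first = getter(span)
second = getter(span)  # served from the cache; fake_predict is not called again
```

With spaCy, the returned function could be passed as the getter argument of Span.set_extension in place of the lambda. Note that the cache in this sketch is unbounded and hands back the same object on every access, so it is only suitable when the parse results are treated as read-only.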

emadg commented 3 years ago

@yzhangcs Thank you for providing these helpful instructions for getting SuPar to work with spaCy!

hardianlawi commented 2 years ago

@yzhangcs In terms of sentence segmentation, isn't SuPar's better than spaCy's? I'm wondering whether passing sentences split by spaCy would affect the parsing results.

yzhangcs commented 2 years ago

@hardianlawi Sentence segmentation is invisible to SuPar, and the parser is trained on gold-segmented sentences. So sentences split by spaCy may have some impact on performance, although I have not done a rigorous comparison.

hardianlawi commented 2 years ago

Thanks for your reply, @yzhangcs.

"the parser is trained on gold-segmented sentences."

Does this also apply to the dependency parser? This means that, ideally, the text passed to a SuPar parser should already be properly segmented, right?

yzhangcs commented 2 years ago

@hardianlawi Yes. You need a tool such as UDPipe, Stanza, or spaCy to handle such preprocessing steps.
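
The preprocessing order described here (segment first, tokenize each sentence, then parse one sentence at a time) can be sketched as follows. This is only an illustration under stated assumptions: naive_segment and naive_tokenize are hypothetical regex stand-ins for a real tool such as UDPipe, Stanza, or spaCy, and parse_sentence stands in for a call into a SuPar parser. None of these names come from any of those libraries.

```python
import re
from typing import Any, Callable, List


def naive_segment(text: str) -> List[str]:
    # Stand-in segmenter: split on whitespace that follows '.', '!' or '?'.
    # A real pipeline would use a trained sentence segmenter instead.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_tokenize(sentence: str) -> List[str]:
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def parse_text(
    text: str,
    parse_sentence: Callable[[List[str]], Any],
    segment: Callable[[str], List[str]] = naive_segment,
    tokenize: Callable[[str], List[str]] = naive_tokenize,
) -> List[Any]:
    # The parser only ever sees one pre-segmented, pre-tokenized sentence.
    return [parse_sentence(tokenize(sentence)) for sentence in segment(text)]


text = "The time for action is now. It is never too late to do something."
# Identity "parser" for demonstration: returns the tokens it was given.
token_lists = parse_text(text, parse_sentence=lambda words: words)
```

Keeping the segmenter and tokenizer as parameters makes it easy to swap in a different tool and compare its effect on the parser's output, which is the kind of comparison the thread notes has not been done rigorously.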