Parenthesis inside tokens are not parsable

yzhangcs / parser

State-of-the-art parsers for natural language.

https://parser.yzhang.site/

MIT License


Parenthesis inside tokens are not parsable #120

Closed by yochail 1 year ago

yochail commented 1 year ago

Hi, thanks for all the good work!

I'm not working with a Stanza tokenizer but with a customized spaCy tokenizer. Still, I think that, like my tokenizer, Stanza will return the tokens ["hello", "world", ":)"] for the input "hello world :)", which will throw an exception inside the NLTK dataset constructor.

This issue was partly fixed, and the fix works for tokens that consist of a single parenthesis, like '(' or ')'. For longer tokens that contain a parenthesis, mainly emoticons like ':)', '(8', etc., the issue still exists and causes an error.
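To illustrate why such tokens are a problem: in a bracketed tree string, parentheses mark constituent boundaries, so a raw ':)' leaf leaves the brackets unbalanced. A minimal sketch of one common workaround follows, replacing every parenthesis inside a token with the PTB-style escapes -LRB- and -RRB-. The helpers `escape_token` and `flat_tree` are hypothetical names used only for illustration; this is not the library's actual code, and it uses no NLTK calls.

```python
# Hypothetical helpers for illustration only; not the parser's actual implementation.
BRACKETS = {"(": "-LRB-", ")": "-RRB-"}


def escape_token(token: str) -> str:
    """Replace every parenthesis in a token with its PTB-style escape."""
    return "".join(BRACKETS.get(ch, ch) for ch in token)


def flat_tree(tokens, root="TOP", tag="_"):
    """Build a flat bracketed tree string with one (tag token) leaf per token."""
    leaves = " ".join(f"({tag} {escape_token(t)})" for t in tokens)
    return f"({root} {leaves})"


tokens = ["hello", "world", ":)"]
tree = flat_tree(tokens)
print(tree)  # (TOP (_ hello) (_ world) (_ :-RRB-))

# After escaping, the only parentheses left are structural, so they balance.
assert tree.count("(") == tree.count(")")
```

Escaping character by character, rather than only matching tokens that are exactly '(' or ')', is what covers multi-character tokens such as ':)' and '(8'.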

yzhangcs commented 1 year ago

@yochail Hi, could you provide the constructed examples that raise errors? (Please make sure to use the latest commits.) I've tried this, and the output looks OK:

>>> Parser.load('crf-con-en').predict(["hello world :)"], lang='en', verbose=False)[0].pretty_print()
       TOP        
        |          
        NP        
   _____|_____     
  _     _     _   
  |     |     |    
hello world :-RRB-

yochail commented 1 year ago

@yzhangcs Thanks! I see it was already fixed in commit https://github.com/yzhangcs/parser/commit/fe5c20395950187989955444a72c8a0e021ee389, but I didn't get the fix since I'm using the latest release, 1.1.4. Are there any plans to release a new stable version in the near future?

yzhangcs commented 1 year ago

@yochail Yeah, I do plan to release a big version, 1.2. However, many things remain to be updated, e.g., the config system, more efficient algorithms, more tasks, and more models. So I can't guarantee a firm date; maybe in the next few months... If you need it urgently, try installing the latest code from source :)

pip install -U git+https://github.com/yzhangcs/parser