yzhangcs / parser

:rocket: State-of-the-art parsers for natural language.
https://parser.yzhang.site/
MIT License
827 stars 139 forks

integrated tokenizer #47

Closed attardi closed 3 years ago

attardi commented 3 years ago

Integrating a tokenizer requires only a simple change. In utils/transform.py, add an optional parameter to the method CoNLL.transform.init():

    reader=open

and then set

    self.reader = reader

and in CoNLL.load(), use it when reading:

    if isinstance(data, str):
        if not hasattr(self, 'reader'):
            self.reader = open  # back compatibility
        with self.reader(data) as f:
            lines = [line.strip() for line in f]
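To make the proposal concrete, here is a minimal, runnable sketch of the pattern under discussion. The class below is a toy stand-in, not supar's actual CoNLL class, and toy_reader is a hypothetical custom reader; only the reader=open parameter and the back-compatibility check mirror the suggested change.

```python
import contextlib
import io


class CoNLL:
    def __init__(self, reader=open):
        # reader defaults to the builtin open, so existing callers
        # that pass a file path are unaffected
        self.reader = reader

    def load(self, data):
        if isinstance(data, str):
            if not hasattr(self, 'reader'):
                self.reader = open  # back compatibility with old instances
            with self.reader(data) as f:
                lines = [line.strip() for line in f]
            return lines
        return data


# A custom reader can treat the string as raw text instead of a file path;
# this toy reader just emits one whitespace token per line.
@contextlib.contextmanager
def toy_reader(text):
    yield io.StringIO('\n'.join(text.split()))


transform = CoNLL(reader=toy_reader)
print(transform.load('Hello world'))  # ['Hello', 'world']
```

Because the default stays `open`, code that passes a file path keeps working unchanged; only callers that opt in to a custom reader see new behavior.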

You can then pass an NLTK tokenizer or a Stanza tokenizer as the reader. I use this code to interface to Stanza:

tokenizer.py.txt

yzhangcs commented 3 years ago

Could you give me some examples? What do you expect it to return when passing a line in which a word is tokenized into several pieces?

attardi commented 3 years ago

I am passing it plain text, either a file or a string. It invokes the tokenizer for the given language and gets the output in CoNLL format for the parser to read.
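The conversion described here can be sketched as follows. This is a hypothetical adapter, not the attached tokenizer.py.txt: it renders one tokenized sentence as 10-column CoNLL-U lines of the kind CoNLL.load consumes, with only the ID and FORM columns filled in.

```python
def to_conllu(words):
    """Render one tokenized sentence as CoNLL-U lines.

    Only ID and FORM are filled in; the remaining eight columns
    (LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) are left
    as underscores for the parser to predict or ignore.
    """
    return ['\t'.join([str(i), w] + ['_'] * 8)
            for i, w in enumerate(words, 1)]


print('\n'.join(to_conllu(['The', 'parser', 'works', '.'])))
```

A tokenizer-backed reader would run the language-specific tokenizer over the raw text, then yield lines like these (sentences separated by blank lines) in place of the contents of a file.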

I enclose a fix to the code. tokenizer.py.txt

yzhangcs commented 3 years ago

@attardi That seems feasible.