Preprocessing script that converts txt file into tacred dataset format

BoPengGit commented 5 years ago

Is there a preprocessing script that can convert a plain txt file into the TACRED data format, so that the result can be used for prediction with a pre-trained model?

Thanks and God bless,

yuhaozhang commented 5 years ago

In order to do that, you'll basically need a pipeline for tokenization, POS tagging and named entity recognition. I don't have a script readily available, but it should not be too hard to create one with an existing NLP toolkit such as Stanford CoreNLP.
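To make the target of such a pipeline concrete, here is a minimal sketch of the final step only: packing already-annotated tokens into one TACRED-style JSON record. It assumes the tokens, POS tags and NER tags have already been produced by a toolkit such as CoreNLP, and that the subject and object spans are known. The function name `to_tacred_example` is hypothetical, and the field names follow the TACRED JSON layout as I understand it, so check them against the data loader before relying on this. The loader may also expect dependency fields such as `stanford_deprel`; those would have to come from a parser and are omitted here.

```python
import json


def to_tacred_example(ex_id, tokens, pos_tags, ner_tags, subj_span, obj_span,
                      relation="no_relation"):
    """Build one TACRED-style record from pre-annotated tokens.

    Hypothetical helper. Spans are (start, end) token indices with an
    inclusive end. The relation defaults to "no_relation" as a placeholder,
    since the label is unknown at prediction time.
    """
    n = len(tokens)
    if len(pos_tags) != n or len(ner_tags) != n:
        raise ValueError("tokens, pos_tags and ner_tags must have equal length")
    for start, end in (subj_span, obj_span):
        if not (0 <= start <= end < n):
            raise ValueError("span out of range: (%d, %d)" % (start, end))

    return {
        "id": ex_id,
        "relation": relation,
        "token": list(tokens),
        "subj_start": subj_span[0],
        "subj_end": subj_span[1],
        "obj_start": obj_span[0],
        "obj_end": obj_span[1],
        # Assumption: the entity type is the NER tag of the span's first token.
        "subj_type": ner_tags[subj_span[0]],
        "obj_type": ner_tags[obj_span[0]],
        "stanford_pos": list(pos_tags),
        "stanford_ner": list(ner_tags),
    }


# Usage: one hand-annotated sentence, serialized as a JSON list of records.
example = to_tacred_example(
    "example-0",
    ["Bill", "Gates", "founded", "Microsoft", "."],
    ["NNP", "NNP", "VBD", "NNP", "."],
    ["PERSON", "PERSON", "O", "ORGANIZATION", "O"],
    subj_span=(0, 1),
    obj_span=(3, 3),
)
print(json.dumps([example], indent=2))
```

Each candidate pair of entity mentions in a sentence would become its own record, so a sentence with several entities yields several records.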

On a related note, have you looked at the KBP annotator in CoreNLP (with KBP standing for Knowledge Base Population)? It is a well-packaged pipeline that takes a piece of text as input and outputs relation triples. One difference is that this KBP system is a combination of rules, patterns and a logistic regression classifier, unlike the neural network system in this repo, but the logistic regression classifier is indeed trained on the TACRED dataset, so you should expect decent results from it. More details of this KBP system can be found in this paper.

BoPengGit commented 5 years ago

Hi Yuhao,

That's very interesting and thanks for the reference to the KBP annotator. I will look into it in the next month or so.

If you have any other ideas or suggestions for extracting relation triples from an input text, please feel free to post them here.

Thanks and I'll get back to you once I look at it.

Best and God bless,

onehaitao commented 4 years ago

I wrote a script to convert SemEval2010 data to the TACRED dataset format. Maybe it will help you.
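For readers who only want the general idea, the core of such a conversion might look roughly like the sketch below. This is not the script mentioned above. It assumes the SemEval-2010 Task 8 convention of marking the two entities inline with `<e1>...</e1>` and `<e2>...</e2>` tags, uses plain whitespace tokenization (so punctuation stays attached to words), and the function name `semeval_to_spans` is hypothetical.

```python
import re

# Split on the entity tags while keeping them as separate pieces.
TAG = re.compile(r"(</?e[12]>)")


def semeval_to_spans(sentence):
    """Return (tokens, e1_span, e2_span) for a tagged SemEval-style sentence.

    Spans are (start, end) token indices with an inclusive end, matching
    the subj_start/subj_end style used by TACRED.
    """
    tokens = []
    starts = {}
    spans = {}
    for piece in TAG.split(sentence):
        if piece in ("<e1>", "<e2>"):
            # The entity begins at the next token to be appended.
            starts[piece[1:3]] = len(tokens)
        elif piece in ("</e1>", "</e2>"):
            name = piece[2:4]
            spans[name] = (starts[name], len(tokens) - 1)
        else:
            tokens.extend(piece.split())
    return tokens, spans["e1"], spans["e2"]


tokens, e1, e2 = semeval_to_spans(
    "The <e1>author</e1> of a keygen uses a <e2>disassembler</e2> to look."
)
print(tokens, e1, e2)
```

Which entity becomes the subject and which the object, and how the relation labels are mapped, depends on the relation direction in the SemEval annotation, which this sketch does not handle.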

yuhaozhang / tacred-relation

Preprocessing script that converts txt file into tacred dataset format #9