Closed wkiri closed 3 years ago
I have confirmed the ability to train and test a CoreNLP 4.2.0 model with the `train_corenlp_ner.sh` script, using the same 10 documents for both training and testing (sanity check).
Some notes on the data:
I trained and evaluated a CoreNLP model on a single (random) split (75% train (n=824 docs), 25% test (n=275 docs)) using all 1099 annotated documents from MSL, MPF, and PHX and the MER-A gazette. This took 49 minutes to complete.
The results are quite good, but the totals are dominated by the Property class, which is the most common one. As usual, Target is the most challenging class and has the lowest precision and recall.
```
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 457579 words in 275 documents at 27367.17 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity P R F1 TP FP FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Element 0.9164 0.9180 0.9172 2730 249 244
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Mineral 0.9419 0.9384 0.9402 2043 126 134
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Property 0.9329 0.9519 0.9423 5656 407 286
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Target 0.8636 0.6759 0.7583 171 27 82
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals 0.9291 0.9342 0.9317 10600 809 746
```
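For reference, the P/R/F1 columns follow directly from the TP/FP/FN counts; a quick sanity check against the Target row above:

```python
# Recompute precision/recall/F1 from the TP/FP/FN counts in the Target row.
tp, fp, fn = 171, 27, 82

precision = tp / (tp + fp)   # 171 / 198
recall = tp / (tp + fn)      # 171 / 253
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
# → P=0.8636 R=0.6759 F1=0.7583
```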
I trained a different model without the Property class (29 mins) so it could be better compared with our previous results:
```
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 457579 words in 275 documents at 30327.35 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity P R F1 TP FP FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Element 0.9195 0.9213 0.9204 2740 240 234
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Mineral 0.9442 0.9403 0.9422 2047 121 130
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Target 0.8670 0.6957 0.7719 176 27 77
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals 0.9275 0.9184 0.9229 4963 388 441
```
Each entity type does a little better than when Property is also present, probably due to some confusion between Property and the other entity types. Overall, this performance is better than what we reported in our IAAI 2018 paper, although the results are not directly comparable (different test sets). In that case, we trained on LPSC 2015 (n=62 docs), tested on 35 docs from LPSC 2016, and obtained these totals:
- Precision: 0.945
- Recall: 0.777
- F1: 0.853
(Per-entity results were not fully reported in the IAAI paper.)
By the way, here is how I generated the train/test split:
```shell
$ shuf all-files.list | split -d -l $(( `wc -l <all-files.list` * 3 / 4 )) -a 1 --additional-suffix=.list
```

This generates `x0.list` (the 75% training split) and `x1.list` (the remaining 25% test split).
The Python script now uses `subprocess` to call CoreNLP (via Java) and train the CoreNLP NER model.
I separated out the "generic" properties into a file that can be shared; it is now in `/proj/mte/trained_models/corenlp_ner.prop`. The customizable options we care about (training data, gazette file, and output model filename) are inputs to the `train_ner.py` script and are passed to CoreNLP as custom command-line arguments.
Example usage:
```shell
$ export FILES=all-files.list
$ export CORENLP_PROP=/proj/mte/trained_models/corenlp_ner.prop
$ export GAZETTE=MERA_targets_minerals-2017-05_elements.gaz.txt
$ train_ner.py $FILES $CORENLP_PROP $GAZETTE ner_MERA.ser.gz
```
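Internally, the `subprocess` call amounts to something like this sketch (`build_train_cmd` and the exact flags shown are illustrative; the real `train_ner.py` may assemble the arguments differently):

```python
import subprocess

def build_train_cmd(train_file, prop_file, gazette, model_out, heap="60000m"):
    """Assemble a CoreNLP CRFClassifier training command.

    Options passed on the command line override those in the shared .prop file.
    """
    return [
        "java", f"-mx{heap}",
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-prop", prop_file,
        "-trainFile", train_file,
        "-gazette", gazette,
        "-serializeTo", model_out,
    ]

def train_corenlp_ner(*args, **kwargs):
    # check=True raises CalledProcessError if CoreNLP exits with an error.
    subprocess.run(build_train_cmd(*args, **kwargs), check=True)
```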
The `-t` option enables testing, if desired: one can provide an already-trained model and test it on (possibly new) input documents. Currently, you would need to interactively decline re-training the model, though. We may want to make this the default behavior (never overwriting a model if it exists), with a message to the user so they know to remove the file manually if they want to overwrite it.
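The proposed default could be a simple existence check before training (a hypothetical sketch; `check_model_path` is not part of the current `train_ner.py`):

```python
import os
import sys

def check_model_path(model_out):
    """Return True if training may proceed; refuse to overwrite an existing model."""
    if os.path.exists(model_out):
        print(f"Model {model_out} already exists; skipping training. "
              "Remove the file manually if you want to overwrite it.",
              file=sys.stderr)
        return False
    return True
```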
I am trying to evaluate when training over all MSL + MPF + PHX docs (including the Property class) but am getting out-of-memory and garbage collection errors from Java, even after increasing heap allocation and adjusting some other variables following advice here: https://nlp.stanford.edu/software/crf-faq.html
That seems a little weird since I was able to train on 75% of the docs as above without trouble. I'll continue to investigate.
I have resolved the memory error. I misread the previous memory specification as 6000m (6 GB), but actually it was 60000m (60 GB). With this change, and specifying `qnsize=10`, the training completes in 32 minutes.
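Concretely, the two changes amount to something like the following (a sketch; `qnsize` is the quasi-Newton history option mentioned above, and the exact placement of these lines in our scripts and prop file may differ):

```
# Java invocation: 60 GB heap, not 6 GB
java -mx60000m edu.stanford.nlp.ie.crf.CRFClassifier -prop corenlp_ner.prop ...

# In the .prop file: smaller quasi-Newton history to reduce memory use
qnsize = 10
```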
Performance on the training data (as a sanity check):
```
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1847235 words in 1099 documents at 28082.02 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity P R F1 TP FP FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Element 0.9783 0.9801 0.9792 10895 242 221
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Mineral 0.9967 0.9951 0.9959 8167 27 40
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Property 0.9785 0.9964 0.9874 25023 549 91
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Target 0.9911 0.9408 0.9653 1558 14 98
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals 0.9821 0.9902 0.9862 45643 832 450
```
Good news! I trained this model and saved it in `/proj/mte/trained_models/ner_MERA-property.ser.gz`.
After processing all 597 MER-A documents (PDF files corresponding to text files in `/proj/mte/data/corpus-lpsc/mer-a-with-targets/`), we find 6 documents with relations found by jSRE. These are only "contains" relations. We will need to train a new jSRE model to be sensitive to "has_property" relations as well.
We also need the NER model to find more Target entities to enable better relation extraction.
Todo: Retrain NER MER-A model with better gazette using "salient target" list
I removed 37 Target names from the MER-A gazette and re-trained an NER model, then re-applied it to the MER documents.
Model | Element | Mineral | Target | Contains relation | Property | # docs with at least one target |
---|---|---|---|---|---|---|
Original NER | 4894 | 6877 | 60 (37 unique) | 10 | 12400 | 34 / 597 |
Salient target NER | 4893 | 6868 | 62 (38 unique) | 10 | 12401 | 34 / 597 |
There are some minor changes, but it does not solve the bigger problem (low recall of Targets), which affects downstream relation detection.
Given these limitations, we will explore gazette-based augmentation of Target annotations in issue #24.
I trained a MER-B NER classifier using the same training docs (MPF+PHX+MSL) and the MER-B gazette. The (training) performance results for this model are similar to those achieved for MER-A:
```
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1847235 words in 1099 documents at 25282.42 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity P R F1 TP FP FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Element 0.9788 0.9801 0.9795 10895 236 221
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Mineral 0.9966 0.9951 0.9959 8167 28 40
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Property 0.9783 0.9964 0.9873 25023 555 91
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Target 0.9917 0.9414 0.9659 1559 13 97
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals 0.9821 0.9903 0.9862 45644 832 449
```
The MER-B NER model is available at `/proj/mte/trained_models/ner_MERB-property.ser.gz`.
I re-trained the MER-A NER model after identifying 4 more target names to use in the gazette. It took 45 minutes to train the model. The results are very similar to the previous MER-A NER model, as expected. I copied this model to `/proj/mte/trained_models/ner_MERA-property-salient.ser.gz`.
```
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1847235 words in 1099 documents at 27344.57 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Entity P R F1 TP FP FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Element 0.9784 0.9800 0.9792 10894 241 222
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Mineral 0.9967 0.9950 0.9959 8166 27 41
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Property 0.9785 0.9963 0.9873 25022 550 92
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Target 0.9911 0.9408 0.9653 1558 14 98
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Totals 0.9821 0.9902 0.9861 45640 832 453
```
The plan is to re-train the model using all available annotations (MSL, MPF, PHX).