wkiri / MTE

Mars Target Encyclopedia

Train CoreNLP named entity recognizer for use with MER documents #15

Closed: wkiri closed this issue 2 years ago

wkiri commented 2 years ago

The plan is to re-train the model using all available annotations (MSL, MPF, PHX).

wkiri commented 2 years ago

I have confirmed the ability to train and test a CoreNLP 4.2.0 model with the train_corenlp_ner.sh script, using the same 10 documents for both training and testing (sanity check).

Some notes on the data:

wkiri commented 2 years ago

I trained and evaluated a CoreNLP model on a single (random) split (75% train (n=824 docs), 25% test (n=275 docs)) using all 1099 annotated documents from MSL, MPF, and PHX and the MER-A gazette. This took 49 minutes to complete.

The results are quite good, but the totals are dominated by the Property class which is the most common one. As usual, Target is the most challenging class and has the lowest precision and recall.

[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 457579 words in 275 documents at 27367.17 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Entity    P       R       F1      TP      FP      FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Element    0.9164  0.9180  0.9172  2730    249     244
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Mineral    0.9419  0.9384  0.9402  2043    126     134
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -        Property    0.9329  0.9519  0.9423  5656    407     286
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Target    0.8636  0.6759  0.7583  171     27      82
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Totals    0.9291  0.9342  0.9317  10600   809     746

I trained a different model without the Property class (29 mins) so it could be better compared with our previous results:

[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 457579 words in 275 documents at 30327.35 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Entity    P       R       F1      TP      FP      FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Element    0.9195  0.9213  0.9204  2740    240     234
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Mineral    0.9442  0.9403  0.9422  2047    121     130
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Target    0.8670  0.6957  0.7719  176     27      77
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Totals    0.9275  0.9184  0.9229  4963    388     441

Each entity type does a little better than when Property is also present, probably due to some confusion between Property and the other entity types. Overall, this performance is better than what we reported in our IAAI 2018 paper, although the results are not directly comparable (different test sets). In that work we trained on LPSC 2015 (n=62 docs), tested on 35 docs from LPSC 2016, and obtained these totals:

Precision 0.945
Recall 0.777
F1 0.853

(Per-entity results were not fully reported in the IAAI paper.)

wkiri commented 2 years ago

By the way, here is how I generated the train/test split:

$ shuf all-files.list | split -d -l $(( `wc -l <all-files.list` * 3 / 4 )) -a 1 --additional-suffix=.list

This generates x0.list and x1.list.
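
For clarity, here is a minimal Python equivalent of that shuffle-and-split one-liner. This is a sketch; it assumes the same all-files.list input and x0.list/x1.list output names as above.

import random

# Read the full document list, one path per line.
with open('all-files.list') as f:
    files = [line.strip() for line in f if line.strip()]

# Shuffle, then put the first 75% in x0.list (train)
# and the remaining 25% in x1.list (test).
random.shuffle(files)
n_train = len(files) * 3 // 4
for name, chunk in [('x0.list', files[:n_train]), ('x1.list', files[n_train:])]:
    with open(name, 'w') as out:
        out.write('\n'.join(chunk) + '\n')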

wkiri commented 2 years ago

The Python script now uses subprocess to call CoreNLP (via Java) and train the CoreNLP NER model. I separated out the "generic" properties into a file that can be shared; it now lives in /proj/mte/trained_models/corenlp_ner.prop. The customizable options we care about (training data, gazette file, and output model filename) are inputs to the train_ner.py script and are passed to CoreNLP as custom command-line arguments (a sketch of that call follows the usage example below).

Example usage:

$ export FILES=all-files.list
$ export CORENLP_PROP=/proj/mte/trained_models/corenlp_ner.prop
$ export GAZETTE=MERA_targets_minerals-2017-05_elements.gaz.txt
$ train_ner.py $FILES $CORENLP_PROP $GAZETTE ner_MERA.ser.gz
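
For reference, a minimal sketch of what the subprocess call inside train_ner.py might look like. The flag names (-prop, -trainFileList, -gazette, -serializeTo) follow CoreNLP's convention of overriding properties on the command line, but they are assumptions here; check your CoreNLP version's CRFClassifier documentation.

import subprocess

def train_corenlp_ner(file_list, prop_file, gazette, model_out,
                      corenlp_jar='stanford-corenlp-4.2.0.jar'):
    # Join the training files into the comma-separated form that the
    # trainFileList property expects (an assumption; verify locally).
    with open(file_list) as f:
        train_files = ','.join(line.strip() for line in f if line.strip())
    cmd = ['java', '-cp', corenlp_jar,
           'edu.stanford.nlp.ie.crf.CRFClassifier',
           '-prop', prop_file,            # shared "generic" properties
           '-trainFileList', train_files, # training data
           '-gazette', gazette,           # mission-specific gazette
           '-serializeTo', model_out]     # output model filename
    subprocess.run(cmd, check=True)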

wkiri commented 2 years ago

The -t option enables testing, if desired. This means one can provide an already-trained model and test it on (possibly new) input documents. Currently you would need to interactively decline re-training the model, though. We may want to make this the default behavior (never overwrite a model that already exists), with a message so the user knows to remove the file manually if they do want to overwrite it.
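
A minimal sketch of that proposed guard, assuming the output model path is known when the script starts:

import os
import sys

def check_model_path(model_out):
    # Proposed default: never overwrite an existing model. Tell the user
    # to remove the file manually if re-training is really intended.
    if os.path.exists(model_out):
        sys.exit('%s already exists; remove it to re-train, '
                 'or use -t to test the existing model.' % model_out)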

I am trying to evaluate a model trained over all MSL + MPF + PHX docs (including the Property class), but I am getting out-of-memory and garbage-collection errors from Java, even after increasing the heap allocation and adjusting some other variables following the advice here: https://nlp.stanford.edu/software/crf-faq.html
That seems a little odd, since I was able to train on 75% of the docs (above) without trouble. I'll continue to investigate.

wkiri commented 2 years ago

I have resolved the memory error. I had misread the previous memory specification as 6000m (6 GB), when it was actually 60000m (60 GB). With this change, and with qnSize=10, training completes in 32 minutes.
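
For the record, a sketch of how those two settings might be passed, assuming the standard -Xmx JVM heap flag and CoreNLP's qnSize property (which the CRF FAQ suggests lowering to reduce memory use); the jar and prop paths are illustrative:

import subprocess

corenlp_jar = 'stanford-corenlp-4.2.0.jar'
prop_file = '/proj/mte/trained_models/corenlp_ner.prop'

cmd = ['java', '-Xmx60000m',   # 60 GB heap (60000m), not 6 GB (6000m)
       '-cp', corenlp_jar,
       'edu.stanford.nlp.ie.crf.CRFClassifier',
       '-prop', prop_file,
       '-qnSize', '10']        # smaller quasi-Newton history
subprocess.run(cmd, check=True)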

Performance on the training data (as a sanity check):

[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1847235 words in 1099 documents at 28082.02 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Entity    P       R       F1      TP      FP      FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Element    0.9783  0.9801  0.9792  10895   242     221
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Mineral    0.9967  0.9951  0.9959  8167    27      40
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -        Property    0.9785  0.9964  0.9874  25023   549     91
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Target    0.9911  0.9408  0.9653  1558    14      98
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Totals    0.9821  0.9902  0.9862  45643   832     450

wkiri commented 2 years ago

Good news! I trained this model and saved it in /proj/mte/trained_models/ner_MERA-property.ser.gz

After processing all 597 MER-A documents (PDF files corresponding to the text files in /proj/mte/data/corpus-lpsc/mer-a-with-targets/), jSRE finds relations in only 6 documents, and these are all "contains" relations. We will need to train a new jSRE model that is also sensitive to "has_property" relations.

wkiri commented 2 years ago

We also need the NER model to find more Target entities to enable better relation extraction.

Todo: Retrain the MER-A NER model with a better gazette based on the "salient target" list.

wkiri commented 2 years ago

I removed 37 Target names from the MER-A gazette and re-trained an NER model, then re-applied it to the MER documents.

Model                Element   Mineral   Target            Contains relation   Property   # docs with at least one target
Original NER         4894      6877      60 (37 unique)    10                  12400      34 / 597
Salient target NER   4893      6868      62 (38 unique)    10                  12401      34 / 597

There are some minor changes, but it does not solve the bigger problem (low recall of Targets), which affects downstream relation detection.

wkiri commented 2 years ago

Given these limitations, we will explore gazette-based augmentation of Target annotations in issue #24.

wkiri commented 2 years ago

I trained a MER-B NER classifier using the same training docs (MPF+PHX+MSL) and the MER-B gazette. The (training) performance results for this model are similar to those achieved for MER-A:

[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1847235 words in 1099 documents at 25282.42 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Entity    P       R       F1      TP      FP      FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Element    0.9788  0.9801  0.9795  10895   236     221
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Mineral    0.9966  0.9951  0.9959  8167    28      40
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -        Property    0.9783  0.9964  0.9873  25023   555     91
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Target    0.9917  0.9414  0.9659  1559    13      97
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Totals    0.9821  0.9903  0.9862  45644   832     449

wkiri commented 2 years ago

The MER-B NER model is available at /proj/mte/trained_models/ner_MERB-property.ser.gz.

wkiri commented 2 years ago

I re-trained the MER-A NER model after identifying 4 more target names to use in the gazette. It took 45 minutes to train the model. The results are very similar to the previous MER-A NER model, as expected. I copied this model to /proj/mte/trained_models/ner_MERA-property-salient.ser.gz.

[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - CRFClassifier tagged 1847235 words in 1099 documents at 27344.57 words per second.
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Entity    P       R       F1      TP      FP      FN
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Element    0.9784  0.9800  0.9792  10894   241     222
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -         Mineral    0.9967  0.9950  0.9959  8166    27      41
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -        Property    0.9785  0.9963  0.9873  25022   550     92
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Target    0.9911  0.9408  0.9653  1558    14      98
[main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier -          Totals    0.9821  0.9902  0.9861  45640   832     453