Add ability to auto-annotate entities using a gazette

wkiri commented 2 years ago

We want to increase recall of Target types specifically. One way is to do a string-based matching from a gazette of entity terms. We can provide a gazette file that consists of "Entity_type Entity_name" pairs to inform the string matching.

Steven suggests integrating this capability into corenlp_parser.py. It would happen after applying the trained NER model and before running relation extraction.

Suggest adding an option like -g <gazette_file> which, if specified, would activate this capability.
We don't want to generate duplicate annotations, so this step should check to see if a given entity is already marked by the NER system and if so, omit adding the new entity annotation.
The pre_annotate.py script can be an inspiration for this capability (but does not need to be followed exactly).

stevenlujpl commented 2 years ago

@wkiri I've updated the parser scripts to auto-annotate words using a gazette file. The changes are currently in the target-gazette branch. I tested them on one document, and everything looks correct to me. I will test with more docs next week and then merge the changes into the master branch.

I added a command line argument -g or --gazette_file to corenlp_parser.py, jsre_parser.py, paper_parser.py, jgr_parser.py, and lpsc_parser.py. The argument is optional. If a gazette file is provided, we do gazette target matching and add the matching targets to the jsonl file. Otherwise, gazette target matching is disabled. The gazette file must consist of "entity_type entity_name" pairs. It is ok to have multiple entity types (e.g., Element, Mineral, Target); only Target entities will be used.
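As a rough illustration of the option and file format described above (not the code in the target-gazette branch), here is a minimal sketch. The function names `build_arg_parser` and `load_gazette` are hypothetical, and it assumes one pair per line with the entity type as the first whitespace-delimited token and the rest of the line as the (possibly multi-word) entity name.

```python
import argparse
from collections import defaultdict


def build_arg_parser():
    # Hypothetical helper: only the -g/--gazette_file option comes from the
    # discussion above; everything else about the parser is omitted.
    parser = argparse.ArgumentParser(description="Parser with optional gazette matching")
    parser.add_argument("-g", "--gazette_file", default=None,
                        help="optional gazette file; enables gazette target matching")
    return parser


def load_gazette(path, keep_types=("Target",)):
    """Read 'entity_type entity_name' lines and keep only the requested types."""
    names = defaultdict(set)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Assumption: the type is the first token and the remainder of the
            # line is the name, so "Target Confidence Hills" keeps both words.
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            entity_type, entity_name = parts
            if entity_type in keep_types:
                names[entity_type].add(entity_name.strip())
    return dict(names)
```

With this layout, Element and Mineral lines in the same file are simply skipped, which matches the "only Target entities will be used" behavior.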

The format of the gazette targets written to the jsonl file is similar to the format of NER targets. The only difference is the source field. The source fields of gazette targets will have the value gazette, and the source fields of CoreNLP targets will have the value corenlp. Please see the following gazette targets as examples:

{'text': u'Rocknest', 'begin': 1932, 'end': 1940, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 1987, 'end': 1995, 'source': 'gazette', 'label': 'Target'}
{'text': u'RMI', 'begin': 2841, 'end': 2844, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 3366, 'end': 3374, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 4426, 'end': 4434, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 4474, 'end': 4482, 'source': 'gazette', 'label': 'Target'}
{'text': u'RMI', 'begin': 4629, 'end': 4632, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 5290, 'end': 5298, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 5376, 'end': 5384, 'source': 'gazette', 'label': 'Target'}
{'text': u'Confidence Hills', 'begin': 6715, 'end': 6731, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 6801, 'end': 6809, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 6903, 'end': 6911, 'source': 'gazette', 'label': 'Target'}

The test I did used one MSL 2015 doc (2015_2767.pdf) and the gazette file at /home/youlu/MTE/MTE/ref/targets_minerals-2017-05_elements.gaz.txt. The resulting jsonl file is at /home/youlu/MTE/working_dir/target_gazette/2767_gazette.jsonl. The gazette targets listed above were found in my test run. These gazette targets are then filtered against the NER targets: if a target is already detected by CoreNLP, it won't be written to the jsonl file. In this case, Confidence Hills isn't written to the jsonl file because CoreNLP has already detected it as a Target.
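For readers who want to see the match-then-filter idea in code, here is a minimal sketch. It is not the implementation in the parser scripts: the function names are hypothetical, the matching is a plain case-sensitive regex over the document text, and "already detected" is assumed to mean that the gazette span overlaps an existing NER span.

```python
import re


def find_gazette_targets(text, target_names):
    """Return gazette matches as dicts shaped like the examples above."""
    if not target_names:
        return []
    # Longest names first so a multi-word name is preferred over a shorter
    # name that is a prefix of it.
    alternatives = sorted(target_names, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in alternatives) + r")(?!\w)")
    return [{"text": m.group(0), "begin": m.start(), "end": m.end(),
             "source": "gazette", "label": "Target"}
            for m in pattern.finditer(text)]


def overlaps(a, b):
    """True if the half-open character spans of a and b intersect."""
    return a["begin"] < b["end"] and b["begin"] < a["end"]


def drop_known(gazette_targets, ner_entities):
    """Keep only gazette targets that no existing NER entity overlaps."""
    return [g for g in gazette_targets
            if not any(overlaps(g, n) for n in ner_entities)]
```

The lookarounds `(?<!\w)` and `(?!\w)` keep a name from matching inside a longer word. Whether matching should be case-insensitive is a separate choice that this sketch does not make.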

stevenlujpl commented 2 years ago

@wkiri I currently don't have a good way to test the begin and end indices. Do you think we can use brat to test them? I can manually create an .ann file using these gazette targets, but I am not sure how to set up the brat service.

wkiri commented 2 years ago

@stevenlujpl Thank you so much for this addition! I will give it a try.

To view .ann files with brat, log in to mlia-web-external and create a directory under /var/www/brat/data/test/youlu/<your_dir>. Copy your .txt and .ann files here. Make the directory and files world-readable and executable (chmod -R a+rX <your_dir>) and change it to group apache (chgrp -R apache <your_dir>). Then visit https://ml.jpl.nasa.gov/mte/brat/#/test/youlu/ to view it on the web. Let me know if this works for you.
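Before copying files over, the begin and end indices can also be sanity-checked offline, and the .ann lines can be generated rather than written by hand. The sketch below uses hypothetical helper names and assumes the offsets are character offsets into the same .txt file that brat will display. The line layout follows brat's standoff format for text-bound annotations (ID, tab, type and offsets, tab, text).

```python
def check_spans(text, targets):
    """Return the targets whose begin/end offsets do not reproduce their text."""
    return [t for t in targets if text[t["begin"]:t["end"]] != t["text"]]


def to_brat_lines(targets):
    """Format targets as brat text-bound annotation lines, numbered from T1."""
    return ["T%d\t%s %d %d\t%s" % (i, t["label"], t["begin"], t["end"], t["text"])
            for i, t in enumerate(targets, start=1)]
```

An empty result from `check_spans` means every span slices back to its own text. Writing the output of `to_brat_lines` to `<doc>.ann` next to `<doc>.txt` gives brat something to display.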

wkiri commented 2 years ago

I also really like that these entities have a different source value. Great idea!

wkiri commented 2 years ago

It seems there is a gap here. The code in corenlp_parser.py uses the gazette to update the ner list (for the whole document), but jSRE looks at the per-token annotations which each have a ner field. As a result, none of the newly added NERs are seen by jSRE, and no new relations are generated. I'm looking into this.

stevenlujpl commented 2 years ago

Yes, there is indeed a gap. In addition to updating the ner list, I should also update the per-token annotations to overwrite the CoreNLP NER labels.
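A minimal sketch of that per-token update, under stated assumptions: it is not the change that was eventually committed, the function name is hypothetical, and each token is assumed to be a dict with `characterOffsetBegin`, `characterOffsetEnd`, and `ner` keys, grouped under a `tokens` list per sentence.

```python
def apply_gazette_to_tokens(sentences, gazette_targets):
    """Overwrite the per-token 'ner' label for tokens inside a gazette span.

    Returns the number of tokens whose label was overwritten.
    """
    updated = 0
    for sentence in sentences:
        for token in sentence["tokens"]:
            for target in gazette_targets:
                # Assumption: a token belongs to a target when its character
                # span lies entirely within the target's span.
                inside = (token["characterOffsetBegin"] >= target["begin"]
                          and token["characterOffsetEnd"] <= target["end"])
                if inside:
                    token["ner"] = target["label"]
                    updated += 1
                    break
    return updated
```

For a multi-word target such as Confidence Hills, both tokens receive the Target label, so a consumer that reads only per-token labels sees the whole entity.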

wkiri commented 2 years ago

I have made this update. We now detect more relations!

I also had to update the logic in json2brat.py to handle cases where we have a relation detected separately for a token within a multi-word entity (since this caused duplicate relations). I have fixed this as well.

You can see test output for two documents here:

2005_1710 previously had 0 Targets and 0 relations; now has 11 Targets and 9 relations
2005_2125 had 1 Target and 1 relation; now has 8 Targets and 11 relations

https://ml.jpl.nasa.gov/mte/brat/#/mer-a/mer-a-brat-gazette-testrel/

I am now running this for the full MER collection. Note: the gazette addition increases runtime significantly. Previously it took about 17 minutes to run lpsc_parser.py on this collection, and now it takes about 58 minutes. If we get much higher recall out of this, it is worth the additional runtime!

wkiri commented 2 years ago

The results are in!

	Docs	Elements	Minerals	Targets	Relations (Contains)	Property	# docs with annotations	# docs with at least one target
Original NER	597	4894	6877	60 (37 unique)	10	12400	597 - 16 = 581	34
Salient target NER	597	4893	6868	62 (38 unique): added DodoGoldilocks	10	12401	597 - 16 = 581	34
Salient target NER + gazette	597	4893	6868	1714	1697 (94 docs)	12401	597 - 13 = 584	245

The annotated (brat) files are here: https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette/

I will review these for initial quality control and identify which documents will pass on for expert review.

wkiri commented 2 years ago

I did a quick review and removed obviously spurious Target annotations and any associated Contains relations.

We now have:

	Docs	Elements	Minerals	Targets	Relations (Contains)	Property	# docs with annotations	# docs with at least one target
Original NER	597	4894	6877	60 (37 unique)	10	12400	597 - 16 = 581	34
Salient target NER	597	4893	6868	62 (38 unique): added DodoGoldilocks	10	12401	597 - 16 = 581	34
Salient target NER + gazette	597	4893	6868	1714	1697 (94 docs)	12401	597 - 13 = 584	245
Salient target NER + gazette + Kiri's quick review	597	4893	6868	1531	1665 (86 docs)	12401	597 - 14 = 583	161

Next I will ask for expert review (10 docs each from the 86 with Contains relations) and we'll get feedback to see if any changes need to be made to our processing before reviewing the rest.

stevenlujpl commented 2 years ago

The changes have been merged into the master branch. I will close this issue.

wkiri / MTE

Add ability to auto-annotate entities using a gazette #24