wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Add ability to auto-annotate entities using a gazette #24

Closed wkiri closed 2 years ago

wkiri commented 2 years ago

We want to increase recall for the Target type specifically. One way is to do string-based matching against a gazette of entity terms. We can provide a gazette file that consists of "Entity_type Entity_name" pairs to inform the string matching.

Steven suggests integrating this capability into corenlp_parser.py. It would happen after applying the trained NER model and before running relation extraction.

stevenlujpl commented 2 years ago

@wkiri I've updated the parser scripts to auto-annotate words using a gazette file. The changes I made are currently in the target-gazette branch. I tested the changes using one document, and everything looks correct to me. I will test with more docs next week, and then merge the changes into the master branch.

I added a command line argument -g or --gazette_file to corenlp_parser.py, jsre_parser.py, paper_parser.py, jgr_parser.py, and lpsc_parser.py. The -g or --gazette_file argument is optional. If a gazette file is provided, we will do gazette target matching and add the matching targets to the jsonl file. Otherwise, gazette target matching will be disabled. Each line of the gazette file must have the form "entity_type entity_name". It is ok to have multiple entity types (e.g., Element, Mineral, Target); only Target entities will be used.
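A minimal sketch of the optional argument and the gazette loading described above. The function names `build_arg_parser` and `load_gazette_targets` are hypothetical, and the sketch assumes the entity type and entity name are separated by whitespace, with everything after the first whitespace run treated as the name (so multi-word names such as "Confidence Hills" survive). The actual parser scripts may differ.

```python
import argparse


def build_arg_parser():
    """Hypothetical front end exposing the optional -g / --gazette_file flag."""
    parser = argparse.ArgumentParser(description="Sketch of a parser script CLI")
    parser.add_argument(
        "-g", "--gazette_file", default=None,
        help="Optional gazette file; gazette matching is disabled if omitted")
    return parser


def load_gazette_targets(lines, wanted_type="Target"):
    """Return the set of entity names whose entity type equals wanted_type.

    Assumes each line looks like "entity_type entity_name" (an assumption
    about the separator; the real file format may be stricter).
    """
    targets = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split on the first whitespace run only
        if len(parts) != 2:
            continue  # malformed line; silently skipped in this sketch
        entity_type, entity_name = parts
        if entity_type == wanted_type:
            targets.add(entity_name.strip())
    return targets
```

With this shape, a caller would load the gazette only when `args.gazette_file` is not `None`, which matches the "disabled unless provided" behavior.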

The format of the gazette targets written to the jsonl file is similar to the format of NER targets. The only difference is the source field. The source fields of gazette targets will have the value gazette, and the source fields of CoreNLP targets will have the value corenlp. Please see the following gazette targets as examples:

{'text': u'Rocknest', 'begin': 1932, 'end': 1940, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 1987, 'end': 1995, 'source': 'gazette', 'label': 'Target'}
{'text': u'RMI', 'begin': 2841, 'end': 2844, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 3366, 'end': 3374, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 4426, 'end': 4434, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 4474, 'end': 4482, 'source': 'gazette', 'label': 'Target'}
{'text': u'RMI', 'begin': 4629, 'end': 4632, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 5290, 'end': 5298, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 5376, 'end': 5384, 'source': 'gazette', 'label': 'Target'}
{'text': u'Confidence Hills', 'begin': 6715, 'end': 6731, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 6801, 'end': 6809, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 6903, 'end': 6911, 'source': 'gazette', 'label': 'Target'}
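For illustration, records of this shape could be produced by a simple string matcher like the one below. This is a sketch, not the code in the branch: `find_gazette_targets` is a hypothetical name, and case-sensitive whole-word matching is an assumption. As in the examples above, `end` is exclusive, so `end - begin` equals the length of the matched text.

```python
import re


def find_gazette_targets(text, target_names):
    """Return one dict per whole-word occurrence of a gazette name in text."""
    matches = []
    for name in target_names:
        # \b on both sides avoids hits inside longer words (assumed behavior).
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        for m in pattern.finditer(text):
            matches.append({
                "text": m.group(0),
                "begin": m.start(),  # character offset of the first character
                "end": m.end(),      # exclusive end offset
                "source": "gazette",
                "label": "Target",
            })
    # Sort by position so the output reads in document order.
    matches.sort(key=lambda d: (d["begin"], d["end"]))
    return matches
```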

The test I did was using one MSL 2015 doc (2015_2767.pdf) and the gazette file at /home/youlu/MTE/MTE/ref/targets_minerals-2017-05_elements.gaz.txt. The resulting jsonl file is at /home/youlu/MTE/working_dir/target_gazette/2767_gazette.jsonl. The gazette targets listed above were found in my test run. These gazette targets will be filtered against the NER targets. If a target is already detected by CoreNLP, it won't be written to the jsonl file. In this case, Confidence Hills isn't written to the jsonl file because CoreNLP has already detected it as a Target.
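The filtering step could look roughly like the sketch below. It assumes "already detected" means the gazette span overlaps a CoreNLP Target span; the real code may instead compare exact offsets or text. `filter_gazette_targets` is a hypothetical name.

```python
def filter_gazette_targets(gazette_targets, ner_targets):
    """Drop gazette matches whose span overlaps an existing NER Target.

    Both inputs are lists of dicts with 'begin', 'end', and 'label' keys.
    Overlap-based filtering is an assumption made for this sketch.
    """
    ner_spans = [(n["begin"], n["end"]) for n in ner_targets
                 if n.get("label") == "Target"]
    kept = []
    for g in gazette_targets:
        overlaps = any(g["begin"] < n_end and n_begin < g["end"]
                       for n_begin, n_end in ner_spans)
        if not overlaps:
            kept.append(g)
    return kept
```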

stevenlujpl commented 2 years ago

@wkiri I currently don't have a good way to test the begin and end indices. Do you think we can use brat to test them? I can manually create an .ann file from these gazette targets, but I am not sure how to set up the brat service.
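One possible way to do both checks is sketched below: a direct offset check against the document text, and conversion to brat standoff lines (`T<id>`, a tab, `<label> <begin> <end>`, a tab, then the text). The helper names are hypothetical, and the sketch assumes the offsets index into the same .txt file that brat will display.

```python
def check_offsets(text, targets):
    """Return the targets whose [begin, end) slice of text differs from 'text'."""
    return [t for t in targets if text[t["begin"]:t["end"]] != t["text"]]


def to_brat_ann(targets):
    """Render targets as brat standoff text-bound annotation lines."""
    lines = []
    for i, t in enumerate(targets, start=1):
        # Format: T<id> TAB <label> <begin> <end> TAB <covered text>
        lines.append("T%d\t%s %d %d\t%s\n"
                     % (i, t["label"], t["begin"], t["end"], t["text"]))
    return "".join(lines)
```

An empty result from `check_offsets` means every span matches its text, and the string from `to_brat_ann` can be written to an .ann file next to the corresponding .txt file.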

wkiri commented 2 years ago

@stevenlujpl Thank you so much for this addition! I will give it a try.

To view .ann files with brat, log in to mlia-web-external and create a directory under /var/www/brat/data/test/youlu/<your_dir>. Copy your .txt and .ann files here. Make the directory and files world-readable and executable (chmod -R a+rX <your_dir>) and change it to group apache (chgrp -R apache <your_dir>). Then visit https://ml.jpl.nasa.gov/mte/brat/#/test/youlu/ to view it on the web. Let me know if this works for you.

wkiri commented 2 years ago

I also really like that these entities have a different source value. Great idea!

wkiri commented 2 years ago

It seems there is a gap here. The code in corenlp_parser.py uses the gazette to update the ner list (for the whole document), but jSRE looks at the per-token annotations which each have a ner field. As a result, none of the newly added NERs are seen by jSRE, and no new relations are generated. I'm looking into this.

stevenlujpl commented 2 years ago

Yes, there is indeed a gap. In addition to updating the ner list, I should also update the per-token annotations to overwrite the CoreNLP NER labels.
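A rough sketch of that per-token update is below. It assumes the parsed document keeps CoreNLP-style sentences, where each token carries `characterOffsetBegin`, `characterOffsetEnd`, and `ner` fields; the layout actually used by the parser scripts may differ, and `apply_gazette_to_tokens` is a hypothetical name.

```python
def apply_gazette_to_tokens(sentences, gazette_targets):
    """Overwrite the 'ner' label of every token that lies inside a gazette span.

    sentences: list of dicts, each with a 'tokens' list (CoreNLP-style, assumed).
    gazette_targets: list of dicts with 'begin', 'end', and 'label' keys.
    The token dicts are modified in place and the sentences are returned.
    """
    for sentence in sentences:
        for token in sentence["tokens"]:
            for g in gazette_targets:
                inside = (token["characterOffsetBegin"] >= g["begin"]
                          and token["characterOffsetEnd"] <= g["end"])
                if inside:
                    token["ner"] = g["label"]
                    break  # the first matching span wins in this sketch
    return sentences
```

Because every token of a multi-word entity gets the label, a downstream step that reads per-token `ner` fields would see the whole gazette span as a Target.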

wkiri commented 2 years ago

I have made this update. We now detect more relations!

I also had to update the logic in json2brat.py to handle cases where we have a relation detected separately for a token within a multi-word entity (since this caused duplicate relations). I have fixed this as well.

You can see test output for two documents here:

https://ml.jpl.nasa.gov/mte/brat/#/mer-a/mer-a-brat-gazette-testrel/

I am now running this for the full MER collection. Note: the gazette addition increases runtime significantly. Previously it took about 17 minutes to run lpsc_parser.py on this collection, and now it takes about 58 minutes. If we get much higher recall out of this, it is worth additional runtime!

wkiri commented 2 years ago

The results are in!

| | Docs | Elements | Minerals | Targets | Relations (Contains) | Property | # docs with annotations | # docs with at least one target |
|---|---|---|---|---|---|---|---|---|
| Original NER | 597 | 4894 | 6877 | 60 (37 unique) | 10 | 12400 | 597 - 16 = 581 | 34 |
| Salient target NER | 597 | 4893 | 6868 | 62 (38 unique): added DodoGoldilocks | 10 | 12401 | 597 - 16 = 581 | 34 |
| Salient target NER + gazette | 597 | 4893 | 6868 | 1714 | 1697 (94 docs) | 12401 | 597 - 13 = 584 | 245 |

The annotated (brat) files are here: https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette/

I will review these for initial quality control and identify which documents will pass on for expert review.

wkiri commented 2 years ago

I did a quick review and removed obviously spurious Target annotations and any associated Contains relations.

We now have:

| | Docs | Elements | Minerals | Targets | Relations (Contains) | Property | # docs with annotations | # docs with at least one target |
|---|---|---|---|---|---|---|---|---|
| Original NER | 597 | 4894 | 6877 | 60 (37 unique) | 10 | 12400 | 597 - 16 = 581 | 34 |
| Salient target NER | 597 | 4893 | 6868 | 62 (38 unique): added DodoGoldilocks | 10 | 12401 | 597 - 16 = 581 | 34 |
| Salient target NER + gazette | 597 | 4893 | 6868 | 1714 | 1697 (94 docs) | 12401 | 597 - 13 = 584 | 245 |
| Salient target NER + gazette + Kiri's quick review | 597 | 4893 | 6868 | 1531 | 1665 (86 docs) | 12401 | 597 - 14 = 583 | 161 |

Next I will ask for expert review (10 docs each from the 86 with Contains relations) and we'll get feedback to see if any changes need to be made to our processing before reviewing the rest.

stevenlujpl commented 2 years ago

The changes have been merged into the master branch. I will close this issue.