@wkiri I've updated the parser scripts to auto-annotate words using a gazette file. The changes are currently in the `target-gazette` branch. I tested them on one document, and everything looks correct to me. I will test with more docs next week and then merge the changes into the `master` branch.
I added a command-line argument `-g` or `--gazette_file` to `corenlp_parser.py`, `jsre_parser.py`, `paper_parser.py`, `jgr_parser.py`, and `lpsc_parser.py`. The `-g` or `--gazette_file` argument is optional. If a gazette file is provided, we will do gazette target matching and add the matching targets to the jsonl file. Otherwise, gazette target matching is disabled. The gazette file must consist of "entity_type entity_name" pairs. It is OK to have multiple entity types (e.g., Element, Mineral, Target); only Target entities will be used.
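For readers who want to picture the gazette loading step, here is a minimal sketch. It is not the code in the parser scripts. It assumes one "entity_type entity_name" pair per line, with the type separated from the name by whitespace and the name allowed to contain spaces; the function name `load_gazette_targets` is hypothetical.

```python
import io


def load_gazette_targets(lines, wanted_type="Target"):
    """Collect entity names of one type from 'entity_type entity_name' lines.

    Assumption: the first whitespace-delimited field is the type and the
    rest of the line is the name (so multi-word names are kept whole).
    """
    targets = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip lines that have a type but no name
        entity_type, entity_name = parts
        if entity_type == wanted_type:
            targets.add(entity_name.strip())
    return targets


# Made-up sample content standing in for a real gazette file.
sample = io.StringIO(
    "Element Fe\n"
    "Mineral Olivine\n"
    "Target Rocknest\n"
    "Target Confidence Hills\n"
)
targets = load_gazette_targets(sample)
print(sorted(targets))  # ['Confidence Hills', 'Rocknest']
```

With a real file, the same function can be given an open file handle instead of the `io.StringIO` sample.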
The format of the gazette targets written to the jsonl file is similar to the format of NER targets. The only difference is the `source` field: gazette targets have `source` set to `gazette`, and CoreNLP targets have `source` set to `corenlp`. Please see the following gazette targets as examples:
{'text': u'Rocknest', 'begin': 1932, 'end': 1940, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 1987, 'end': 1995, 'source': 'gazette', 'label': 'Target'}
{'text': u'RMI', 'begin': 2841, 'end': 2844, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 3366, 'end': 3374, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 4426, 'end': 4434, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 4474, 'end': 4482, 'source': 'gazette', 'label': 'Target'}
{'text': u'RMI', 'begin': 4629, 'end': 4632, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 5290, 'end': 5298, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 5376, 'end': 5384, 'source': 'gazette', 'label': 'Target'}
{'text': u'Confidence Hills', 'begin': 6715, 'end': 6731, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 6801, 'end': 6809, 'source': 'gazette', 'label': 'Target'}
{'text': u'Rocknest', 'begin': 6903, 'end': 6911, 'source': 'gazette', 'label': 'Target'}
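Records like the ones above could be produced by a simple string search. The sketch below is only an illustration under stated assumptions: matching is exact, case-sensitive, and restricted to whole words, which may differ from what the parser scripts actually do. The function name `find_gazette_matches` is hypothetical; the record keys mirror the examples above.

```python
import re


def find_gazette_matches(text, names):
    """Return one record per occurrence of each gazette name in text."""
    records = []
    for name in names:
        # Assumption: a match must not be glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            records.append({
                "text": m.group(0),
                "begin": m.start(),  # character offset of the first char
                "end": m.end(),      # offset one past the last char
                "source": "gazette",
                "label": "Target",
            })
    records.sort(key=lambda r: (r["begin"], r["end"]))
    return records


# Made-up sentence for illustration.
text = "The RMI imaged Rocknest before Confidence Hills."
matches = find_gazette_matches(text, {"Rocknest", "RMI", "Confidence Hills"})
for rec in matches:
    print(rec)
```

Because `end` is exclusive, `text[rec["begin"]:rec["end"]]` should always reproduce `rec["text"]`, which gives a cheap consistency check on the offsets.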
The test I did used one MSL 2015 doc (`2015_2767.pdf`) and the gazette file at `/home/youlu/MTE/MTE/ref/targets_minerals-2017-05_elements.gaz.txt`. The resulting jsonl file is at `/home/youlu/MTE/working_dir/target_gazette/2767_gazette.jsonl`. The gazette targets listed above were found in my test run. These gazette targets are then filtered against the NER targets: if a target has already been detected by CoreNLP, it won't be written to the jsonl file. In this case, `Confidence Hills` isn't written to the jsonl file because CoreNLP has already detected it as a Target.
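The filtering step can be sketched as follows. How "already detected" is decided in the real code is not stated here, so this sketch assumes a gazette target is dropped whenever its character span overlaps a CoreNLP Target span. The function name `filter_gazette_targets` is hypothetical.

```python
def filter_gazette_targets(gazette_targets, ner_targets):
    """Keep only gazette targets that do not overlap a CoreNLP Target.

    Assumption: 'already detected' means overlapping character spans.
    """
    kept = []
    for g in gazette_targets:
        overlaps = any(
            n["label"] == "Target"
            and g["begin"] < n["end"]
            and n["begin"] < g["end"]
            for n in ner_targets
        )
        if not overlaps:
            kept.append(g)
    return kept


# Offsets taken from the example records above; the CoreNLP record is
# illustrative of the case described for Confidence Hills.
gazette = [
    {"text": "Confidence Hills", "begin": 6715, "end": 6731,
     "source": "gazette", "label": "Target"},
    {"text": "Rocknest", "begin": 6801, "end": 6809,
     "source": "gazette", "label": "Target"},
]
corenlp = [
    {"text": "Confidence Hills", "begin": 6715, "end": 6731,
     "source": "corenlp", "label": "Target"},
]
kept = filter_gazette_targets(gazette, corenlp)
print([k["text"] for k in kept])  # ['Rocknest']
```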
@wkiri I currently don't have a good way to test the `begin` and `end` indices. Do you think we can use brat to test them? I can manually create an .ann file from these gazette targets, but I am not sure how to set up the brat service.
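One way to check the offsets before loading anything into brat is to slice the document text and compare, then emit brat standoff lines of the form `T<id><TAB><label> <begin> <end><TAB><text>`. This is a sketch under the assumption that `begin` and `end` are character offsets into the same .txt file that brat will display; the function name `to_brat_ann` is hypothetical.

```python
def to_brat_ann(records, doc_text):
    """Build brat text-bound annotation lines, validating each offset."""
    lines = []
    for i, rec in enumerate(records, start=1):
        span = doc_text[rec["begin"]:rec["end"]]
        if span != rec["text"]:
            # The offsets do not point at the recorded text.
            raise ValueError(
                "offset mismatch: expected %r, found %r" % (rec["text"], span)
            )
        lines.append(
            "T%d\t%s %d %d\t%s"
            % (i, rec["label"], rec["begin"], rec["end"], rec["text"])
        )
    return "\n".join(lines) + "\n"


# Made-up document text for illustration.
doc_text = "Samples from Rocknest were analyzed."
records = [
    {"text": "Rocknest", "begin": 13, "end": 21,
     "source": "gazette", "label": "Target"},
]
ann = to_brat_ann(records, doc_text)
print(ann)
```

If any record has wrong indices, the `ValueError` flags it before the .ann file is written, and brat then serves as a visual second check.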
@stevenlujpl Thank you so much for this addition! I will give it a try.
To view `.ann` files with brat, log in to `mlia-web-external` and create a directory under `/var/www/brat/data/test/youlu/<your_dir>`. Copy your `.txt` and `.ann` files there. Make the directory and files world-readable and executable (`chmod -R a+rX <your_dir>`) and change the group to `apache` (`chgrp -R apache <your_dir>`). Then visit https://ml.jpl.nasa.gov/mte/brat/#/test/youlu/
I also really like that these entities have a different `source` value. Great idea!
It seems there is a gap here. The code in `corenlp_parser.py` uses the gazette to update the `ner` list (for the whole document), but jSRE looks at the per-token annotations, each of which has its own `ner` field. As a result, none of the newly added NERs are seen by jSRE, and no new relations are generated. I'm looking into this.
Yes, there is indeed a gap. In addition to updating the `ner` list, I should also update the per-token annotations to overwrite the CoreNLP NER labels.
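A rough sketch of that per-token update is below. It assumes the parsed document keeps CoreNLP's JSON layout, where each sentence has a `tokens` list and each token carries `characterOffsetBegin`, `characterOffsetEnd`, and `ner`; whether the parser stores tokens exactly this way is an assumption, and the function name `overwrite_token_ner` is hypothetical.

```python
def overwrite_token_ner(sentences, gazette_targets, label="Target"):
    """Set token['ner'] to label for tokens inside a gazette target span.

    Returns the number of tokens whose label was changed.
    """
    changed = 0
    for sentence in sentences:
        for token in sentence["tokens"]:
            t_begin = token["characterOffsetBegin"]
            t_end = token["characterOffsetEnd"]
            for g in gazette_targets:
                # The token must lie entirely within the gazette span.
                if g["begin"] <= t_begin and t_end <= g["end"]:
                    if token["ner"] != label:
                        token["ner"] = label
                        changed += 1
                    break
    return changed


# Made-up tokens for the text "Confidence Hills contains".
sentences = [{
    "tokens": [
        {"word": "Confidence", "characterOffsetBegin": 0,
         "characterOffsetEnd": 10, "ner": "O"},
        {"word": "Hills", "characterOffsetBegin": 11,
         "characterOffsetEnd": 16, "ner": "O"},
        {"word": "contains", "characterOffsetBegin": 17,
         "characterOffsetEnd": 25, "ner": "O"},
    ]
}]
gazette_targets = [{"text": "Confidence Hills", "begin": 0, "end": 16,
                    "source": "gazette", "label": "Target"}]
changed = overwrite_token_ner(sentences, gazette_targets)
print(changed, [t["ner"] for t in sentences[0]["tokens"]])
```

Every token of a multi-word target receives the label, so a consumer that reads only token-level `ner` fields sees the gazette targets too.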
I have made this update. We now detect more relations!
I also had to update the logic in `json2brat.py` to handle cases where a relation is detected separately for a token within a multi-word entity (since this caused duplicate relations). I have fixed this as well.
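The duplicate-relation problem can be illustrated with a small sketch. This is not the actual `json2brat.py` logic. It assumes relations are pairs of (target span, component span), maps each token-level span to the multi-word entity that encloses it, and keeps one relation per distinct pair. The names `enclosing_entity` and `dedupe_relations` are hypothetical.

```python
def enclosing_entity(span, entities):
    """Return the entity span that contains span, or span itself if none."""
    begin, end = span
    for e_begin, e_end in entities:
        if e_begin <= begin and end <= e_end:
            return (e_begin, e_end)
    return span


def dedupe_relations(relations, entities):
    """Collapse relations whose arguments fall inside the same entities."""
    seen = set()
    unique = []
    for target_span, component_span in relations:
        key = (
            enclosing_entity(target_span, entities),
            enclosing_entity(component_span, entities),
        )
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Made-up spans: a two-token target at 0-16 and a component at 26-33.
entities = [(0, 16), (26, 33)]
relations = [
    ((0, 10), (26, 33)),   # relation found for the first token
    ((11, 16), (26, 33)),  # relation found for the second token
    ((0, 16), (26, 33)),   # relation found for the whole entity
]
print(dedupe_relations(relations, entities))  # [((0, 16), (26, 33))]
```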
You can see test output for two documents here:
https://ml.jpl.nasa.gov/mte/brat/#/mer-a/mer-a-brat-gazette-testrel/
I am now running this for the full MER collection. Note: the gazette addition increases runtime significantly. Previously it took about 17 minutes to run `lpsc_parser.py` on this collection, and now it takes about 58 minutes (roughly 3.4 times longer). If we get much higher recall out of this, it is worth the additional runtime!
The results are in!
| | Docs | Elements | Minerals | Targets | Relations (Contains) | Property | # docs with annotations | # docs with at least one target |
|---|---|---|---|---|---|---|---|---|
| Original NER | 597 | 4894 | 6877 | 60 (37 unique) | 10 | 12400 | 597 - 16 = 581 | 34 |
| Salient target NER | 597 | 4893 | 6868 | 62 (38 unique): added DodoGoldilocks | 10 | 12401 | 597 - 16 = 581 | 34 |
| Salient target NER + gazette | 597 | 4893 | 6868 | 1714 | 1697 (94 docs) | 12401 | 597 - 13 = 584 | 245 |
The annotated (brat) files are here: https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette/
I will review these for initial quality control and identify which documents will pass on for expert review.
I did a quick review and removed obviously spurious Target annotations and any associated Contains relations.
We now have:

| | Docs | Elements | Minerals | Targets | Relations (Contains) | Property | # docs with annotations | # docs with at least one target |
|---|---|---|---|---|---|---|---|---|
| Original NER | 597 | 4894 | 6877 | 60 (37 unique) | 10 | 12400 | 597 - 16 = 581 | 34 |
| Salient target NER | 597 | 4893 | 6868 | 62 (38 unique): added DodoGoldilocks | 10 | 12401 | 597 - 16 = 581 | 34 |
| Salient target NER + gazette | 597 | 4893 | 6868 | 1714 | 1697 (94 docs) | 12401 | 597 - 13 = 584 | 245 |
| Salient target NER + gazette + Kiri's quick review | 597 | 4893 | 6868 | 1531 | 1665 (86 docs) | 12401 | 597 - 14 = 583 | 161 |
Next I will ask for expert review (10 docs each from the 86 with Contains relations) and we'll get feedback to see if any changes need to be made to our processing before reviewing the rest.
The changes have been merged into the `master` branch. I will close this issue.
We want to increase recall of Target types specifically. One way is to do string-based matching against a gazette of entity terms. We can provide a gazette file that consists of "Entity_type Entity_name" pairs to inform the string matching.
Steven suggests integrating this capability into `corenlp_parser.py`. It would happen after applying the trained NER model and before running relation extraction.
A command-line argument `-g <gazette_file>`, if specified, would activate this capability.
The `pre_annotate.py` script can be an inspiration for this capability (but does not need to be followed exactly).