wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0
5 stars 0 forks source link

Support HasProperty relations in `jsre_parser.py` #34

Closed wkiri closed 2 years ago

stevenlujpl commented 2 years ago

@wkiri I've added the support for HasProperty relations in jsre, lpsc, paper, and jgr parsers. I tested with the following 5 documents from /proj/mte/data/corpus-lpsc/mer-a-with-targets/, and found 2 HasProperty relations.

2020_2915.pdf
2020_2929.pdf
2020_2965.pdf
2020_2975.pdf
2020_2998.pdf

To use these parsers, please provide the newly added argument -rt or --relation_type. The valid option for -rt or --relations is contains or hasproperty. Please see the example command that I used to process the 5 MERA documents.

python ~/MTE/MTE/src/lpsc_parser.py -li mera-test.list -o mera-test.jsonl -l mera-test.log -n /proj/mte/trained_models/ner_MERA-property-salient.ser.gz -g /home/youlu/MTE/MTE/ref/MER/MERA-targets-final-salient.gaz.txt -rt hasproperty -jm /home/youlu/MTE/working_dir/jsre_example/mpf-phx-hasproperty.model

I also added a new argument relation_type to the json2brat.py script. This argument is used to determine the relation type (either Contains or HasProperty) to write to the ann files. The example command I used to convert jsonl to Brat ann files is shown below:

python ~/MTE/MTE/src/json2brat.py mera-test.jsonl out_dir hasproperty

The result .ann files can be found in the following directory if you are interested to take a look:

/home/youlu/MTE/working_dir/test_hasproperty/out_dir

Please take a look and let me know if you find any problem. Thanks.

stevenlujpl commented 2 years ago

Please note that these updates are currently in issue19 branch, and they haven't been merged into the master branch yet.

stevenlujpl commented 2 years ago

@wkiri I updated the jsre parser and lpsc parser to support running multiple jSRE models, and I also added support in json2brat.py script to convert .jsonl to .ann file when .jsonl has both Contains and HasProperty relations. Please note that these updates are still in the issue19 branch now. My VPN was down at the time I started working on this issue and I thought I couldn't merge without VPN, but I now just realized that I could because this MTE repo is on public github. I will merge all the updates related to issue #19 and issue #34 to master once you have an opportunity to try them out.

To use lpsc parser (or jsre parse) with multiple jsre models, please see the following example command. Please note that the arguments -rt (or --relation_types) and -jm (or jsre_models) have been changed to list type arguments.

python ~/MTE/MTE/src/lpsc_parser.py -li mera-test.list -o mera-test.jsonl -l mera-test.log -n /proj/mte/trained_models/ner_MERA-property-salient.ser.gz -g ~/MTE/MTE/ref/MER/MERA-targets-final-salient.gaz.txt -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-lpsc15-merged-binary.model /proj/mte/trained_models/jSRE-hasproperty-mpf-phx-reviewed-v2.model

The arguments of json2brat.py are also changed. I removed the third argument relation_type because we can get the relation type from the rel field in the jsonl file. json2brat.py now accepts only two arguments. The first argument is the path to the jsonl file, and the second argument is the path to the output directory. Please see the example command below:

python ~/MTE/MTE/src/json2brat.py mera-test.jsonl out_dir/
stevenlujpl commented 2 years ago

@wkiri Forgot to mention that I tested using the same 5 documents listed in the post above, and everything looks correct to me. Please let me know if you find any problems.

wkiri commented 2 years ago

@stevenlujpl This is very exciting! I'm glad that the list types can work for the arguments. I am eager to give this a try tomorrow.

wkiri commented 2 years ago

@stevenlujpl This is working great so far!

One thing I notice is that it is taking a lot longer to parse a document. It takes about 1 minute per document. Previously (when Contains was hard-coded) it took about 7 seconds per document (with only one relation). If I run it now with just the Contains relation, it takes 30 seconds per document. I wonder if there is some additional complexity that was recently added. (I do not have timing results from when you first introduced -rt above, so it could be a change present in that update as well.)

Btw, I also made the mistake of specifying -rt Contains but still specifying 2 models in -jm. Could you add a sanity check that these lists are the same length?

Finally, I wonder if, for generality, we can safely change https://github.com/wkiri/MTE/blob/5fb89f89cdaa5de126979d2af4ddb8f3dc682546/src/json2brat.py#L173-L178 to simply

outf.write(u'R%d\t%s Arg1:T%d Arg2:T%d\n' %                     
                           (r[0], r[3], r[1], r[2]))

That way we can still generate relation output from previous JSON files where r[3] was not capitalized (the relation types in the output .ann will also not be capitalized - still it is better than skipping the content). I find this helpful for testing. But I'd like to know if you think there is a reason not to do it.

wkiri commented 2 years ago

@stevenlujpl I am now running this over the whole MER-A corpus (previously was just testing 10 docs). So far it looks like it is taking 15 sec/doc, which is in line with previous experiments. Possibly the machine was loaded when I did my earlier tests and slowed things done. Please don't worry about runtime until if/when we notice some other slowdown. For now it looks ok! The run should finish in about 4-5 hours.

wkiri commented 2 years ago

Here is the command I am using:

$ export JSON_FILE=/proj/mte/results/mer-a-jsre-v2-ads-gaz-v2.jsonl
$ export NER_MODEL=/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
$ export GAZETTE=MERA_salient_targets_minerals-2017-05_elements.gaz.txt
$ python ../../git/src/lpsc_parser.py -li pdfpaths-mer-a.list -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{lpsc15-merged-binary,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains-hasproperty.log

Note that the GAZETTE used here is the full gazette (including Mineral, Element, and Property terms). In your example usage you were using the Target-only gazette. However, perhaps we only need the Target names in this setting (to auto-annotate Targets not found by the NER model). Is that right? If so, I think it is harmless to provide the full gazette (it looks like it only pulls out Target words), but I want to confirm with you.

stevenlujpl commented 2 years ago

Note that the GAZETTE used here is the full gazette (including Mineral, Element, and Property terms). In your example usage you were using the Target-only gazette. However, perhaps we only need the Target names in this setting (to auto-annotate Targets not found by the NER model). Is that right? If so, I think it is harmless to provide the full gazette (it looks like it only pulls out Target words), but I want to confirm with you.

@wkiri I checked the implementation in corenlp_parser.py (please see the code below). It is fine to provide a gazette file containing other entity types (Element, Mineral, Property). Only the Target entities will be considered.

https://github.com/wkiri/MTE/blob/703fa62338e28d389440f85452373ef2a380b25f/src/corenlp_parser.py#L61-L87

stevenlujpl commented 2 years ago

@wkiri I think all the problems in this issue have been resolved, and the code was just merged into the master branch. This issue is ready to be closed. I will keep it open until our MTE meeting next week in case you have other suggestions. Thanks.