Closed wkiri closed 2 years ago
Please note that these updates are currently in issue19
branch, and they haven't been merged into the master
branch yet.
@wkiri I updated the jsre parser and lpsc parser to support running multiple jSRE models, and I also added support in json2brat.py
script to convert .jsonl to .ann file when .jsonl has both Contains and HasProperty relations. Please note that these updates are still in the issue19
branch now. My VPN was down at the time I started working on this issue and I thought I couldn't merge without VPN, but I now just realized that I could because this MTE repo is on public github. I will merge all the updates related to issue #19 and issue #34 to master once you have an opportunity to try them out.
To use lpsc parser (or jsre parse) with multiple jsre models, please see the following example command. Please note that the arguments -rt
(or --relation_types
) and -jm
(or jsre_models
) have been changed to list type arguments.
python ~/MTE/MTE/src/lpsc_parser.py -li mera-test.list -o mera-test.jsonl -l mera-test.log -n /proj/mte/trained_models/ner_MERA-property-salient.ser.gz -g ~/MTE/MTE/ref/MER/MERA-targets-final-salient.gaz.txt -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-lpsc15-merged-binary.model /proj/mte/trained_models/jSRE-hasproperty-mpf-phx-reviewed-v2.model
The arguments of json2brat.py
are also changed. I removed the third argument relation_type
because we can get the relation type from the rel
field in the jsonl file. json2brat.py
now accepts only two arguments. The first argument is the path to the jsonl file, and the second argument is the path to the output directory. Please see the example command below:
python ~/MTE/MTE/src/json2brat.py mera-test.jsonl out_dir/
@wkiri Forgot to mention that I tested using the same 5 documents listed in the post above, and everything looks correct to me. Please let me know if you find any problems.
@stevenlujpl This is very exciting! I'm glad that the list types can work for the arguments. I am eager to give this a try tomorrow.
@stevenlujpl This is working great so far!
One thing I notice is that it is taking a lot longer to parse a document. It takes about 1 minute per document. Previously (when Contains was hard-coded) it took about 7 seconds per document (with only one relation). If I run it now with just the Contains relation, it takes 30 seconds per document. I wonder if there is some additional complexity that was recently added. (I do not have timing results from when you first introduced -rt
above, so it could be a change present in that update as well.)
Btw, I also made the mistake of specifying -rt Contains
but still specifying 2 models in -jm
. Could you add a sanity check that these lists are the same length?
Finally, I wonder if, for generality, we can safely change https://github.com/wkiri/MTE/blob/5fb89f89cdaa5de126979d2af4ddb8f3dc682546/src/json2brat.py#L173-L178 to simply
outf.write(u'R%d\t%s Arg1:T%d Arg2:T%d\n' %
(r[0], r[3], r[1], r[2]))
That way we can still generate relation output from previous JSON files where r[3]
was not capitalized (the relation types in the output .ann
will also not be capitalized - still it is better than skipping the content). I find this helpful for testing. But I'd like to know if you think there is a reason not to do it.
@stevenlujpl I am now running this over the whole MER-A corpus (previously was just testing 10 docs). So far it looks like it is taking 15 sec/doc, which is in line with previous experiments. Possibly the machine was loaded when I did my earlier tests and slowed things done. Please don't worry about runtime until if/when we notice some other slowdown. For now it looks ok! The run should finish in about 4-5 hours.
Here is the command I am using:
$ export JSON_FILE=/proj/mte/results/mer-a-jsre-v2-ads-gaz-v2.jsonl
$ export NER_MODEL=/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
$ export GAZETTE=MERA_salient_targets_minerals-2017-05_elements.gaz.txt
$ python ../../git/src/lpsc_parser.py -li pdfpaths-mer-a.list -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{lpsc15-merged-binary,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains-hasproperty.log
Note that the GAZETTE used here is the full gazette (including Mineral, Element, and Property terms). In your example usage you were using the Target-only gazette. However, perhaps we only need the Target names in this setting (to auto-annotate Targets not found by the NER model). Is that right? If so, I think it is harmless to provide the full gazette (it looks like it only pulls out Target words), but I want to confirm with you.
Note that the GAZETTE used here is the full gazette (including Mineral, Element, and Property terms). In your example usage you were using the Target-only gazette. However, perhaps we only need the Target names in this setting (to auto-annotate Targets not found by the NER model). Is that right? If so, I think it is harmless to provide the full gazette (it looks like it only pulls out Target words), but I want to confirm with you.
@wkiri I checked the implementation in corenlp_parser.py (please see the code below). It is fine to provide a gazette file containing other entity types (Element, Mineral, Property). Only the Target entities will be considered.
@wkiri I think all the problems in this issue have been resolved, and the code was just merged into the master
branch. This issue is ready to be closed. I will keep it open until our MTE meeting next week in case you have other suggestions. Thanks.
@wkiri I've added the support for HasProperty relations in jsre, lpsc, paper, and jgr parsers. I tested with the following 5 documents from
/proj/mte/data/corpus-lpsc/mer-a-with-targets/
, and found 2 HasProperty relations.To use these parsers, please provide the newly added argument
-rt
or--relation_type
. The valid option for-rt
or--relations
iscontains
orhasproperty
. Please see the example command that I used to process the 5 MERA documents.I also added a new argument
relation_type
to thejson2brat.py
script. This argument is used to determine the relation type (eitherContains
orHasProperty
) to write to the ann files. The example command I used to convert jsonl to Brat ann files is shown below:The result .ann files can be found in the following directory if you are interested to take a look:
Please take a look and let me know if you find any problem. Thanks.