Closed: wkiri closed this issue 2 years ago
@stevenlujpl Thank you for this excellent progress! I will try it out on the MER-A documents and let you know how it goes. I think your temporary solution is great for now.
@wkiri It would be great to record the runtime of the run on the MER-A docs. If it is too slow, we can consider parallelizing the code and/or moving the MTE pipeline to a GPU machine.
I think the CoreNLP server currently running on port 9000 isn't working. I don't know whether you were running something, so I didn't restart it. I tested the `lpsc_parser.py` script on another CoreNLP server I started on port 9001. You might want to restart the one on port 9000 before the MER-A run.
@stevenlujpl I am running the unary parser on the MER-A documents (n=1303). It is predicting a total runtime of about 2 hours on mlia-compute. I am also using the `-g` gazette option. I'll share the final runtime when it completes.
I had to make a small change to the code to add a missing comma in the list of entity linking options. Please take a look at the above commit when you have a chance.
I got this message; is it expected?

```
No handlers could be found for logger "transformers.data.metrics"
```
@wkiri I am not sure about the missing comma for `entity_linking_method`. It seems the comma was there when I checked the code into the repo. Please see this commit (https://github.com/wkiri/MTE/commit/5ca0096081463366f244bd3ce0d4874e79ede0e4#diff-f5574899976f4cba6dd7a265f3714bf2e9ddf6219fb76f1b68d8a158783374a5). I also checked my local checkout (I haven't pulled your change yet), and the comma is there. I don't understand what is going on, but I think your change is necessary if the comma wasn't in your checkout.
I saw the same message as well. This message was printed from the `transformers` package. It didn't affect anything, so I just left it there.
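For reference, this warning comes from Python 2's logging module, which complains when a library logs before any handler is configured. If the message becomes annoying, a standard way to silence it (a sketch, not something already in the MTE code) is to attach a handler before the `transformers` package logs:

```python
import logging

# Attaching a NullHandler to the offending logger suppresses the
# "No handlers could be found for logger ..." warning without
# producing any output.
logging.getLogger("transformers.data.metrics").addHandler(logging.NullHandler())

# Alternatively, configure root logging once at script startup,
# which gives every library logger a place to send records:
logging.basicConfig(level=logging.WARNING)
```

Either line at the top of `lpsc_parser.py` would make the warning go away; neither changes the parser's behavior.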
@stevenlujpl See line 662 in `unary_parser.py` in the commit you linked. The comma is missing. I wonder if you fixed it locally and had not yet pushed it? At any rate, I think it is fine as long as there are no merge conflicts.
@wkiri You are right. I was looking at the wrong place. It is odd that the PyCharm IDE doesn't flag the missing comma as an error, and the script somehow ran fine.
@stevenlujpl I guess it is valid Python syntax, and the two strings get concatenated - not what you intended, but not a syntax error :)
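For anyone hitting this in the future, Python's implicit concatenation of adjacent string literals is easy to demonstrate:

```python
# A missing comma between adjacent string literals inside a list
# silently concatenates them instead of raising a SyntaxError.
methods_with_comma = [
    'closest_containee',
    'closest_container',
]
methods_missing_comma = [
    'closest_containee'
    'closest_container'  # no comma above: the two literals merge
]
print(len(methods_with_comma))     # 2
print(len(methods_missing_comma))  # 1
print(methods_missing_comma[0])    # closest_containeeclosest_container
```

This is why neither PyCharm nor the interpreter flags it: the code is syntactically valid, just not what was intended.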
In the end it took 2hrs 15 minutes to run, which is almost exactly the same as the jSRE version (interesting). For the 1303 MER-A documents, jSRE found 225 with at least one Contains relation, while the unary classifier found 168 with at least one Contains relation.
I am currently not able to extract any of the relations into .ann files for individual inspection/review, because the values in `cont_ids` don't use the entity types in the NER annotations (for matching). They should be of the form `element_xxxx_yyyy` (where xxxx and yyyy are the span starts/stops), but instead I see `component_xxxx_yyyy`. The xxxx and yyyy values are correct, but calling it "component" does not match the NER annotations (must be "element" or "mineral"). It may also be the case that some targets end up as `container_xxxx_yyyy` instead of `target_xxxx_yyyy`.
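To make the mismatch concrete, here is a minimal sketch of how such IDs decompose; the `split_cont_id` helper is hypothetical, not part of the MTE code:

```python
def split_cont_id(cont_id):
    """Split an ID like 'element_120_127' into (ner_type, start, stop)."""
    ner_type, start, stop = cont_id.rsplit('_', 2)
    return ner_type, int(start), int(stop)

ner_type, start, stop = split_cont_id('component_120_127')
# The span offsets are usable, but 'component' is not one of the
# NER annotation types, so matching against the .ann entries fails.
print(ner_type in ('element', 'mineral', 'target'))  # False
```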
@stevenlujpl would you be well positioned to make this update to the unary parser (to use the NER types, not the unary relation types) or should we ask Yuan to look into it?
@wkiri I think I should be able to make this update in the unary parser. If I cannot figure it out, I will ask for Yuan's help.
```shell
python ../../git/src/lpsc_parser.py -li pdfpaths-$MISSION.list -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -cnte /proj/mte/trained_models/containee_model_20210902.ckpt -cntr /proj/mte/trained_models/container_model_20210902.ckpt -m closest_container_closest_containee -gid -1
```
@stevenlujpl Here are 10 MER documents to test on (these have at least one Contains relation according to the unary classifier):
```
/proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2167.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2184.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2186.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2187.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2188.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1202.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1358.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1413.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1571.pdf
```
@wkiri It seems that 0 relations were detected in the 10 MER documents above (due to 0 targets found). I am wondering whether I used the wrong gazette file or NER model. The gazette file I used is `MERA-targets-final.gaz.txt` from the MTE repo, and the NER model I used is `/proj/mte/trained_models/ner_MERA-property-salient.ser.gz`. The command I used is shown below (`test_docs.txt` is the list file containing the 10 MER docs).

```shell
python ~/MTE/MTE/src/lpsc_parser.py -li ./test_docs.txt -o test_docs.jsonl -l test_docs.log -n /proj/mte/trained_models/ner_MERA-property-salient.ser.gz -g ~/MTE/MTE/ref/MER/MERA-targets-final.gaz.txt -cnte /proj/mte/trained_models/containee_model_20210902.ckpt -cntr /proj/mte/trained_models/container_model_20210902.ckpt -m closest_container_closest_containee -gid -1
```
Never mind. There were some inconsistencies in my local git checkout. Problem resolved.
@wkiri
> I am currently not able to extract any of the relations into .ann files for individual inspection/review, because the values in cont_ids don't use the entity types in the NER annotations (for matching). They should be of the form element_xxxx_yyyy (where xxxx and yyyy are the span starts/stops), but instead I see component_xxxx_yyyy. The xxxx and yyyy values are correct, but calling it "component" does not match with the NER annotations (must be "element" or "mineral").
The problem with `cont_ids` should now be resolved. The `cont_ids` should now be either `element_xxxx_yyyy` or `mineral_xxxx_yyyy`. The fix is simple: we just need to keep track of the NER's original label before it is changed from "element" or "mineral" to "component", and Yuan's object-oriented coding style made the fix even simpler. Please see the commit above for details. The commit is currently on the `issue30` branch and hasn't been merged to the `master` branch yet. I will merge the changes to `master` once you have had an opportunity to test and confirm the fix.
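As a minimal sketch of the idea (the class and attribute names here are illustrative, not Yuan's actual code): keep the original NER label alongside the working label, and build the ID from the preserved one.

```python
class Entity(object):
    def __init__(self, label, start, stop):
        self.label = label        # working label; may become 'component'
        self.ner_label = label    # original NER label, never overwritten
        self.start, self.stop = start, stop

    def cont_id(self):
        # Build the ID from the preserved NER label so it matches
        # the .ann annotations ('element', 'mineral', 'target').
        return '%s_%d_%d' % (self.ner_label, self.start, self.stop)

e = Entity('element', 120, 127)
e.label = 'component'  # relabeled for unary classification
print(e.cont_id())     # element_120_127
```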
> It may also be the case that some targets end up as container_xxxx_yyyy instead of target_xxxx_yyyy.
For the 10 documents I used for testing, I didn't find any target whose ID is of the form `container_xxxx_yyyy`. Looking at the code, I don't think it is even possible for target IDs to take that form. The first part of the ID (e.g., "container") comes from our trained NER model, and it can only be `target`, `element`, `mineral`, or `O` without explicit modification. Could you please point me to the document in which you saw a target ID of the form `container_xxxx_yyyy`? Thanks.
> Could you please point me to the document in which you saw a target ID of the form `container_xxxx_yyyy`? Thanks.
I didn't observe this myself, which is why I said "it may be the case that..." At the time I was still (incorrectly) thinking that this was because the code was marking entities with their unary relation type (container/containee). Instead it was just a merging of element/mineral into component. So probably this is now fine! I will give it a try early next week.
@wkiri I see. Please let me know how it goes.
I confirmed that when element/mineral or mineral/element entities are next to each other, they are merged into one component entity.
@wkiri Based on my understanding of the code in `unary_parser.py`, the combined entity inherits its type from the first (former) entity, so the combined type can only be element or mineral. For example, if the combined entity is in the order element then mineral, the combined type will be element; if the order is mineral then element, the combined type will be mineral.
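The inheritance rule described above can be sketched as follows (the tuple representation and the `merge_adjacent` helper are illustrative, not the actual `unary_parser.py` code):

```python
def merge_adjacent(first, second):
    """Merge two adjacent entities into one.

    The combined entity inherits the type of the first (former)
    entity. Entities are represented as (type, start, stop) tuples.
    """
    return (first[0], first[1], second[2])

# element followed by mineral -> combined type 'element'
print(merge_adjacent(('element', 10, 14), ('mineral', 15, 22)))
# mineral followed by element -> combined type 'mineral'
print(merge_adjacent(('mineral', 10, 14), ('element', 15, 22)))
```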
For all 1303 MER-A documents (some relations could be spurious):
@wkiri I think this issue has been resolved. I will close it, but please feel free to re-open it if necessary.
Thanks! I agree. The comparison of the jSRE and unary relation classifiers is captured in #33.
@wkiri I've added Yuan's `unary_parser.py` to the MTE pipeline. I tested it on a small set of LPSC docs, and the script ran to completion without problems.

**MTE virtual environment**
I created a Python 2.7 virtual environment at `/proj/mte/venv/` using the `mteuser` user, and you can activate it with `source /proj/mte/venv/bin/activate`. This virtual environment should contain everything we need to run the parser scripts, but I didn't test whether it contains the dependencies we need to run other scripts in the MTE repo. If you find missing dependencies, please (1) let me know and I will install them, or (2) switch to the `mteuser` user and install them yourself.

**Containee and Container models**
I copied the containee and container models from Yuan's home dir (`/home/yzhuang/MTE/trained_models/within_sentence_unary_classifiers/`) to the `/proj/mte/trained_models/` dir, and renamed the model files to `containee_model_20210902.ckpt` and `container_model_20210902.ckpt`. Please feel free to rename them.

**Example command to run `lpsc_parser.py` with unary classifiers**

The command shown earlier in this thread is an example of running `lpsc_parser.py` with the unary classifiers. The `-cnte` option is the path to the containee model; `-cntr` is the path to the container model; `-m` is the entity linking method and must be one of `['closest_container_closest_containee', 'closest_target_closest_component', 'closest_containee', 'closest_container', 'closest_component', 'closest_target']` (please run `python lpsc_parser.py -h` to see what each entity linking method does); `-gid` is the GPU id, and if it is negative (e.g., -1), the parser will run on the CPU.

**Temporary solution to enable jSRE or unary classifiers for the `lpsc_parser.py` script**

I implemented a temporary solution to enable either jSRE or the unary classifiers for the `lpsc_parser.py` script. Please note that I only added this temporary solution to the `lpsc_parser.py` script for now. Once I have a better solution, I will add it to all the necessary parser scripts.

To run the `lpsc_parser.py` script with jSRE, provide a valid path to a trained jSRE model using the `-jm` option and leave the unary classifier options (i.e., `-cnte`, `-cntr`, and `-m`) empty.

To run the `lpsc_parser.py` script with the unary classifiers, use the unary classifier options (`-cnte`, `-cntr`, and `-m`) and leave the jSRE option `-jm` empty.

Please note that if both the jSRE and unary classifier options are provided, `lpsc_parser.py` may not work as expected. I only implemented this temporary solution because (1) I think we want to use the unary parser on the MER docs ASAP, and (2) I cannot think of a better solution right now.
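The mutual exclusion between the two option groups could be expressed along these lines (a sketch only; the function and argument names are illustrative and do not appear in `lpsc_parser.py`):

```python
def choose_relation_extractor(jsre_model=None, containee=None,
                              container=None, linking_method=None):
    """Pick jSRE or the unary classifiers from the CLI options.

    Mirrors the temporary rule: provide -jm for jSRE, or all of
    -cnte/-cntr/-m for the unary classifiers, but never both.
    """
    use_jsre = jsre_model is not None
    use_unary = all(v is not None
                    for v in (containee, container, linking_method))
    if use_jsre and use_unary:
        raise ValueError('Provide either -jm or the unary options, not both')
    if use_jsre:
        return 'jsre'
    if use_unary:
        return 'unary'
    raise ValueError('No relation extraction method specified')

print(choose_relation_extractor(jsre_model='/proj/mte/jSRE/model'))  # jsre
```

Raising an error up front when both groups are set would make the "may not work as expected" case impossible, which might be a reasonable interim hardening until the better solution lands.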