wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Integrate unary parser into MTE pipeline #30

Closed. wkiri closed this issue 2 years ago.

stevenlujpl commented 2 years ago

@wkiri I've added Yuan's unary_parser.py into the MTE pipeline. I tested on a small set of LPSC docs, and the script ran to completion fine.

MTE virtual environment

I created a Python 2.7 virtual environment at /proj/mte/venv/ using the mteuser user, and you can activate it with source /proj/mte/venv/bin/activate. This virtual environment should contain everything we need to run the parser scripts, but I didn't test whether it contains the dependencies needed to run the other scripts in the MTE repo. If you find missing dependencies, please (1) let me know and I will install them, or (2) switch to the mteuser user and install them yourself.

Containee and Container models

I copied the containee and container models from Yuan's home dir (/home/yzhuang/MTE/trained_models/within_sentence_unary_classifiers/) to /proj/mte/trained_models/ dir, and renamed the model files to containee_model_20210902.ckpt and container_model_20210902.ckpt. Please feel free to rename them.

Example command to run lpsc_parser.py with unary classifiers

python lpsc_parser.py -li lpsc.list -o lpsc.jsonl -l lpsc.log -n /proj/mte/trained_models/mpf_ner_train_lpsc15n16_emt_gazette.ser.gz -cnte /PATH/TO/CONTAINEE/MODEL -cntr /PATH/TO/CONTAINER/MODEL -m closest_container_closest_containee -gid -1

This command is an example of running lpsc_parser.py with the unary classifiers. The option -cnte is the path to the containee model; -cntr is the path to the container model; -m is the entity linking method and must be one of ['closest_container_closest_containee', 'closest_target_closest_component', 'closest_containee', 'closest_container', 'closest_component', 'closest_target'] (please run python lpsc_parser.py -h to see what each entity linking method does); -gid is the GPU id, and if it is a negative value (e.g., -1), the parser will run on the CPU.
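For illustration, the option set described above could be declared with argparse roughly as follows. This is a hypothetical sketch, not the actual lpsc_parser.py argument parser; option names mirror the command above, and the help strings are mine.

```python
import argparse

# Allowed entity linking methods, per the list above.
LINKING_METHODS = [
    'closest_container_closest_containee',
    'closest_target_closest_component',
    'closest_containee',
    'closest_container',
    'closest_component',
    'closest_target',
]

def build_parser():
    # Sketch of the unary-classifier options only (other lpsc_parser.py
    # options such as -li, -o, -l, and -n are omitted here).
    p = argparse.ArgumentParser(description='Unary-classifier option sketch')
    p.add_argument('-cnte', help='path to the containee model')
    p.add_argument('-cntr', help='path to the container model')
    p.add_argument('-m', choices=LINKING_METHODS,
                   help='entity linking method')
    p.add_argument('-gid', type=int, default=-1,
                   help='GPU id; a negative value (e.g., -1) runs on the CPU')
    return p

args = build_parser().parse_args(['-m', 'closest_containee', '-gid', '-1'])
print(args.m)
print(args.gid)
```

Because none of the declared options look like negative numbers, argparse accepts -1 as the value of -gid rather than treating it as a flag.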

Temporary solution to enable jsre or unary classifiers for lpsc_parser.py script

I implemented a temporary solution to enable jsre or unary classifiers for the lpsc_parser.py script. Please note that for now I have only added the temporary solution to lpsc_parser.py. Once I have a better solution, I will add it to all the necessary parser scripts.

To run lpsc_parser.py script with jsre, we need to provide a valid path to a trained jsre model using the -jm option and leave the unary classifier options (i.e., -cnte, -cntr, and -m) empty.

To run lpsc_parser.py script with unary classifiers, we need to use unary classifier options (-cnte, -cntr, and -m) and leave the jsre option -jm empty.

Please note that if both the jsre and unary classifier options are provided, lpsc_parser.py may not work as expected. I only implemented this temporary solution because (1) I think we want to use the unary parser on the MER docs ASAP, and (2) I cannot think of a better solution right now.
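The guard the temporary solution implies could be sketched like this. The function name and return values are illustrative only, not the actual lpsc_parser.py code:

```python
# Hypothetical sketch: pick jSRE when -jm is given, the unary classifiers
# when -cnte/-cntr/-m are given, and reject ambiguous combinations.
def choose_relation_extractor(jsre_model, containee_model, container_model,
                              linking_method):
    use_jsre = jsre_model is not None
    use_unary = any(v is not None for v in
                    (containee_model, container_model, linking_method))
    if use_jsre and use_unary:
        raise ValueError('Provide either -jm (jSRE) or the unary options '
                         '(-cnte/-cntr/-m), not both.')
    if use_unary and not all((containee_model, container_model,
                              linking_method)):
        raise ValueError('Unary mode needs -cnte, -cntr, and -m together.')
    return 'jsre' if use_jsre else 'unary' if use_unary else 'none'

print(choose_relation_extractor(None, 'containee.ckpt', 'container.ckpt',
                                'closest_container_closest_containee'))
```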

wkiri commented 2 years ago

@stevenlujpl Thank you for this excellent progress! I will try it out on the MER-A documents and let you know how it goes. I think your temporary solution is great for now.

stevenlujpl commented 2 years ago

@wkiri It would be great to record the runtime of the MER-A run. If it is too slow, we can consider parallelizing the code and/or moving the MTE pipeline to a GPU machine.

I think the corenlp server currently running on port 9000 isn't working. I didn't know whether you were running something on it, so I didn't restart it. I tested the lpsc_parser.py script against another corenlp server that I started on port 9001. You may want to restart the one on port 9000 before the MER-A run.

wkiri commented 2 years ago

@stevenlujpl I am running the unary parser on the MER-A documents (n=1303). It is predicting a total runtime of about 2 hours on mlia-compute. I am also using the -g gazette option. I'll share the final runtime when it completes.

I had to make a small change to the code to add a missing comma in the list of entity linking options. Please take a look at the above commit when you have a chance.

I got this message; is it expected?

No handlers could be found for logger "transformers.data.metrics"

stevenlujpl commented 2 years ago

@wkiri I am not sure about the missing comma for the entity_linking_method. The comma seems to be there in the commit where I checked the code into the repo. Please see this commit (https://github.com/wkiri/MTE/commit/5ca0096081463366f244bd3ce0d4874e79ede0e4#diff-f5574899976f4cba6dd7a265f3714bf2e9ddf6219fb76f1b68d8a158783374a5). I also checked my local checkout (I haven't pulled your change yet), and the comma is there. I don't understand what is going on, but your change is necessary if the comma wasn't in your checkout.

I saw the same message as well. This message was printed from the transformers package. It didn't affect anything, so I just left it there.
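For reference, that message is the Python 2 logging module reporting that a library logger (here transformers.data.metrics) emitted a record before any handler was configured. It is harmless, and it can be silenced in the calling script, e.g.:

```python
import logging

# Option 1: configure a root handler once at startup, so all library
# loggers have somewhere to send their records.
logging.basicConfig(level=logging.WARNING)

# Option 2: attach a no-op handler to just the noisy logger.
logging.getLogger('transformers.data.metrics').addHandler(
    logging.NullHandler())
```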

wkiri commented 2 years ago

@stevenlujpl See line 662 in unary_parser.py in the commit you linked. The comma is missing. I wonder if you fixed it locally and had not yet pushed it? At any rate, I think it is fine as long as there are no merge conflicts.

stevenlujpl commented 2 years ago

@wkiri You are right. I was looking at the wrong place. It is odd that the PyCharm IDE didn't flag the missing comma as an error and that the script somehow ran fine.

wkiri commented 2 years ago

@stevenlujpl I guess it is valid Python syntax, and the two strings get concatenated - not what you intended, but not a syntax error :)
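A minimal illustration of why the missing comma goes unnoticed: adjacent string literals are concatenated at compile time, so the list is still syntactically valid, it just has one fewer entry. The method names here echo the linking options but the list itself is made up for the example.

```python
# Missing comma after the second entry: Python silently fuses the last
# two strings into one list element instead of raising a syntax error.
methods_with_typo = [
    'closest_containee',
    'closest_container'   # <-- missing comma
    'closest_component',
]
methods_fixed = [
    'closest_containee',
    'closest_container',
    'closest_component',
]
print(len(methods_with_typo))   # 2
print(methods_with_typo[1])     # closest_containerclosest_component
print(len(methods_fixed))       # 3
```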

wkiri commented 2 years ago

In the end it took 2hrs 15 minutes to run, which is almost exactly the same as the jSRE version (interesting). For the 1303 MER-A documents, jSRE found 225 with at least one Contains relation, while the unary classifier found 168 with at least one Contains relation.

I am currently not able to extract any of the relations into .ann files for individual inspection/review, because the values in cont_ids don't use the entity types in the NER annotations (for matching). They should be of the form element_xxxx_yyyy (where xxxx and yyyy are the span starts/stops), but instead I see component_xxxx_yyyy. The xxxx and yyyy values are correct, but calling it "component" does not match with the NER annotations (must be "element" or "mineral"). It may also be the case that some targets end up as container_xxxx_yyyy instead of target_xxxx_yyyy.

@stevenlujpl would you be well positioned to make this update to the unary parser (to use the NER types, not the unary relation types) or should we ask Yuan to look into it?

stevenlujpl commented 2 years ago

@wkiri I think I should be able to make this update in the unary parser. If I cannot figure it out, I will ask Yuan for help.

wkiri commented 2 years ago
$ python ../../git/src/lpsc_parser.py -li pdfpaths-$MISSION.list -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -cnte /proj/mte/trained_models/containee_model_20210902.ckpt -cntr /proj/mte/trained_models/container_model_20210902.ckpt -m closest_container_closest_containee -gid -1
wkiri commented 2 years ago

@stevenlujpl Here are 10 MER documents to test on (these have at least one Contains relation according to the unary classifier):

/proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2167.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2184.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2186.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2187.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2004_2188.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1202.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1358.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1413.pdf
/proj/mte/data/corpus-lpsc/mer-pdf/2005_1571.pdf
stevenlujpl commented 2 years ago

@wkiri It seems that 0 relations were detected in the 10 MER documents above (because 0 targets were found). I am wondering if I used the wrong gazette file or NER model. The gazette file I used is MERA-targets-final.gaz.txt from the MTE repo, and the NER model I used is /proj/mte/trained_models/ner_MERA-property-salient.ser.gz. The command I used is shown below (test_docs.txt is the list file containing the 10 MER docs).

python ~/MTE/MTE/src/lpsc_parser.py -li ./test_docs.txt -o test_docs.jsonl -l test_docs.log -n /proj/mte/trained_models/ner_MERA-property-salient.ser.gz -g ~/MTE/MTE/ref/MER/MERA-targets-final.gaz.txt -cnte /proj/mte/trained_models/containee_model_20210902.ckpt -cntr /proj/mte/trained_models/container_model_20210902.ckpt -m closest_container_closest_containee -gid -1
stevenlujpl commented 2 years ago

Never mind. There were some inconsistencies in my local git checkout. Problem resolved.

stevenlujpl commented 2 years ago

@wkiri

I am currently not able to extract any of the relations into .ann files for individual inspection/review, because the values in cont_ids don't use the entity types in the NER annotations (for matching). They should be of the form element_xxxx_yyyy (where xxxx and yyyy are the span starts/stops), but instead I see component_xxxx_yyyy. The xxxx and yyyy values are correct, but calling it "component" does not match with the NER annotations (must be "element" or "mineral").

The problem with cont_ids should now be resolved. The cont_ids should now be either element_xxxx_yyyy or mineral_xxxx_yyyy. The fix is simple: we just need to keep track of the NER's original label before it is changed from "element" or "mineral" to "component", and Yuan's object-oriented coding style made the fix even simpler. Please see the commit above for details. The commit is currently in the issue30 branch and hasn't been merged to the master branch yet. I will merge the changes to master once you have had an opportunity to test and confirm the fix.
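The shape of the fix can be sketched as follows. This is a hypothetical illustration, not the actual unary_parser.py code; the class and field names are mine.

```python
# Sketch: remember the original NER label so cont_ids keep the form
# element_xxxx_yyyy / mineral_xxxx_yyyy even after the working label
# has been collapsed to 'component'.
class Entity(object):
    def __init__(self, label, start, end):
        self.label = label        # working label; may become 'component'
        self.ner_label = label    # original NER label; never changed
        self.start = start
        self.end = end

    def merge_to_component(self):
        self.label = 'component'

    def cont_id(self):
        # Build the id from the original NER label so it matches
        # the .ann annotations.
        return '%s_%d_%d' % (self.ner_label, self.start, self.end)

e = Entity('element', 120, 128)
e.merge_to_component()
print(e.cont_id())   # element_120_128, not component_120_128
```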

It may also be the case that some targets end up as container_xxxx_yyyy instead of target_xxxx_yyyy.

For the 10 documents I used for testing, I didn't find any target whose id is of the form container_xxxx_yyyy. Looking at the code, I don't think it is even possible for target ids to take that form. The first word (e.g., container) comes from our trained NER model, and it can only be target, element, mineral, or O without explicit modification. Could you please point me to a document in which you saw a target id of the form container_xxxx_yyyy? Thanks.

wkiri commented 2 years ago

Could you please point me to a document in which you saw a target id of the form container_xxxx_yyyy? Thanks.

I didn't observe this myself, which is why I said "it may be the case that..." At the time I was still (incorrectly) thinking that this was because the code was marking entities with their unary relation type (container/containee). Instead it was just a merging of element/mineral into component. So probably this is now fine! I will give it a try early next week.

stevenlujpl commented 2 years ago

@wkiri I see. Please let me know how it goes.

I confirmed that when element and mineral entities (in either order) are next to each other, they are merged into one component entity.

stevenlujpl commented 2 years ago

@wkiri Based on my understanding of the unary_parser.py code, the combined entity inherits its type from the first entity, so the combined type can only be element or mineral. For example, if the entities appear in the order element then mineral, the combined entity type will be element; if they appear in the order mineral then element, the combined type will be mineral.
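That merging rule can be sketched in a few lines. This is an illustration of the behavior described above, not the actual unary_parser.py code; the tuple representation (label, start, end) is mine.

```python
# Sketch: merging two adjacent entities produces one entity whose type
# is inherited from the first of the pair and whose span covers both.
def merge_adjacent(first, second):
    label1, start1, _ = first
    _, _, end2 = second
    return (label1, start1, end2)   # type comes from the first entity

print(merge_adjacent(('element', 10, 14), ('mineral', 15, 22)))
# ('element', 10, 22)
print(merge_adjacent(('mineral', 30, 37), ('element', 38, 40)))
# ('mineral', 30, 40)
```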

wkiri commented 2 years ago

For all 1303 MER-A documents (some relations could be spurious):

stevenlujpl commented 2 years ago

@wkiri I think this issue has been resolved. I will close it, but please feel free to re-open it if necessary.

wkiri commented 2 years ago

Thanks! I agree. The comparison of jSRE/unary relation classifier is captured in #33 .