wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Update jSRE model for "contains" relation using all annotated docs #20

Closed wkiri closed 2 years ago

wkiri commented 2 years ago

This model will supersede the current jSRE contains model, which was trained only on LPSC 2015 documents.

stevenlujpl commented 2 years ago

@wkiri I am a bit confused about the MSL annotations to use for training. Looking at the README file, it seems the following directories contain the MSL annotations. Could you please confirm if they are the correct MSL annotations to use for training?

Please note the following paths are under my local checkout of the MTE git repository.

wkiri commented 2 years ago

@stevenlujpl Yes, those should be the correct MSL annotation source directories.

stevenlujpl commented 2 years ago

@wkiri I've updated the scripts to add support for the event-type Contains relation and trained a jSRE contains model. The model was trained using all the annotations from the following four directories:

  1. MSL 2015: /home/youlu/MTE/MTE/corpus-LPSC/lpsc15-C-raymond-sol1159-v3-utf8/
  2. MSL 2016: /home/youlu/MTE/MTE/corpus-LPSC/lpsc16-C-raymond-sol1159-utf8/
  3. MPF: /proj/mte/results/mpf-reviewed+properties-v2/
  4. PHX: /proj/mte/results/phx-reviewed+properties-v2/

A total of 1,432 examples were created: 476 positive and 956 negative. The model's performance, evaluated on the training data, is shown below.

Accuracy = 98.6731843575419% (1413/1432) (classification)
Mean squared error = 0.013268156424581005 (regression)
Squared correlation coefficient = 0.9425045433413635 (regression)
c   tp  fp  fn  total   prec    recall  F1
1   476 19  0   1432    0.962   1.000   0.980
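
The reported figures can be re-derived from the confusion counts in the table above. The following is a minimal Python sketch, assuming the tp, fp, fn, and total columns carry their usual meaning for the positive class. The function name `summarize` is hypothetical and is not part of jSRE or the MTE scripts.

```python
def summarize(tp, fp, fn, total):
    """Derive accuracy, precision, recall, and F1 from confusion counts.

    ASSUMPTION: tp/fp/fn refer to the positive class and `total` is the
    number of evaluated examples, as in the jSRE summary table.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    # Every misclassified example is either a false positive or a false negative.
    accuracy = (total - fp - fn) / total
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = summarize(tp=476, fp=19, fn=0, total=1432)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Since these counts come from the training data itself, they show how well the model fits the examples it was trained on rather than how it performs on unseen documents.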

The model is available at the following location:

/proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model

The script changes are now in the issue20 branch. I will open a PR to merge them into the master branch.

wkiri commented 2 years ago

@stevenlujpl Please report the number of training examples with just MSL 2015 vs. with the full data set now. Thank you!

wkiri commented 2 years ago

@stevenlujpl plans to compare the relations found on MER-B with the old and new Contains models.

stevenlujpl commented 2 years ago

@wkiri Please see the number of training examples for MSL 2015 vs. the full data set below:

Data set        Positive examples   Negative examples   Total examples
MSL 2015        264                 415                 679
Full data set   476                 956                 1432

wkiri commented 2 years ago

@stevenlujpl I am getting a jSRE null pointer exception for one MER-A file, which processed fine in the past.

Exception in thread "main" java.lang.NullPointerException
        at java.io.FileInputStream.<init>(FileInputStream.java:130)
        at org.itc.irst.tcc.sre.Predict.readParameters(Predict.java:90)
        at org.itc.irst.tcc.sre.Predict.run(Predict.java:154)
        at org.itc.irst.tcc.sre.Predict.main(Predict.java:214)

The lpsc_parser log file indicates:

[2021-11-10 11:07:35]: Processing 2004_1165.pdf
[2021-11-10 11:07:40]: /home/wkiri/Research/MTE/git/src/jsre_parser.py:168: UserWarning: jSRE output file not found, which indicates jSRE run may be failed.
  warnings.warn('jSRE output file not found, which indicates jSRE '

Here is the command I am running:

$ export JSON_FILE=/proj/mte/results/mer-a-jsre-v2-ads-gaz-C2HP.jsonl
$ export NER_MODEL=/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
$ export GAZETTE=../../git/ref/MERA_salient_targets_minerals-2017-05_elements.gaz.txt
$ python ../../git/src/lpsc_parser.py -i /proj/mte/data/corpus-lpsc/mer-pdf/2004_1165.pdf -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{contains-msl-mpf-phx,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains2-hasproperty.log

If I change the final command to use the previous Contains model, there is no error:

$ python ../../git/src/lpsc_parser.py -i /proj/mte/data/corpus-lpsc/mer-pdf/2004_1165.pdf -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{lpsc15-merged-binary,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains2-hasproperty.log

I wonder what is different here. Can you reproduce this error? I notice that there is a leftover directory /tmp/jsre_example_30533/, but that should not conflict with my jSRE run, right? If you remove this directory, does the problem persist? Other ideas?

stevenlujpl commented 2 years ago

@wkiri I've investigated this problem. First, I can reproduce the error you encountered with the jSRE contains model at /proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model. The leftover directory /tmp/jsre_example_30533/ isn't the problem: I tried using other directories to store the jSRE temporary files, and the problem persisted.

I still cannot determine the exact reason for the failure, but I suspect the jSRE model at /proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model was corrupted by an unexpected I/O problem when it was saved to disk. If the model is corrupted, it should fail not just on 2004_1165.pdf but on all the other files as well. I am running an experiment on all 1691 documents from /proj/mte/data/corpus-lpsc/mer-pdf/, and so far I am seeing a lot of NullPointerExceptions. I will report more details when the run completes.

The following Java function in /proj/mte/jSRE/jsre-1.1/src/org/itc/irst/tcc/sre/Predict.java threw the NullPointerException (specifically, the parameter.load line). The File object paramFile is encoded in the model object. I can think of two possible causes: (1) model.get("param") returns a File object that contains a relative path from the model file to a parameter file; (2) the model file was corrupted when it was saved to disk during training. I conducted two experiments that ruled out cause (1).

I then re-trained a new jSRE contains model with the same input files from MPF, PHX, LPSC15, and LPSC16, and I am testing it on all documents in the /proj/mte/data/corpus-lpsc/mer-pdf/ directory. 705 documents have been processed so far, and everything looks good. If the run with the new model completes without a NullPointerException, I will move it to /proj/mte/trained_models. The new model was trained using exactly the same input files and parameter settings as the model at /proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model, and its performance numbers are also exactly the same.

private void readParameters(UnZipModel model) throws IOException
{
                logger.info("read parameters");

                // get the param model
                File paramFile = model.get("param");
                parameter.load(new FileInputStream(paramFile));
}
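
The stack trace shows the exception being raised inside the FileInputStream constructor, which is what happens when it is handed a null File, so model.get("param") most likely returned null for the failing model. One way to check a model file for this before a long run is sketched below in Python. This is only a sketch under assumptions: it assumes the jSRE model is an ordinary zip archive (as the UnZipModel class name suggests) and that the parameter entry is stored under the name "param". The helper `check_model_archive` is hypothetical and not part of jSRE or MTE.

```python
import zipfile


def check_model_archive(model_path, required=("param",)):
    """Return a list of problems found in a zip-format model file.

    ASSUMPTION: the model is a plain zip archive and the entries named in
    `required` must be present. The entry name "param" is inferred from
    model.get("param") in Predict.java and may not match the real layout.
    """
    if not zipfile.is_zipfile(model_path):
        return ["not a readable zip archive"]

    problems = []
    with zipfile.ZipFile(model_path) as archive:
        # testzip() reads every entry and returns the first one whose
        # CRC check fails, or None if all entries are intact.
        bad_entry = archive.testzip()
        if bad_entry is not None:
            problems.append("corrupt entry: " + bad_entry)

        names = set(archive.namelist())
        for entry in required:
            if entry not in names:
                problems.append("missing entry: " + entry)
    return problems
```

Calling `check_model_archive("/proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model")` would return an empty list if the archive is readable and contains the expected entry. An empty list does not prove the model is usable, only that these two checks passed.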

stevenlujpl commented 2 years ago

@wkiri I've copied the new jSRE contains model to the /proj/mte/trained_models directory:

/proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model

Please let me know if you have problems using it.

wkiri commented 2 years ago

@youlu Thank you, I am currently running lpsc_parser.py on the MER-A docs with this model. It will take a little while, but it has already processed the document that generated an error above without problems. (As we expect :) )

wkiri commented 2 years ago

This process is complete. Please see the MER-A JSON file using the new Contains model here: /proj/mte/results/mer-a-jsre-v2-ads-gaz-C2HP.jsonl and compare with content in /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP.jsonl

wkiri commented 2 years ago

I am regenerating these outputs and will post here when they are ready.

wkiri commented 2 years ago

The JSON files are now available as follows:

Please share the outcome of your comparison script when you have a chance (low priority).

stevenlujpl commented 2 years ago

@wkiri Thanks for re-generating the jsonl files. I re-ran the comparison script, and it is surprising that the Contains relations in the two jsonl files are exactly the same. Please see the output below from the comparison script:

JSONL file 1: mer-a-jsre-v2-ads-gaz-CHP-redo.jsonl
JSONL file 2: mer-a-jsre-v2-ads-gaz-C2HP-redo.jsonl
Relation types included for comparison: ["Contains"]
Total relations found in JSONL file 1: 4165
Total relations found in JSONL file 2: 4165
Total common relations found in both JSONL files: 4165
Unique relations found only in JSONL file 1: 0
Unique relations found only in JSONL file 2: 0
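
A comparison like the one above comes down to set operations over relation keys. The following Python sketch illustrates the idea and is not the actual comparison script. The JSONL layout it assumes (one JSON object per line with an "id" field and a "relations" list whose items have "type", "target", and "component" fields) is hypothetical and chosen only for illustration.

```python
import json


def relation_keys(lines, rel_types):
    """Collect one hashable key per relation of the requested types.

    ASSUMPTION: each non-empty line is a JSON object shaped like
    {"id": ..., "relations": [{"type", "target", "component"}, ...]}.
    This layout is hypothetical, not the documented MTE JSONL schema.
    """
    keys = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        for rel in doc.get("relations", []):
            if rel["type"] in rel_types:
                keys.add((doc["id"], rel["type"], rel["target"], rel["component"]))
    return keys


def compare(lines1, lines2, rel_types=("Contains",)):
    """Count total, common, and unique relations across two JSONL inputs."""
    keys1 = relation_keys(lines1, rel_types)
    keys2 = relation_keys(lines2, rel_types)
    return {
        "total_1": len(keys1),
        "total_2": len(keys2),
        "common": len(keys1 & keys2),
        "only_1": len(keys1 - keys2),
        "only_2": len(keys2 - keys1),
    }
```

With two open files, this could be used as `compare(open(path1), open(path2))`. The result depends on what goes into the key: anything left out of it (for example, a confidence score) cannot register as a difference, and duplicate relations within a file collapse into one key because sets are used.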

wkiri commented 2 years ago

@stevenlujpl Wow, that does not seem possible. I have double-checked that I specified a different jSRE model for each run, and the input jSRE files are quite different in size. I will think about it some more. Let's set this aside for now in favor of issue #9.

wkiri commented 2 years ago

I think we can close this issue for now.