wkiri closed this issue 2 years ago
@wkiri I am a bit confused about the MSL annotations to use for training. Looking at the README file, it seems the following directories contain the MSL annotations. Could you please confirm if they are the correct MSL annotations to use for training?
Please note the following paths are under my local checkout of the MTE git repository.
/home/youlu/MTE/MTE/corpus-LPSC/lpsc15-C-raymond-sol1159-v3-utf8/
/home/youlu/MTE/MTE/corpus-LPSC/lpsc16-C-raymond-sol1159-utf8/
@stevenlujpl Yes, those should be the correct MSL annotation source directories.
@wkiri I've updated the scripts to add support for event type Contains relation, and trained a jSRE contains model. The model was trained using all the annotations from the following 4 directories:
/home/youlu/MTE/MTE/corpus-LPSC/lpsc15-C-raymond-sol1159-v3-utf8/
/home/youlu/MTE/MTE/corpus-LPSC/lpsc16-C-raymond-sol1159-utf8/
/proj/mte/results/mpf-reviewed+properties-v2/
/proj/mte/results/phx-reviewed+properties-v2/
A total of 1,432 examples were created: 476 positive and 956 negative. Please see below for the performance of the model evaluated on the training data.
Accuracy = 98.6731843575419% (1413/1432) (classification)
Mean squared error = 0.013268156424581005 (regression)
Squared correlation coefficient = 0.9425045433413635 (regression)
c tp fp fn total prec recall F1
1 476 19 0 1432 0.962 1.000 0.980
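The precision, recall, and F1 values in the table follow from the reported counts. Here is a minimal Python sketch that recomputes them, assuming the standard definitions. The function name `prf_from_counts` is illustrative and is not part of jSRE or the MTE scripts.

```python
def prf_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts reported above for the positive class (c = 1).
tp, fp, fn, total = 476, 19, 0, 1432
precision, recall, f1 = prf_from_counts(tp, fp, fn)
# Every example that is neither a false positive nor a false negative is correct.
accuracy = (total - fp - fn) / total
print(f"prec={precision:.3f} recall={recall:.3f} F1={f1:.3f} acc={accuracy:.4%}")
```

With these counts, all 19 errors are false positives, which is why recall is 1.000 while precision is slightly lower.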
The model is available at the following location:
/proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model
The changes to the scripts are now in the issue20 branch. I will open a PR to merge them into the master branch.
@stevenlujpl Please report the number of training examples with just MSL 2015 vs. with the full data set now. Thank you!
@stevenlujpl plans to compare the relations found on MER-B with the old and new Contains models.
@wkiri Please see the number of training examples for MSL 2015 vs. the full data set below:
Data set | Positive examples | Negative examples | Total examples |
---|---|---|---|
MSL 2015 | 264 | 415 | 679 |
Full data set | 476 | 956 | 1432 |
@stevenlujpl I am getting a jSRE null pointer exception for one MER-A file, which processed fine in the past.
Exception in thread "main" java.lang.NullPointerException
at java.io.FileInputStream.<init>(FileInputStream.java:130)
at org.itc.irst.tcc.sre.Predict.readParameters(Predict.java:90)
at org.itc.irst.tcc.sre.Predict.run(Predict.java:154)
at org.itc.irst.tcc.sre.Predict.main(Predict.java:214)
The lpsc_parser log file indicates:
[2021-11-10 11:07:35]: Processing 2004_1165.pdf
[2021-11-10 11:07:40]: /home/wkiri/Research/MTE/git/src/jsre_parser.py:168: UserWarning: jSRE output file not found, which indicates jSRE run may be failed.
warnings.warn('jSRE output file not found, which indicates jSRE '
Here is the command I am running:
$ export JSON_FILE=/proj/mte/results/mer-a-jsre-v2-ads-gaz-C2HP.jsonl
$ export NER_MODEL=/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
$ export GAZETTE=../../git/ref/MERA_salient_targets_minerals-2017-05_elements.gaz.txt
$ python ../../git/src/lpsc_parser.py -i /proj/mte/data/corpus-lpsc/mer-pdf/2004_1165.pdf -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{contains-msl-mpf-phx,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains2-hasproperty.log
If I change the final command to use the previous Contains model, there is no error:
$ python ../../git/src/lpsc_parser.py -i /proj/mte/data/corpus-lpsc/mer-pdf/2004_1165.pdf -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{lpsc15-merged-binary,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains2-hasproperty.log
I wonder what is different here. Can you reproduce this error? I notice that there is a leftover directory /tmp/jsre_example_30533/, but it seems that should not conflict with my jSRE run, right? If you remove this directory, does the problem persist? Other ideas?
@wkiri I've investigated this problem. First of all, I can reproduce the error you encountered with the jSRE contains model at /proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model. The leftover directory /tmp/jsre_example_30533/ isn't the problem. I tried using other directories to store jSRE temporary files, and the problem still persists.
I still cannot determine the exact reason for the failure, but I suspect the jSRE model at /proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model was corrupted by an unexpected I/O problem when it was saved to disk. If the model is corrupted, it should fail not only on 2004_1165.pdf but on all the other files as well. I am running an experiment on all 1691 documents from /proj/mte/data/corpus-lpsc/mer-pdf/, and so far I am seeing a lot of NullPointerExceptions. I will report more details when the run completes.
The following Java function in /proj/mte/jSRE/jsre-1.1/src/org/itc/irst/tcc/sre/Predict.java caused the NullPointerException (specifically, the parameter.load line). The File object paramFile is encoded in the model object. I think there are two possible causes of the NullPointerException: (1) model.get("param") returns a File object that contains a relative path from the model file to a parameter file; (2) the model file was corrupted when it was saved to disk during training. I conducted two experiments that ruled out cause (1). I then re-trained a new jSRE contains model with the same input files from MPF, PHX, LPSC15, and LPSC16, and I am testing it on all documents in the /proj/mte/data/corpus-lpsc/mer-pdf/ directory. 705 documents have been processed so far, and everything looks good. If the run with the new model completes without a NullPointerException, I will move it to /proj/mte/trained_models. This new model was trained using exactly the same input files and parameter settings as the model at /proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model, and the performance numbers are also exactly the same.
private void readParameters(UnZipModel model) throws IOException
{
    logger.info("read parameters");
    // get the param model
    File paramFile = model.get("param");
    parameter.load(new FileInputStream(paramFile));
}
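One way to probe the corrupted-model hypothesis without running the full pipeline might be to inspect the model file directly. The Python sketch below assumes that a jSRE model is a zip archive containing an entry named "param". That is only inferred from the UnZipModel class name and the model.get("param") call, and it is not confirmed in this thread. The helper name `missing_model_entries` is hypothetical.

```python
import zipfile


def missing_model_entries(model_path, required=("param",)):
    """Return the required entry names that the model archive lacks.

    ASSUMPTION: the model file is a zip archive whose entries are keyed by
    names such as "param". This is inferred from UnZipModel and
    model.get("param"), not confirmed by the jSRE documentation.
    """
    if not zipfile.is_zipfile(model_path):
        # Not a readable zip at all, so every required entry is unavailable.
        return list(required)
    with zipfile.ZipFile(model_path) as archive:
        names = set(archive.namelist())
    return [name for name in required if name not in names]
```

If the assumption holds, calling `missing_model_entries(...)` on a model path and getting back an empty list would suggest the parameter entry is present. A non-empty list would be consistent with model.get("param") returning null.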
@wkiri I've copied the new jSRE contains model to the /proj/mte/trained_models directory:
/proj/mte/trained_models/jSRE-contains-msl-mpf-phx.model
Please let me know if you have problems using it.
@youlu Thank you, I am currently running lpsc_parser.py on the MER-A docs with this model. It will take a little while, but it has already processed the document that generated an error above without problems. (As we expect :) )
This process is complete. Please see the MER-A JSON file using the new Contains model here:
/proj/mte/results/mer-a-jsre-v2-ads-gaz-C2HP.jsonl
and compare with content in
/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP.jsonl
I am regenerating these outputs and will post here when they are ready.
The JSON files are now available as follows:
/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-redo.jsonl
/proj/mte/results/mer-a-jsre-v2-ads-gaz-C2HP-redo.jsonl
Please share the outcome of your comparison script when you have a chance (low priority).
@wkiri Thanks for re-generating the jsonl files. I re-ran the comparison script, and it is surprising that the Contains relations in the two jsonl files are exactly the same. Please see the output below from the comparison script:
JSONL file 1: mer-a-jsre-v2-ads-gaz-CHP-redo.jsonl
JSONL file 2: mer-a-jsre-v2-ads-gaz-C2HP-redo.jsonl
Relation types included for comparison: ["Contains"]
Total relations found in JSONL file 1: 4165
Total relations found in JSONL file 2: 4165
Total common relations found in both JSONL files: 4165
Unique relations found only in JSONL file 1: 0
Unique relations found only in JSONL file 2: 0
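For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of a set-based diff between two JSONL files. It is not the comparison script used above. The schema is a hypothetical assumption: each line is taken to be a JSON object with a "doc_id" string and a "relations" list whose items have "type", "source", and "target" keys. The real MTE JSONL layout may differ.

```python
import json


def load_relations(jsonl_path, relation_types=("Contains",)):
    """Collect relations of the given types from a JSONL file as a set of tuples.

    ASSUMPTION (hypothetical schema): each line is a JSON object with a
    "doc_id" string and a "relations" list of objects that carry "type",
    "source", and "target" keys.
    """
    relations = set()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            for rel in doc.get("relations", []):
                if rel["type"] in relation_types:
                    relations.add(
                        (doc["doc_id"], rel["type"], rel["source"], rel["target"])
                    )
    return relations


def compare_relations(rels1, rels2):
    """Summarize the overlap between two relation sets."""
    return {
        "total_1": len(rels1),
        "total_2": len(rels2),
        "common": len(rels1 & rels2),
        "only_1": len(rels1 - rels2),
        "only_2": len(rels2 - rels1),
    }
```

Because the relations are held in sets, duplicates within one file are counted once. An outcome like the one reported above would correspond to `only_1 == only_2 == 0`.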
@stevenlujpl Wow, that does not seem possible. I have double-checked that I specified a different jSRE model for each run, and the input jSRE files are quite different in size. I will think about it some more. Let's set this aside for now in favor of issue #9.
I think we can close this issue for now.
This model will supersede the current jSRE contains model, which was trained only on LPSC 2015 documents.