wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Train jSRE model for "has_property" relation #19

Closed wkiri closed 2 years ago

wkiri commented 2 years ago
stevenlujpl commented 2 years ago

@wkiri I read through the code in the jsre_labeling_corenlp_and_brat.py script. Is there a reason that the element and mineral example files are generated separately? From the instructions on the train jsre wiki page, the element and mineral example files are concatenated before calling the jSRE Java program. I am wondering why not just generate the element and mineral examples together in one file (and skip the concatenation step).

wkiri commented 2 years ago

@stevenlujpl There is no longer a reason for doing this. At the time this code was written, we speculated that we might need 2 jSRE models, one for element and one for mineral. When I tested this vs. a combined model, the combined model was better, so we concatenated the examples. Going forward, we can just generate one example file that includes both sets of examples.

wkiri commented 2 years ago

(We generated the separate example files so I could do the test mentioned above)

stevenlujpl commented 2 years ago

I see. Thanks for the clarification.

I am also not sure why we need to use CoreNLP to process the .txt files and extract NERs in real time. I think the .ann files already contain everything (target, element, mineral, property, contains, and hasProperty) we need.

wkiri commented 2 years ago

The .ann file has human-reviewed ("gold") content. CoreNLP gives the automated content which corresponds better to an operational setting (on new files). We might want to generate both versions and compare the resulting models - I'm not sure we ever did that test.

stevenlujpl commented 2 years ago

Never mind. I think I see why: we need negative examples for training the jSRE model.

wkiri commented 2 years ago

Yes, but we could generate negative examples from the gold annotations (entities) as well.

wkiri commented 2 years ago

Oh wait, the other reason is that we need the features CoreNLP generates (part of speech, etc.) for the jSRE feature vector.
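
As a point of reference, here is a minimal sketch of pulling those features from a running CoreNLP server (the URL matches the default used later in this thread, http://localhost:9000; the annotator list and example sentence are assumptions, not the script's exact settings):

import json
import requests

# Hedged sketch: request POS/lemma/NER annotations from a CoreNLP server.
props = {'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'json'}
text = 'The Rosy Red target contains hematite.'  # hypothetical example sentence
resp = requests.post('http://localhost:9000',
                     params={'properties': json.dumps(props)},
                     data=text.encode('utf-8'))
doc = resp.json()
for sentence in doc['sentences']:
    for tok in sentence['tokens']:
        # word, lemma, POS tag, and NER label are the kinds of per-token
        # features that feed the jSRE feature vector
        print(tok['word'], tok['lemma'], tok['pos'], tok['ner'])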

stevenlujpl commented 2 years ago

Thanks. I think I know what to do.

stevenlujpl commented 2 years ago

@wkiri I updated the jsre_labeling_corenlp_and_brat.py script to remove solr dependency. Please see below for the summary of the changes I made (in issue19 branch):

  1. Removed the Solr dependency. The Contains and HasProperty relations are now determined using the brat .ann files.
  2. Instead of generating separate Element and Mineral example files, we now generate one example file covering both Element and Mineral.
  3. Added the ability to generate HasProperty examples.
  4. Updated the CoreNLP properties dictionary to include the attributes needed for journal papers.
  5. Exposed the NER model as a command-line argument. Previously, the NER model was hard-coded in the CoreNLP properties dictionary; now it can be specified with the -n or --ner_model option (see the sketch after this list).
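
As a rough illustration of item 5 (a hedged sketch only, not the actual script code; everything beyond the -n/--ner_model option itself is an assumption), the option can be folded into the CoreNLP properties like so:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-n', '--ner_model',
                    help='Path to a CoreNLP NER model (.ser.gz)')
args = parser.parse_args()

corenlp_props = {
    'annotators': 'tokenize,ssplit,pos,lemma,ner',
    'outputFormat': 'json',
}
if args.ner_model:
    # "ner.model" is the standard CoreNLP property for supplying a custom NER model
    corenlp_props['ner.model'] = args.ner_model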

I tested the script to generate jSRE example files for the Contains relation. I haven't tested it for the HasProperty relation because I don't think we have NER models that can recognize Property entities for the MPF and PHX docs.

wkiri commented 2 years ago

@stevenlujpl Thank you! These updates all sound great.

Our latest MER-A model does recognize and annotate Property entities. You can find it in /proj/mte/trained_models/ner_MERA-property-salient.ser.gz

stevenlujpl commented 2 years ago

@wkiri Thanks for the pointer to the MER-A NER model. I think we only have Property entities for MPF and PHX docs.

wkiri commented 2 years ago

@stevenlujpl I see. I think you could just use this NER model. It was trained on the MPF/PHX docs to learn the Property type. The only weakness might be in terms of Targets because its gazette only includes MER-A targets, so I wonder if it would miss some training examples in the MPF/PHX docs, or treat them as negative examples when they should be positives. This makes me wonder if we should be using the NER model as the entity label source. As you've noted, the entity types are in the .ann files, so can we just use those to generate the candidates (and entity labels), and only use CoreNLP to get the POS and other features?

wkiri commented 2 years ago

Next steps:

wkiri commented 2 years ago

The MPF and PHX collections have some overlap in documents (although with different annotations). To generate the list of overlapping documents:

$ cd /proj/mte/results/
$ comm -12 <(ls mpf-reviewed+properties-v2/*txt | cut -f2 -d'/') <(ls phx-reviewed+properties-v2/*txt | cut -f2 -d'/') |less

This yields 33 documents:

2009_1329.txt
2009_1409.txt
2009_1420.txt
2009_1558.txt
2009_1799.txt
2009_1846.txt
2009_2082.txt
2009_2420.txt
2010_1416.txt
2010_1419.txt
2010_1879.txt
2010_2377.txt
2011_1122.txt
2011_1547.txt
2011_2529.txt
2012_1314.txt
2012_1507.txt
2012_2864.txt
2013_2168.txt
2013_2534.txt
2013_2606.txt
2013_2778.txt
2013_2797.txt
2013_2923.txt
2013_3095.txt
2014_1604.txt
2014_2866.txt
2014_2879.txt
2015_2572.txt
2015_2851.txt
2018_1488.txt
2019_1392.txt
2019_2593.txt

However, only 3 of the overlapping docs contain Targets (2009_1329, 2013_2168, and 2015_2572). In 2 cases, the target appears as Target-PHX in the MPF docs, and in the other case it is Target-MSL. So if we are only using Target annotations to generate jSRE examples, then we should not be generating any duplicates even if we use all of the documents together.

stevenlujpl commented 2 years ago

@wkiri Thanks. We are only using the Target annotations to generate jSRE examples. I think we should be fine then, but I will double check these documents when I generate jSRE examples.

I've confirmed with the previous version of the jsre_labeling_corenlp_and_brat.py script that "Scooby Doo" is treated as two targets, "Scooby" and "Doo".

stevenlujpl commented 2 years ago

@wkiri I updated the script to handle multi-word Target and Element/Mineral/Property entities. In the meeting today, we agreed to handle only multi-word Target entities, but there are actually quite a lot of multi-word PHX Property entities, so I added code to handle those as well.

However, this caused a problem in the case where a Property entity is part of a Target entity. For example, consider the entities T21 Target 8030 8038 Rosy Red and T115 Property 8035 8038 Red in /proj/mte/results/phx-reviewed+properties-v2/2009_1067.ann. The Property entity Red is part of the Target entity Rosy Red, so when I split the Target entity Rosy Red into two Target entities, Rosy and Red, the entity Red is both a Target and a Property at the same time. I will update the script to ignore cases like this. Please let me know if you can think of a better way to handle this problem.

wkiri commented 2 years ago

@stevenlujpl Thanks! By "handle" multi-word Targets, do you mean "split into individual words and generate one example per word"?

After our meeting, I noticed that in the .ann files the Targets are not tokens but multi-word expressions. If you are using them as a reference, you must be splitting them out. Maybe we should instead try to do this properly and train jSRE to work on multi-word expressions (like you originally suggested, e.g., with underscores), since we have them in the .ann files. We just need to ensure we can perform the same pre-processing when we apply the models in jsre_parser.py.

For the Rosy Red case, in which there is a Target and Property on Red, I suggest letting Target supersede Property and treating it (Red) as a Target only. Then you can still get training examples involving the second half of the Target Rosy Red.

wkiri commented 2 years ago

If you go to multi-word support directly, I still suggest omitting nested Property words that are inside a Target. (We probably shouldn't be generating them anyway...)
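
A minimal sketch of that filtering, assuming the standard brat standoff format for T lines (tab-separated ID, "type start end", and text); the function names are placeholders, not the script's actual code:

def read_entities(ann_path):
    """Parse brat T lines into (id, type, start, end, text) tuples."""
    entities = []
    with open(ann_path) as f:
        for line in f:
            if not line.startswith('T'):
                continue  # skip relations, attributes, and notes
            ent_id, type_and_span, text = line.rstrip('\n').split('\t')
            # Discontinuous spans (with ';') are not handled in this sketch
            ent_type, start, end = type_and_span.split()[:3]
            entities.append((ent_id, ent_type, int(start), int(end), text))
    return entities

def drop_nested_properties(entities):
    """Let Target supersede Property: drop any Property span that lies
    entirely inside a Target span (e.g., Red inside Rosy Red)."""
    targets = [(s, e) for _, t, s, e, _ in entities if t == 'Target']
    return [ent for ent in entities
            if not (ent[1] == 'Property'
                    and any(s <= ent[2] and ent[3] <= e for s, e in targets))]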

stevenlujpl commented 2 years ago

@wkiri Thanks for the suggestions. The underscore approach may complicate the deployment in jsre_parser.py, and it will require some structural changes to the jsre_labeling_corenlp_and_brat.py script I currently have. Removing nested Property words is easy. I suggest that we explore the underscore approach later if the multi-word approach doesn't work well.

stevenlujpl commented 2 years ago

By "handle" multi-word Targets, do you mean "split into individual words and generate one example per word"?

Yes.

wkiri commented 2 years ago

Removing nested Property words is easy. I suggest that we explore the underscore approach later if the multi-word approach doesn't work well.

Sounds good!

stevenlujpl commented 2 years ago

@wkiri I have trained and evaluated a jSRE HasProperty model. The training examples, model file, and jSRE prediction output file can be found in the following locations:

Please see the performance of the jSRE HasProperty model on the training set below:

Accuracy = 96.37870855148341% (2209/2292) (classification)
Mean squared error = 0.036212914485165795 (regression)
Squared correlation coefficient = 0.8439440814796754 (regression)
c   tp  fp  fn  total   prec    recall  F1
1   670 82  1   2292    0.891   0.999   0.942

For this run, I had to remove the examples from the following documents because they caused a NullPointerException in jSRE's Train program. Some of the jSRE examples created from these 5 documents don't contain a jSRE agent. This is unexpected, and I don't know what is going on yet; I will investigate this problem next week. Other than these 5 documents, all of the documents in /proj/mte/results/mpf-reviewed+properties-v2 and /proj/mte/results/phx-reviewed+properties-v2 were included to create training examples. Out of the 2292 training examples created, 1621 are negative and 671 are positive.

1998_1462
1998_1803
1998_1829
2003_1081
2015_2572

wkiri commented 2 years ago

Great job, @stevenlujpl! Let's leave this issue open until you are able to investigate the 5 documents with strange examples. Otherwise, I think this issue is complete. I will create a separate issue for the updates to jsre_parser.py to include HasProperty in addition to Contains (or the ability to select one relation at a time?).

stevenlujpl commented 2 years ago

@wkiri The problem caused by the 5 documents mentioned in the post above has been resolved. I retrained the jSRE HasProperty model; please see the performance numbers below:

Accuracy = 96.16153538584567% (2405/2501) (classification)
Mean squared error = 0.03838464614154338 (regression)
Squared correlation coefficient = 0.8367933899350146 (regression)
c   tp  fp  fn  total   prec    recall  F1
1   747 93  3   2501    0.889   0.996   0.940

There are 2 types of problems with the 5 documents. The first is caused by my misunderstanding of CoreNLP's tokenization of hyphenated words. I thought CoreNLP consistently splits all hyphenated words, but that is not true: CoreNLP splits most hyphenated words, but if a hyphenated word contains one of the affixes defined in section 1.2 (Hyphenated Words) of the Supplementary Guidelines for ETTB 2.0 document, CoreNLP won't split it. In order to match the sub-tokens of hyphenated brat annotations to the CoreNLP tokens, I implemented the same tokenization logic as CoreNLP.
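
A minimal sketch of that logic, with a placeholder affix set (the real entries come from the ETTB guidelines; this is illustrative only, not the script's actual code):

# Placeholder entries; the real list is taken from section 1.2 (Hyphenated
# Words) of the Supplementary Guidelines for ETTB 2.0 document.
NO_SPLIT_AFFIXES = {'co', 'pre', 'post', 'non'}  # hypothetical

def split_like_corenlp(word):
    """Split a hyphenated word the way CoreNLP does, keeping it whole
    when one of the listed affixes appears in it."""
    if '-' not in word:
        return [word]
    parts = word.split('-')
    if any(p.lower() in NO_SPLIT_AFFIXES for p in parts):
        return [word]  # CoreNLP keeps these hyphenated words intact
    tokens = []
    for i, p in enumerate(parts):
        if p:
            tokens.append(p)
        if i < len(parts) - 1:
            tokens.append('-')  # CoreNLP emits the hyphen as its own token
    return tokens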

The second type of problem is caused by the following annotation in the MPF 1998_1803.ann file. CoreNLP splits the word Cradle-brYogi-ave into the tokens Cradle, -, brYogi, -, and ave, but the target annotation was created only for the partial word Yogi instead of the entire token brYogi. We cannot match the brat annotation Yogi to the CoreNLP token brYogi, so the example created from Yogi will be Other instead of Agent. I don't think there is a general solution to this problem, and I don't want to include hardcoded logic, so I simply added a check to ignore examples like this. According to the log messages, this is the only such case in the MPF and PHX reviewed v2 annotations.

[Screenshot: Screen Shot 2021-10-12 at 8 21 25 PM, showing the 1998_1803.ann annotation described above]

stevenlujpl commented 2 years ago

@wkiri I've copied the jSRE HasProperty model trained on all of the MPF and PHX documents to /proj/mte/trained_models/jSRE-hasproperty-mpf-phx-reviewd-v2.model. The v2 in the filename means that this model was trained on version 2 of the MPF and PHX reviewed data sets.

stevenlujpl commented 2 years ago

@wkiri I've added the train_jsre.py script. Please note that it is currently only in the issue19 branch. Please take a look and let me know if you have any suggestions. Thanks.

Please see below for all the arguments of the train_jsre.py script. The required arguments are (1) in_train_file, the input training examples in jSRE format, and (2) out_model_file, the trained jSRE model. The arguments -k, -m, -n, -w, and -c are the hyperparameters for training a jSRE model; all of the hyperparameters available for jSRE's Train Java program have been exposed in the train_jsre.py script. The argument -e (or --evaluation) is a flag to turn evaluation on or off; it is disabled by default. If it is enabled, train_jsre.py will call jSRE's Predict Java program to make predictions on the training examples and print the performance measures to stdout. The argument -jr (or --jsre_root) is the path to the jSRE root directory; the default is /proj/mte/jSRE/jsre-1.1/.

(venv) [youlu@mlia-compute1 test_train_jsre]$ python ~/MTE/MTE/src/train_jsre.py -h
usage: train_jsre.py [-h] [-k {LC,GC,SL}] [-m MEMORY_SIZE] [-n N_GRAM]
                     [-w WINDOW_SIZE] [-c C] [-e] [-jr JSRE_ROOT]
                     in_train_file out_model_file

positional arguments:
  in_train_file         Path to the jSRE input file
  out_model_file        jSRE output model

optional arguments:
  -h, --help            show this help message and exit
  -k {LC,GC,SL}, --kernel {LC,GC,SL}
                        Set type of kernel function. Available options are LC
                        (Local Context Kernel), GC (Global Context Kernel),
                        and SL (Shallow Linguistic Context Kernel). The
                        default is SL.
  -m MEMORY_SIZE, --memory_size MEMORY_SIZE
                        Set cache memory size in MB. The default is 128MB.
  -n N_GRAM, --n_gram N_GRAM
                        set the parameter n-gram of kernels SL and GC. The
                        default is 3.
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        set the window size of kernel LC. The default is 2.
  -c C                  set the trade-off between training error and margin.
                        The default is 1.
  -e, --evaluation      If this option is enabled, the trained jSRE model will
                        be evaluated with the examples in the input file, and
                        the predictions will be stored in a text file in the
                        current working directory. This option is disabled by
                        default.
  -jr JSRE_ROOT, --jsre_root JSRE_ROOT
                        Path to jSRE installation directory. Default is
                        /proj/mte/jSRE/jsre-1.1/
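
For example, a training run with evaluation enabled might look like the following (the file names are placeholders):

$ python train_jsre.py -e -k SL combined_hasproperty.examples jSRE-hasproperty.model
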
wkiri commented 2 years ago

@stevenlujpl Thank you for doing this!

To make it more of an end-to-end script, like train_ner.py, would it be possible to have it also call jsre_labeling_corenlp_and_brat.py if needed to generate (and combine) the examples files? Perhaps the input argument can be like doc_file for train_ner.py which is a list of .txt and .ann files needed to generate the training input, if the examples file does not already exist. train_ner.py assumes that the examples file it needs is the same as doc_file but with a .tsv extension. Perhaps train_jsre.py could look for an equivalent doc_file.examples file and generate it if it isn't there. I suggest directing the per-document .examples files into /tmp or other temporary space, and only keeping the concatenated examples file in the end.
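
Roughly, the proposed behavior might look like this sketch (purely illustrative; the doc_file naming is borrowed from train_ner.py, and the jsre_labeling_corenlp_and_brat.py arguments shown are hypothetical):

import os
import subprocess

def ensure_examples(doc_file):
    """Reuse an existing examples file, or generate it from the .txt/.ann list."""
    examples_file = doc_file + '.examples'  # mirrors train_ner.py's doc_file + '.tsv' convention
    if not os.path.exists(examples_file):
        # Hypothetical invocation; the real script would call
        # jsre_labeling_corenlp_and_brat.py with its actual arguments.
        subprocess.run(['python', 'jsre_labeling_corenlp_and_brat.py',
                        doc_file, examples_file], check=True)
    return examples_file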

stevenlujpl commented 2 years ago

@wkiri I've made train_jsre.py an end-to-end script and merged the change to the master branch. Please see the usage below. There are a lot of arguments, but only 3 of them are required; the rest have reasonable default values. Please let me know if you have suggestions.

usage: train_jsre.py [-h] [-k {LC,GC,SL}] [-m MEMORY_SIZE] [-n N_GRAM]
                     [-w WINDOW_SIZE] [-c C] [-e] [-jr JSRE_ROOT]
                     [-jt JSRE_TMP_DIR] [--corenlp_url CORENLP_URL]
                     [--keep_jsre_examples]
                     in_dir {contains,hasproperty} out_dir

positional arguments:
  in_dir                Directory path to documents containing text (.txt) and
                        annotations (.ann)
  {contains,hasproperty}
                        The valid options are contains or has_property
  out_dir               Path to output directory. The trained jSRE model and
                        the concatenated jSRE training example file will be
                        stored in the output directory.

optional arguments:
  -h, --help            show this help message and exit
  -k {LC,GC,SL}, --kernel {LC,GC,SL}
                        Set type of kernel function. Available options are LC
                        (Local Context Kernel), GC (Global Context Kernel),
                        and SL (Shallow Linguistic Context Kernel). The
                        default is SL.
  -m MEMORY_SIZE, --memory_size MEMORY_SIZE
                        Set cache memory size in MB. The default is 128MB.
  -n N_GRAM, --n_gram N_GRAM
                        set the parameter n-gram of kernels SL and GC. The
                        default is 3.
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        set the window size of kernel LC. The default is 2.
  -c C                  set the trade-off between training error and margin.
                        The default is 1.
  -e, --evaluation      If this option is enabled, the trained jSRE model will
                        be evaluated with the examples in the input file, and
                        the predictions will be stored in a text file in the
                        current working directory. This option is disabled by
                        default.
  -jr JSRE_ROOT, --jsre_root JSRE_ROOT
                        Path to jSRE installation directory. Default is
                        /proj/mte/jSRE/jsre-1.1/
  -jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
                        Path to a directory for jSRE to temporarily store
                        input and output files. Default is /tmp
  --corenlp_url CORENLP_URL
                        URL of Stanford CoreNLP server. The default is
                        http://localhost:9000
  --keep_jsre_examples  If this option is enabled, the jSRE example files
                        generated in jsre_tmp_dir will not be deleted. This
                        option is by default disabled.
wkiri commented 2 years ago

Todo: allow the train_jsre.py input directory argument to accept a list of directories instead of just one

wkiri commented 2 years ago

This is complete. If we later want to support multiple input directories, that can become a new issue.