Closed: @wkiri closed this issue 2 years ago.
@wkiri I read through the code in the jsre_labeling_corenlp_and_brat.py script. Is there a reason that the element and mineral example files are generated separately? According to the instructions on the train jSRE wiki page, the element and mineral example files are concatenated before calling the jSRE Java program. I am wondering why we don't just generate the element and mineral examples together in one file (and skip the concatenation step).
@stevenlujpl There is no longer a reason for doing this. At the time this code was written, we speculated that we might need two jSRE models, one for elements and one for minerals. When I tested this against a combined model, the combined model was better, so we concatenated the examples. Going forward, we can just generate one example file that includes both sets of examples.
(We generated the separate example files so I could run the test mentioned above.)
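The concatenation step that a combined generator would replace is only a few lines. A minimal sketch (the filenames here are placeholders, not the script's actual outputs):

```python
from pathlib import Path

def combine_example_files(paths, out_path):
    """Concatenate jSRE example files (one example per line) into one file."""
    with open(out_path, "w") as out:
        for p in paths:
            out.write(Path(p).read_text())

# Demo with throwaway files standing in for the real per-type outputs.
Path("element.examples").write_text("0\tid1\tline one\n")
Path("mineral.examples").write_text("1\tid2\tline two\n")
combine_example_files(["element.examples", "mineral.examples"],
                      "combined.examples")
print(Path("combined.examples").read_text())
```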
I see. Thanks for the clarification.
I am also not sure why we need to use CoreNLP to process the .txt files and extract named entities in real time. I think the .ann files already contain everything we need (Target, Element, Mineral, Property, Contains, and HasProperty).
The .ann file has human-reviewed ("gold") content. CoreNLP gives the automated content, which corresponds better to an operational setting (on new files). We might want to generate both versions and compare the resulting models; I'm not sure we ever did that test.
Never mind. I think I see why. We need negative examples for training the jSRE model.
Yes, but we could generate negative examples from the gold annotations (entities) as well.
Oh wait, the other reason is that we need the features CoreNLP generates (part of speech, etc.) for the jSRE feature vector.
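For context, a rough sketch of building one jSRE example line from those CoreNLP features. The format assumed here (label, id, then tokens encoded as word&&lemma&&POS&&entity-type&&role, with roles A/T/O) is the commonly described jSRE input format, and the token values are illustrative, not real CoreNLP output:

```python
# Assumed jSRE example format (verify against the jSRE docs):
#   label<TAB>id<TAB>word&&lemma&&POS&&entity-type&&role ...
# with role A (agent), T (target), or O (other).

def jsre_example(label, example_id, tokens):
    """tokens: list of (word, lemma, POS, NER type, role) tuples."""
    body = " ".join("&&".join(fields) for fields in tokens)
    return f"{label}\t{example_id}\t{body}"

# Illustrative token values only.
tokens = [
    ("Adirondack", "adirondack", "NNP", "Target",  "A"),
    ("contains",   "contain",    "VBZ", "O",       "O"),
    ("olivine",    "olivine",    "NN",  "Mineral", "T"),
]
line = jsre_example(1, "doc1-0", tokens)
print(line)
```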
Thanks. I think I know what to do.
@wkiri I updated the jsre_labeling_corenlp_and_brat.py script to remove the Solr dependency. Please see below for a summary of the changes I made (in the issue19 branch):
- Added the -n (or --ner_model) option.
I tested the script by generating jSRE example files for the Contains relation. I haven't tested the script for HasProperty relations because I don't think we have NER models that can recognize Property entities for MPF and PHX docs.
@stevenlujpl Thank you! These updates all sound great.
Our latest MER-A model does recognize and annotate Property entities. You can find it at
/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
@wkiri Thanks for the pointer to the MER-A NER model. I think we only have Property entities for MPF and PHX docs.
@stevenlujpl I see. I think you could just use this NER model. It was trained on the MPF/PHX docs to learn the Property type. The only weakness might be in terms of Targets because its gazette only includes MER-A targets, so I wonder if it would miss some training examples in the MPF/PHX docs, or treat them as negative examples when they should be positives. This makes me wonder if we should be using the NER model as the entity label source. As you've noted, the entity types are in the .ann files, so can we just use those to generate the candidates (and entity labels), and only use CoreNLP to get the POS and other features?
Next steps:
- HasProperty
The MPF and PHX collections have some overlap in documents (although with different annotations). To generate the list of overlapping documents:
$ cd /proj/mte/results/
$ comm -12 <(ls mpf-reviewed+properties-v2/*txt | cut -f2 -d'/') <(ls phx-reviewed+properties-v2/*txt | cut -f2 -d'/') |less
This yields 33 documents:
2009_1329.txt
2009_1409.txt
2009_1420.txt
2009_1558.txt
2009_1799.txt
2009_1846.txt
2009_2082.txt
2009_2420.txt
2010_1416.txt
2010_1419.txt
2010_1879.txt
2010_2377.txt
2011_1122.txt
2011_1547.txt
2011_2529.txt
2012_1314.txt
2012_1507.txt
2012_2864.txt
2013_2168.txt
2013_2534.txt
2013_2606.txt
2013_2778.txt
2013_2797.txt
2013_2923.txt
2013_3095.txt
2014_1604.txt
2014_2866.txt
2014_2879.txt
2015_2572.txt
2015_2851.txt
2018_1488.txt
2019_1392.txt
2019_2593.txt
However, only 3 of the overlapping docs contain Targets (2009_1329, 2013_2168, and 2015_2572). In 2 cases, the target appears as Target-PHX in the MPF docs, and in the other case it is Target-MSL. So if we are only using Target annotations to generate jSRE examples, then we should not be generating any duplicates even if we use all of the documents together.
@wkiri Thanks. We are only using the Target annotations to generate jSRE examples. I think we should be fine then, but I will double check these documents when I generate jSRE examples.
I've confirmed with the previous version of the jsre_labeling_corenlp_and_brat.py script that "Scooby Doo" is treated as two targets, "Scooby" and "Doo".
@wkiri I updated the script to handle multi-word Target and Element/Mineral/Property entities. In the meeting today, we agreed to only handle multi-word Target entities. However, there are actually quite a lot of PHX Property entities that are multi-word, so I added code to handle these Property entities as well.
However, this caused a problem in the case in which a Property entity is part of a Target entity. For example, see entities T21 Target 8030 8038 Rosy Red and T115 Property 8035 8038 Red in /proj/mte/results/phx-reviewed+properties-v2/2009_1067.ann. The Property entity Red is part of the Target entity Rosy Red. When I split the Target entity Rosy Red into the two Target entities Rosy and Red, the entity Red is both a Target and a Property at the same time. I will update the script to ignore cases like this. Please let me know if you can think of a better method to handle this problem.
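The nested-entity check involved here can be sketched with the character offsets from the .ann file. The helper names are hypothetical, not the script's actual functions:

```python
def span_within(inner, outer):
    """True if character span inner = (start, end) lies inside outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def drop_nested_properties(targets, properties):
    """Drop Property spans nested inside any Target span.

    targets / properties: lists of (start, end) character offsets
    taken from the Brat .ann entity annotations.
    """
    return [p for p in properties
            if not any(span_within(p, t) for t in targets)]

# Target "Rosy Red" at 8030-8038, Property "Red" at 8035-8038,
# plus an unrelated Property span that should survive.
targets = [(8030, 8038)]
properties = [(8035, 8038), (9000, 9004)]
print(drop_nested_properties(targets, properties))  # [(9000, 9004)]
```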
@stevenlujpl Thanks! By "handle" multi-word Targets, do you mean "split into individual words and generate one example per word"?
After our meeting, I noted that in the .ann files the Targets are not tokens but multi-word expressions. If you are using them as a reference, you must be splitting them out. Maybe we should instead try to do this properly and train jSRE to work on multi-word expressions (like you originally suggested, e.g., with underscores), since we have them in the .ann files. We just need to ensure we can perform the same pre-processing when we apply the models in jsre_parser.py.
For the Rosy Red case, in which there is both a Target and a Property on Red, I suggest letting Target supersede Property and treating Red as a Target only. Then you can still get training examples involving the second half of the Target Rosy Red.
If you go to multi-word support directly, I still suggest omitting nested Property words that are inside a Target. (We probably shouldn't be generating them anyway...)
@wkiri Thanks for the suggestions. The underscore approach may complicate the deployment in jsre_parser.py, and it will require some structural changes to the jsre_labeling_corenlp_and_brat.py script I currently have. Removing nested Property words is easy. I suggest that we explore the underscore approach later if the multi-word approach doesn't work well.
By "handle" multi-word Targets, do you mean "split into individual words and generate one example per word"?
Yes.
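That per-word splitting might look like the following sketch, using the Rosy Red offsets from the .ann example above (the helper name is hypothetical):

```python
def split_target(start, text):
    """Split a multi-word Target annotation into one span per word.

    start: character offset of the annotation in the document.
    text:  the annotated surface text, e.g. "Rosy Red".
    Returns a list of (word, word_start, word_end) tuples, end-exclusive,
    following the Brat .ann offset convention.
    """
    spans, offset = [], 0
    for word in text.split(" "):
        spans.append((word, start + offset, start + offset + len(word)))
        offset += len(word) + 1  # skip the separating space
    return spans

print(split_target(8030, "Rosy Red"))
# [('Rosy', 8030, 8034), ('Red', 8035, 8038)]
```

Note that the second span matches the nested Property annotation T115 Property 8035 8038 Red exactly, which is why the supersede rule above is needed.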
Removing nested Property words is easy. I suggest that we explore the underscore approach later if the multi-word approach doesn't work well.
Sounds good!
@wkiri I have trained and evaluated a jSRE HasProperty model. The training examples, model file, and jSRE prediction output file can be found in the following locations:
/home/youlu/MTE/working_dir/jsre_example/mpf-phx-hasproperty.train
/home/youlu/MTE/working_dir/jsre_example/mpf-phx-hasproperty.model
/home/youlu/MTE/working_dir/jsre_example/mpf-phx-hasproperty.output
Please see the performance of the jSRE HasProperty model on the training set below:
Accuracy = 96.37870855148341% (2209/2292) (classification)
Mean squared error = 0.036212914485165795 (regression)
Squared correlation coefficient = 0.8439440814796754 (regression)
c tp fp fn total prec recall F1
1 670 82 1 2292 0.891 0.999 0.942
For this run, I had to remove the examples from the following documents because they caused a NullPointerException in jSRE's Train program. Some of the jSRE examples created from these 5 documents don't contain a jSRE agent. This is unexpected and I don't know what is going on yet. I will investigate this problem next week. Other than these 5 documents, all the documents in /proj/mte/results/mpf-reviewed+properties-v2 and /proj/mte/results/phx-reviewed+properties-v2 were included to create training examples. Out of the 2292 training examples created, 1621 are negative examples and 671 are positive examples.
1998_1462
1998_1803
1998_1829
2003_1081
2015_2572
Great job, @stevenlujpl! Let's leave this issue open until you are able to investigate the 5 documents with strange examples. Otherwise I think this issue is complete. I will create a separate issue for the updates to jsre_parser.py to include HasProperty in addition to Contains (or the ability to select one relation at a time?).
@wkiri The problem caused by the 5 documents mentioned in the post above has been resolved. I retrained a new jSRE HasProperty model; please see the performance numbers below:
Accuracy = 96.16153538584567% (2405/2501) (classification)
Mean squared error = 0.03838464614154338 (regression)
Squared correlation coefficient = 0.8367933899350146 (regression)
c tp fp fn total prec recall F1
1 747 93 3 2501 0.889 0.996 0.940
There are 2 types of problems with the 5 documents. The first type was caused by my misunderstanding of CoreNLP's tokenization of hyphenated words. I thought CoreNLP consistently splits all hyphenated words, but that is not true. CoreNLP splits most hyphenated words, but if one of the affixes defined in section 1.2 (Hyphenated Words) of the Supplementary Guidelines for ETTB 2.0 document appears in a hyphenated word, CoreNLP won't split it. In order to match the sub-tokens from hyphenated Brat annotations to the CoreNLP tokens, I implemented the same tokenization logic as CoreNLP.
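That matching logic might be sketched as follows. The affix set below is purely illustrative (the real list comes from section 1.2 of the ETTB 2.0 guidelines), and CoreNLP's actual tokenizer has more rules than this:

```python
# Sketch of CoreNLP-style hyphen handling: split a hyphenated word into
# sub-tokens (keeping the hyphens as tokens) unless a listed affix is
# present. KEEP_WHOLE_AFFIXES is an illustrative stand-in for the ETTB
# 2.0 affix list, not the real one.
KEEP_WHOLE_AFFIXES = {"anti", "co", "non", "pre", "post", "self"}

def tokenize_hyphenated(word):
    parts = word.split("-")
    if len(parts) == 1 or any(p.lower() in KEEP_WHOLE_AFFIXES for p in parts):
        return [word]  # keep as a single token
    tokens = []
    for i, p in enumerate(parts):
        tokens.append(p)
        if i < len(parts) - 1:
            tokens.append("-")  # the hyphen itself becomes a token
    return tokens

print(tokenize_hyphenated("Cradle-brYogi-ave"))
# ['Cradle', '-', 'brYogi', '-', 'ave']
print(tokenize_hyphenated("anti-clockwise"))
# ['anti-clockwise']
```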
The second type of problem was caused by the following annotation in the MPF 1998_1803.ann file. CoreNLP splits the word Cradle-brYogi-ave into the tokens Cradle, -, brYogi, -, and ave, but the target annotation was created only for the partial word Yogi instead of the whole token brYogi. We cannot match the Brat annotation Yogi to the CoreNLP token brYogi, so the example created from Yogi would be labeled Other instead of Agent. I don't think there is a general solution to this problem, and I don't want to include hardcoded logic, so I simply added a check to ignore examples like this. According to the log messages, this is the only such case in the MPF and PHX reviewed v2 annotations.
@wkiri I've copied the jSRE HasProperty model trained on all MPF and PHX documents to /proj/mte/trained_models/jSRE-hasproperty-mpf-phx-reviewd-v2.model. The v2 in the filename means that this model was trained on version 2 of the MPF and PHX reviewed data sets.
@wkiri I've added the train_jsre.py script. Please note that it is only in the issue19 branch for now. Please take a look and let me know if you have any suggestions. Thanks.
Please see below for all the arguments of the train_jsre.py script. The required arguments are (1) in_train_file, the input training examples in jSRE format, and (2) out_model_file, the trained jSRE model. The arguments -k, -m, -n, -w, and -c are the hyperparameters for training a jSRE model. All the hyperparameters available for jSRE's Train Java program have been exposed in the train_jsre.py script. The argument -e (or --evaluation) is a flag to turn evaluation on or off; it is disabled by default. If it is enabled, the train_jsre.py script will call jSRE's Predict Java program to make predictions on the training examples and print the performance measures to stdout. The argument -jr (or --jsre_root) is the path to the jSRE root directory; the default is /proj/mte/jSRE/jsre-1.1/.
(venv) [youlu@mlia-compute1 test_train_jsre]$ python ~/MTE/MTE/src/train_jsre.py -h
usage: train_jsre.py [-h] [-k {LC,GC,SL}] [-m MEMORY_SIZE] [-n N_GRAM]
[-w WINDOW_SIZE] [-c C] [-e] [-jr JSRE_ROOT]
in_train_file out_model_file
positional arguments:
in_train_file Path to the jSRE input file
out_model_file jSRE output model
optional arguments:
-h, --help show this help message and exit
-k {LC,GC,SL}, --kernel {LC,GC,SL}
Set type of kernel function. Available options are LC
(Local Context Kernel), GC (Global Context Kernel),
and SL (Shallow Linguistic Context Kernel). The
default is SL.
-m MEMORY_SIZE, --memory_size MEMORY_SIZE
Set cache memory size in MB. The default is 128MB.
-n N_GRAM, --n_gram N_GRAM
set the parameter n-gram of kernels SL and GC. The
default is 3.
-w WINDOW_SIZE, --window_size WINDOW_SIZE
set the window size of kernel LC. The default is 2.
-c C set the trade-off between training error and margin.
The default is 1.
-e, --evaluation If this option is enabled, the trained jSRE model will
be evaluated with the examples in the input file, and
the predictions will be stored in a text file in the
                        current working directory. This option is disabled by
default.
-jr JSRE_ROOT, --jsre_root JSRE_ROOT
Path to jSRE installation directory. Default is
/proj/mte/jSRE/jsre-1.1/
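Under the hood, the script presumably assembles a Java command line for jSRE's Train program. A minimal sketch of that wrapper; the jar path and main class below are assumptions based on a typical jsre-1.1 layout, so verify them against the actual installation:

```python
import subprocess  # used when actually running the command

def jsre_train_cmd(train_file, model_file,
                   jsre_root="/proj/mte/jSRE/jsre-1.1/",
                   kernel="SL", n_gram=3, window=2, c=1, memory=128):
    """Build the jSRE Train command line.

    The classpath layout and main class are assumptions, not verified
    against the installed jsre-1.1.
    """
    classpath = f"{jsre_root}/dist/xjsre.jar:{jsre_root}/lib/*"
    return ["java", "-mx256M", "-classpath", classpath,
            "org.itc.irst.tcc.sre.Train",
            "-k", kernel, "-m", str(memory), "-n", str(n_gram),
            "-w", str(window), "-c", str(c),
            str(train_file), str(model_file)]

cmd = jsre_train_cmd("mpf-phx-hasproperty.train", "mpf-phx-hasproperty.model")
print(" ".join(cmd))
# To actually train: subprocess.run(cmd, check=True)
```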
@stevenlujpl Thank you for doing this!
To make it more of an end-to-end script, like train_ner.py, would it be possible to have it also call jsre_labeling_corenlp_and_brat.py, if needed, to generate (and combine) the example files? Perhaps the input argument could be like doc_file for train_ner.py, which is a list of the .txt and .ann files needed to generate the training input, if the examples file does not already exist. train_ner.py assumes that the examples file it needs is the same as doc_file but with a .tsv extension. Perhaps train_jsre.py could look for an equivalent doc_file.examples file and generate it if it isn't there. I suggest directing the per-document .examples files into /tmp or other temporary space, and only keeping the concatenated examples file in the end.
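The reuse-or-generate behavior suggested here could be sketched as follows. The .examples naming and the generate callback are hypothetical, following the suggestion above rather than actual code; the real generator would wrap jsre_labeling_corenlp_and_brat.py:

```python
from pathlib import Path
import tempfile

def ensure_examples(in_dir, out_dir, generate):
    """Reuse the concatenated jSRE examples file if present, else build it.

    generate: callable(txt_path, ann_path) -> iterable of example lines.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    examples = out_dir / (in_dir.name + ".examples")
    if examples.exists():
        return examples  # reuse instead of regenerating
    with open(examples, "w") as out:
        for txt in sorted(in_dir.glob("*.txt")):
            ann = txt.with_suffix(".ann")
            if ann.exists():
                out.writelines(generate(txt, ann))
    return examples

# Demo with throwaway files and a dummy generator.
tmp = Path(tempfile.mkdtemp())
(tmp / "docs").mkdir()
(tmp / "docs" / "a.txt").write_text("some text")
(tmp / "docs" / "a.ann").write_text("T1\tTarget 0 4\tsome\n")
path = ensure_examples(tmp / "docs", tmp, lambda t, a: ["0\tid\texample\n"])
print(path.read_text())
```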
@wkiri I've made train_jsre.py an end-to-end script and merged the change into the master branch. Please see the usage below. There are a lot of arguments, but only 3 of them are required, and the rest are provided with reasonable default values. Please let me know if you have suggestions.
usage: train_jsre.py [-h] [-k {LC,GC,SL}] [-m MEMORY_SIZE] [-n N_GRAM]
[-w WINDOW_SIZE] [-c C] [-e] [-jr JSRE_ROOT]
[-jt JSRE_TMP_DIR] [--corenlp_url CORENLP_URL]
[--keep_jsre_examples]
in_dir {contains,hasproperty} out_dir
positional arguments:
in_dir Directory path to documents containing text (.txt) and
annotations (.ann)
{contains,hasproperty}
                        The valid options are contains or hasproperty
out_dir Path to output directory. The trained jSRE model and
the concatenated jSRE training example file will be
stored in the output directory.
optional arguments:
-h, --help show this help message and exit
-k {LC,GC,SL}, --kernel {LC,GC,SL}
Set type of kernel function. Available options are LC
(Local Context Kernel), GC (Global Context Kernel),
and SL (Shallow Linguistic Context Kernel). The
default is SL.
-m MEMORY_SIZE, --memory_size MEMORY_SIZE
Set cache memory size in MB. The default is 128MB.
-n N_GRAM, --n_gram N_GRAM
set the parameter n-gram of kernels SL and GC. The
default is 3.
-w WINDOW_SIZE, --window_size WINDOW_SIZE
set the window size of kernel LC. The default is 2.
-c C set the trade-off between training error and margin.
The default is 1.
-e, --evaluation If this option is enabled, the trained jSRE model will
be evaluated with the examples in the input file, and
the predictions will be stored in a text file in the
                        current working directory. This option is disabled by
default.
-jr JSRE_ROOT, --jsre_root JSRE_ROOT
Path to jSRE installation directory. Default is
/proj/mte/jSRE/jsre-1.1/
-jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
Path to a directory for jSRE to temporarily store
input and output files. Default is /tmp
--corenlp_url CORENLP_URL
URL of Stanford CoreNLP server. The default is
http://localhost:9000
--keep_jsre_examples If this option is enabled, the jSRE example files
generated in jsre_tmp_dir will not be deleted. This
option is by default disabled.
Todo: allow the train_jsre.py input directory to be a list of directories instead of just one.
This is complete. If we later want to support multiple input directories, that can become a new issue.
Update jsre_parser.py to apply more than one model (see #34).