Current output (with Kiri's triage but prior to full expert review): https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette
Each of our 3 expert reviewers reviewed 10 documents. Findings:
I completed my quick pass through the 1303 MER-A documents to remove spurious Target annotations and reduce the reviewing effort that will be required. The documents are at:
https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette
| | Targets (unique) | Contains | Docs with at least one Target | Docs with at least one Contains relation |
| --- | --- | --- | --- | --- |
| Salient target NER + gazette | 3393 (313) | 3320 | 583 | 225 |
| After removing spurious Targets | 2944 (278) | 2958 | 390 | 200 |
Our goal is now to review the 200 documents with at least one Contains relation. Using our estimate of 10 minutes as the average time to review one document, this works out to 33 hours of reviewing, or about 11 hours per person with 3 reviewers.
Note: after adding the `HasProperty` relations, it is possible that the set to review will increase.
I've combined the `Contains` and `HasProperty` relations (from jSRE) manually to make them simultaneously reviewable for 7 documents:
https://ml.jpl.nasa.gov/mte/brat/#/mer-a/mer-a-jsre-contains+hasproperty-10withrel/
We will use these as examples to converge on standard review guidelines for the rest of the MER corpus.
Reviewing Guidelines are here and will be refined in conversation with Leslie, Matt, and Raymond: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#
I have reviewed the first 30 (of 397) MER-A documents. It took an average of 1.77 minutes per document so far (including the time taken to fix the gazettes when I realized that Anatolia is a MER-B target, not MER-A). I am compiling questions for expert review here (so far just one case): https://docs.google.com/spreadsheets/d/1yLmoy8IrTn_I38J3DGQDs0ScQ1jxkfCGaipR8WMwxhs/edit#gid=0
@wkiri I added the template files for MERA and MERB. Please see the following example bundle generated with the MERA template files. Please note this example bundle was generated using the MPF SQLite DB file, so please ignore the CSV files in the bundle and only check the contents of the XML label files. Please let me know what you think. mera_bundle.zip
If we run the PDS validate tool on this bundle, there will be two warnings about not having XML label files for the manifest and md5 checksum files. I think we can safely ignore them.
```
...
PASS: file:/home/youlu/MTE/working_dir/test_mer_bundle/mera_bundle/mars_target_encyclopedia/urn-nasa-pds-mars_target_encyclopedia.manifest
WARNING [warning.file.not_referenced_in_label] File is not referenced by any label
3 integrity check(s) completed
PASS: file:/home/youlu/MTE/working_dir/test_mer_bundle/mera_bundle/mars_target_encyclopedia/urn-nasa-pds-mars_target_encyclopedia.md5
WARNING [warning.file.not_referenced_in_label] File is not referenced by any label
4 integrity check(s) completed
...
```
@stevenlujpl I am having some trouble generating an SQLite database for MER-A documents. The ingestion step did not generate any errors. When I try to combine the brat .ann (reviewed) files with the JSON file, I get an error:
```
Traceback (most recent call last):
  File "./update_sqlite.py", line 131, in <module>
    main(**vars(args))
  File "./update_sqlite.py", line 64, in main
    mte_db.update_brat_doc(brat_doc)
  File "/home/wkiri/Research/MTE/git/src/sqlite_mte.py", line 503, in update_brat_doc
    self.insert_brat_doc(brat_doc)
  File "/home/wkiri/Research/MTE/git/src/sqlite_mte.py", line 359, in insert_brat_doc
    (brat_ann.ann_id, brat_doc.doc_id, e)
RuntimeError: Insertion error for target T59 in doc 2005_1202: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
```
Here is the command I used:
```
$ export MISSION=mer-a
$ export VERSION=v1.0.0
$ export DB_FILE=/proj/mte/sqlite/mte_${MISSION}_all_$VERSION.db
$ export ANN_DIR=$MISSION-reviewed+properties-v2
$ export REVIEWER=Kiri
$ ./update_sqlite.py -r $REVIEWER -ro /proj/mte/results/$ANN_DIR $DB_FILE $MISSION > update-DB-$MISSION-$VERSION.log
```
You can find the document it is complaining about in /proj/mte/results/$ANN_DIR/2005_1202.ann. I do not see any UTF-8 content in this file, and I am not sure what the text_factory error means. It is possible that we have some UTF-8 content in some files, so it does seem good to support this in general if need be.
As a minor note, `update_sqlite.py` has a restriction on what you can specify for the "mission" argument. Do you think we need this restriction? It will not run with MISSION set to "mer-a" unless I add it to the mission choices. But I do not see any conditional processing based on the mission; I think it just gets stored in the DB file. If so, it seems fine to remove the restriction to particular choices. However, if there is some processing that only works for the named missions, we should keep it (and expand it to support mer-a and mer-b). Please let me know your thoughts.
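For reference, here is a sketch of the kind of change I mean (hypothetical; the actual argument setup in `update_sqlite.py` may look different):

```python
import argparse

parser = argparse.ArgumentParser()
# Current behavior (as I understand it): the mission argument is
# restricted to a fixed list, so 'mer-a' is rejected:
#   parser.add_argument('mission', choices=['mpf', 'phx'])
# Proposed: accept any mission string, since the value appears to be
# stored in the DB without mission-specific processing.
parser.add_argument('mission',
                    help='Mission name, e.g., mpf, phx, mer-a, mer-b')
args = parser.parse_args()
```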
As an update, I have now reviewed the first 100 MER-A documents, and Matt provided answers to my review questions to date. Average time taken is 3.7 minutes/document.
Not exactly sure why this issue was closed, but I will re-open it because we are not done with it yet.
(I think the issue was automatically closed because I merged the branch `issue27` into `master`.)
@wkiri The problem with the `update_sqlite.py` script should have been resolved. The insertion error was caused by inserting the following sentence into the database:

> Identification and Characteristics of Gusev Olivine: Pancam, Mini-TES, and Mössbauer spectra obtained by the Spirit Rover all confirm the presence of abundant olivine in Gusev plains basalts (Humphrey, Adirondack, and Mazatzal) [11].

The sentence was stored as a byte string in our scripts, but the database expects Unicode strings. I decoded the sentence as UTF-8, and the problem seems to be resolved.
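For the record, the fix amounts to something like the following (a self-contained sketch with illustrative table and variable names, not the exact code in `sqlite_mte.py`):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE sentences (doc_id TEXT, sentence TEXT)')

# The offending sentence arrived as an 8-bit byte string; decode it to
# Unicode before inserting (the Python 2 sqlite3 module raised the
# "8-bit bytestrings" error for raw bytes without a text_factory).
sentence = b'M\xc3\xb6ssbauer spectra confirm abundant olivine.'
if isinstance(sentence, bytes):
    sentence = sentence.decode('utf-8')  # now u'Mössbauer ...'
cur.execute('INSERT INTO sentences (doc_id, sentence) VALUES (?, ?)',
            ('2005_1202', sentence))
```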
@wkiri I've merged the changes I made in the `issue27` branch into the `master` branch. Now, `master` contains all the changes, including the change I just made to fix `update_sqlite.py`. Thanks.
@youlu Thank you for finding and fixing the string issue! Interesting: I think because the commit message had the word "fix" in it, GitHub decided that counts as closing the issue :)
I again tried to generate a MER-A (mer2) database. I think I discovered another issue in the update process. I am getting this error:
```
RuntimeError: Insertion error for target T32 in doc 2015_2881: There are duplicated sentences in the sentences table: [2] Ashley J.
```
The check for duplicate sentences looks only at the sentence content. If the same sentence appears in more than one document (which is what happened here: the same sentence is in document 2008_2382), then the insertion fails. However, I think the query for duplicates should check both the sentence and the document id, so that it does not treat these as duplicate sentences. What do you think?
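Something along these lines (a sketch; the table and column names are illustrative, not necessarily what `sqlite_mte.py` uses):

```python
def sentence_exists(cursor, doc_id, sentence):
    # Look for the same sentence *within the same document* only, so
    # identical sentences in different documents are not flagged.
    cursor.execute(
        'SELECT COUNT(*) FROM sentences WHERE doc_id = ? AND sentence = ?',
        (doc_id, sentence))
    return cursor.fetchone()[0] > 0
```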
@wkiri I've added `doc_id` to the duplicated-sentence check, so this problem should be resolved. I didn't consider this situation, as I didn't think there would be two identical sentences in two different documents. Please let me know if you run into other problems. Thanks.
@stevenlujpl It certainly seems like an unlikely occurrence, doesn't it? :)
Thank you for the update! I've generated a DB for mer2 that you can use for dev/testing of a `mer2` website:
/proj/mte/sqlite/mte_mer2_all_v1.0.0.db
When the reviewing process is complete, I will update this file.
Note: Our current solution in `name_utils.py` for mapping aliases/typos to canonical target names will not work as-is, since it does not differentiate between missions. PHX has a target "Baby Bear" that is abbreviated as "BB", and MER-A has a target "Breadbox" that is also abbreviated as "BB". We will need separate dictionaries per mission or, more generally, to transition to an "aliases" table in the schema (per mission) as in issue #9.
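One possible interim shape for the per-mission dictionaries (a sketch of the idea, not what `name_utils.py` currently does):

```python
# Per-mission alias maps: the same abbreviation can resolve to
# different canonical targets depending on the mission.
ALIASES = {
    'phx':   {'BB': 'Baby Bear'},
    'mer-a': {'BB': 'Breadbox'},
}

def canonical_name(mission, name):
    # Fall back to the name itself if no alias is recorded.
    return ALIASES.get(mission, {}).get(name, name)
```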
@stevenlujpl I attempted to generate a JSON file for the already-reviewed MER-A documents. However, I am getting an error:
```
[2021-12-01 14:17:18]: LPSC parser failed: /proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf
[2021-12-01 14:17:18]:
Traceback (most recent call last):
  File "../../git/src/lpsc_parser.py", line 154, in process
    ads_dict['metadata'])
  File "../../git/src/lpsc_parser.py", line 51, in parse
    paper_dict = super(LpscParser, self).parse(text, metadata)
  File "/home/wkiri/Research/MTE/git/src/paper_parser.py", line 32, in parse
    assert type(text) == str or type(text) == unicode
AssertionError
```
Here is the command I used (only inputting one file for testing):
```
$ export JSON_FILE=/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-first190.jsonl
$ export NER_MODEL=/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
$ export GAZETTE=../../git/ref/MERA_salient_targets_minerals-2017-05_elements.gaz.txt
$ python ../../git/src/lpsc_parser.py -i /proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{lpsc15-merged-binary,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains-hasproperty-first190.log
```
Unfortunately I don't have time to look more deeply right now. If you get a chance before I do, please let me know what you find.
This appears to be caused by an error with Tika, which sets the document content to `None`. I will keep investigating.
@stevenlujpl I think we need to restart the Tika server. I see it running on mlia-compute1 under your username:

```
youlu 1536 1 0 Nov16 ? 00:09:15 java -cp /tmp/tika-server.jar org.apache.tika.server.TikaServerCli --port 9998 --host localhost
```

Could you kill this process and restart it as `mteuser`? (I think we restart simply by running Python and `import tika`; is that right?)
@wkiri I have terminated the Tika server process running under my username. I don't think we need to explicitly start the Tika server; it will be started automatically when we run any of the parser scripts (e.g., jsre_parser.py, lpsc_parser.py, etc.).
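For example, with the tika Python package the server is launched lazily on the first parse call (a minimal sketch; the PDF path is just an example from this thread):

```python
from tika import parser

# The first call starts a Tika server (as the invoking user) if one
# is not already listening on the default port, then parses the file.
parsed = parser.from_file('/proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf')
print(parsed['content'][:200] if parsed['content'] else 'No content extracted')
```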
@stevenlujpl Thank you! It is working now. I will let you know when the new database is available.
Awkwardly, the Tika process is now under my username. I wish there were a way to restart it for all users, not just the last person who caused the server to start.
The new database is now available at https://ml.jpl.nasa.gov/mte/mer2/
By the way, if you search for "Adirondack" and scroll down to "iron", it seems that "iron" is highlighted as a substring inside the name "Adirondack" :) It would be nice to update the matching logic to avoid that case :)
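A word-boundary match would avoid this; here is a sketch of the idea (not the website's actual highlighting code):

```python
import re

def highlight(text, term):
    # Match the term only at word boundaries so that 'iron' does not
    # light up inside 'Adirondack'.
    pattern = re.compile(r'\b%s\b' % re.escape(term), re.IGNORECASE)
    return pattern.sub(r'<mark>\g<0></mark>', text)

print(highlight('Adirondack contains iron.', 'iron'))
# Adirondack contains <mark>iron</mark>.
```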
I've reviewed 260/397 documents now; hoping to finish this up this week if possible.
Great news! I finished reviewing all 397 documents. Of these, 317 contain at least one MER-A target. There are 3357 Target mentions in all. I'll generate a database using these annotations and provide additional statistics, then move on to generating a bundle and exercising the alias capability.
I reviewed the content in the MTE DB and found some additional minor fixes needed for the annotations. I'm re-generating the database, which is now:
/proj/mte/sqlite/mte_mer2_all_v1.3.1.db
One remaining issue is that the `documents` table does not have any information in the `affiliations` column. @stevenlujpl Could you check into this? If it is not provided by ADS, I think we should be falling back on grobid, which extracts it from the PDF (and we'd need to review it). I am not sure yet why it is blank.
@stevenlujpl I started `grobid` on `mlia-compute1` and re-generated the JSON file, then the SQLite database. The `affiliations` field is still empty. Do you think there is another cause?
JSON file: /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397-withgrobid.jsonl
SQLite database: /proj/mte/sqlite/mte_mer2_all_v1.3.2.db
(It's easier/faster to test with a single PDF - even in that case, the grobid fields are not populated.)
The grobid fields are not available in the jsonl file for some reason.
There must be something I don't understand yet. I tried running lpsc_parser.py twice: the first time the affiliation fields were empty, and the second time it worked.
For the first run, I logged into mlia-compute1 and ran lpsc_parser.py without restarting any service; the JSONL file was generated without any error, but the grobid fields were all unavailable.
For the second run, I restarted the parser server and repeated exactly what I did for the first run, and it worked fine. Please see the JSONL and DB files in /home/youlu/MTE/working_dir/investigate_affiliations.
@wkiri Please let me know if there is still a problem.
@stevenlujpl Ok, I will try it again!
I tested on one document, and the grobid fields are now showing up. I will run on the entire collection now for MER-A.
I have documented the quality/sanity checks I'm performing on the DB here: https://github-fn.jpl.nasa.gov/wkiri/mte/wiki/Generate-and-deliver-MTE-PDS4-bundle-to-PDS
It works! The database now has affiliations in it for MER-A (mer2).
Interestingly, I went back and checked and found that our PHX database has affiliations, but the MPF database does not. I've included this (manual) check in the procedure above so we have a better chance of catching this if it happens again (in addition to having `lpsc_parser.py` check for all needed services and terminate if some are not available).
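A pre-flight check along these lines could work (a sketch; the endpoints and ports are assumptions and would need to match our actual deployment):

```python
import sys
import requests

# Hypothetical health-check endpoints for the services the parser needs.
SERVICES = {
    'tika':   'http://localhost:9998/tika',
    'grobid': 'http://localhost:8070/api/isalive',
}

def check_services():
    for name, url in SERVICES.items():
        try:
            requests.get(url, timeout=5).raise_for_status()
        except requests.RequestException:
            sys.exit('Required service %r is not responding at %s' % (name, url))
```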
Note: two interesting aliases came up where we did not have the canonical target name already in the `targets` table: `B.hardy` -> `Bennett_Hardy` and `Fredericbrown` -> `Frederic_Brown`.
To ensure that the canonical name from `aliases` maps to an entry in `targets`, I added these canonical names to `MERA-targets-final.txt` so they will appear in the `targets` table. This means that a re-run of `lpsc_parser.py` could potentially generate different NER/jSRE output. However, for MER-A we actually use `ref/MER/MERA-targets-final-salient.gaz.txt` as our gazette, which I have not modified, so the results should not change.
I am now working on generating the updated databases for MPF and PHX. For MPF, we need to re-generate the JSON file so it will contain affiliations. When I ran this, 5 documents gave a warning such as:
```
/home/wkiri/Research/MTE/git/src/ads_parser.py:142: UserWarning: [WARNING] Failed accessing ADS database. The HTTP code is 504. The query string is year:1998 AND page:1137 AND pub:"Lunar and Planetary Science Conference"
  (response.status_code, query_str))
```
And one document gave this warning:
```
/home/wkiri/Research/MTE/git/src/ads_parser.py:149: UserWarning: [Warning] 0 document found in the ADS database
  warnings.warn('[Warning] 0 document found in the ADS database')
```
We can work around ADS failures by manually entering the info for those documents, but I wanted to check with @stevenlujpl first because the 5 failures with code 504 (gateway timeout) are surprising to me. Have you seen this before?
@wkiri I have not seen 504 before. Based on the definition of code 504 that I found online, the problem is most likely on the ADS side; an internal ADS service may have timed out during our request. If this is a one-time error, then I wouldn't worry too much. However, if it is a repeated error, then I should investigate it more. Could you please let me know the filenames for the 5 documents that gave the 504 warning?
> The HyperText Transfer Protocol (HTTP) 504 Gateway Timeout server error response code indicates that the server, while acting as a gateway or proxy, did not get a response in time from the upstream server that it needed in order to complete the request.
The `0 document found in the ADS database` warning is expected when the ADS query returns nothing.
Here are the documents that gave the 504 error (all in `/proj/mte/data/corpus-lpsc/mpf-pdf/`):
1998_1137.pdf, 1998_1444.pdf, 2000_2104.pdf, 2001_1259.pdf
I just ran again on `1998_1137.pdf` and got no error, so it was probably transient (or maybe we were sending too many requests back to back?). I can try re-running the entire collection, but if we get any failures, we still have gaps in our data set. As I noted, we can fix these manually, but it might be worth considering more robust solutions, such as waiting 1 second between ADS queries and/or retrying at least twice before moving on to the next document.
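A short retry loop with a pause would cover both ideas (a sketch, not what ads_parser.py currently does; the function and parameter names are illustrative):

```python
import time
import requests

def ads_query(url, params, retries=2, delay=1.0):
    # Retry transient server-side failures (e.g., 504) a couple of
    # times, pausing between attempts to avoid hammering the ADS API.
    response = None
    for attempt in range(retries + 1):
        response = requests.get(url, params=params, timeout=30)
        if response.status_code < 500:
            return response
        time.sleep(delay)
    return response
```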
And for completeness, these two gave the "0 documents" result:
2006_2424.pdf, 2003_1088.pdf
(It seems it is 4 and 2 documents in each category, not 5 and 1 as I reported before.)
I ran the same process again for the MPF documents and this time there are no 504 messages.
These documents are not found, so I will add their information manually:
2003_1088.pdf 2006_2424.pdf
Amusingly, neither of these documents had any MPF mentions so they do not propagate to the final DB :)
14 of the 67 MPF documents do not have the affiliations field populated (I guess they were too hard for grobid to parse). I reviewed the affiliations that were populated, and they are unfortunately not very high quality, maybe because different author teams use different styles. I think making this field look good would take significant manual effort.
What do you think of removing the affiliations field entirely? I am not sure it adds a lot of value. Users don't generally use this information to cite work, and if they want to check, they can access the original PDF via the `doc_url` field. This would mean a change to our schema.
I have also generated the PHX database. 4 of its 36 documents have missing affiliations.
33 of the 315 MER-A (mer2) documents have missing affiliations.
I've generated the complete bundle and shared it here:
/proj/mte/pds-deliveries/bundle_v1.3.0/
This bundle passes the validate tool:

```
$ /proj/mte/pds4_validation_tool/bin/validate -R pds4.bundle /proj/mte/pds-deliveries/bundle_v1.3.0/
```

Please take a look and see if you are happy with delivering this bundle.
Also, once I generated this bundle, I was reminded that we are already not putting the affiliations field in our final tables! So we can ignore my question above about omitting it.
After more thought, I think it makes sense to use regular versioning per file. If the contents have changed from our last delivery, we will advance the version to 1.3. Otherwise, we will leave them unchanged (1.2). The new files (alias tables and all MER-2 content) will individually be version 1.0.
(I am working on this myself)
Please note that this approach requires that the template files be manually updated for each release, because if the file they describe does NOT change, then we do not want this entry at the beginning of the `<Modification_History>` section:

```
$today 1.3
```

But if the file they describe DOES change, we do want it.
However, I guess we already had to update the template files for each release anyway, to add the new version number and information. So this is not a change in effort, but it does mean each template file should be checked carefully for correct versioning.
I've updated the bundle in /proj/mte/pds-deliveries/bundle_v1.3.0/ and checked it with `validate`.
@stevenlujpl Please take a look and let me know if you approve this bundle for delivery. If you can look at it today, that would be great!