wkiri / MTE

Mars Target Encyclopedia
Apache License 2.0

Generate MER-A PDS4 bundle #27

Closed wkiri closed 2 years ago

wkiri commented 2 years ago

Current output (with Kiri's triage but prior to expert full review): https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette

wkiri commented 2 years ago

Each of our 3 expert reviewers reviewed 10 documents.

Findings:

(Attached charts: annot_times, annot_times_by_rels)

wkiri commented 2 years ago

I completed my quick pass through the 1303 MER-A documents to remove spurious Target annotations and reduce the reviewing effort that will be required. The documents are at: https://ml.jpl.nasa.gov/mte/brat/#/mer-a/all-jsre-gazette

                                  Targets (unique)   Contains   Docs with at least one Target   Docs with at least one Contains relation
Salient target NER + gazette      3393 (313)         3320       583                             225
After removing spurious Targets   2944 (278)         2958       390                             200

Our goal is now to review the 200 documents with at least one Contains relation. Using our estimate of 10 minutes as the average time to review one document, this works out to 33 hours of reviewing, or about 11 hours per person with 3 reviewers.

wkiri commented 2 years ago

Note: after adding the HasProperty relations, it is possible that the set to review will increase.

wkiri commented 2 years ago

I've combined the Contains and HasProperty relations (from jSRE) manually to make them simultaneously reviewable for 7 documents: https://ml.jpl.nasa.gov/mte/brat/#/mer-a/mer-a-jsre-contains+hasproperty-10withrel/ We will use these as examples to converge on standard review guidelines for the rest of the MER corpus.

wkiri commented 2 years ago

Reviewing Guidelines are here and will be refined in conversation with Leslie, Matt, and Raymond: https://docs.google.com/document/d/1KnkVtxkKb9kcRVKZqPWwuXSi_6zK8Scay1M6qiz1Khc/edit#

wkiri commented 2 years ago

I have reviewed the first 30 (of 397) MER-A documents. It took an average of 1.77 minutes per document so far (including the time taken to fix the gazettes when I realized that Anatolia is a MER-B target, not MER-A). I am compiling questions for expert review here (so far just one case): https://docs.google.com/spreadsheets/d/1yLmoy8IrTn_I38J3DGQDs0ScQ1jxkfCGaipR8WMwxhs/edit#gid=0

stevenlujpl commented 2 years ago

@wkiri I added the template files for MERA and MERB. Please see the following example bundle generated with the MERA template files. Please note that this example bundle was generated using the MPF SQLite DB file, so please ignore the CSV files in the bundle and only check the contents of the XML label files. Please let me know what you think. mera_bundle.zip

If we run the PDS validate tool on this bundle, there will be two warnings about not having XML label files for the manifest and md5 checksum files. I think we can safely ignore them.

...
PASS: file:/home/youlu/MTE/working_dir/test_mer_bundle/mera_bundle/mars_target_encyclopedia/urn-nasa-pds-mars_target_encyclopedia.manifest
      WARNING  [warning.file.not_referenced_in_label]   File is not referenced by any label
        3 integrity check(s) completed

PASS: file:/home/youlu/MTE/working_dir/test_mer_bundle/mera_bundle/mars_target_encyclopedia/urn-nasa-pds-mars_target_encyclopedia.md5
      WARNING  [warning.file.not_referenced_in_label]   File is not referenced by any label
        4 integrity check(s) completed
...
wkiri commented 2 years ago

@stevenlujpl I am having some trouble generating an SQLite database for MER-A documents. The ingestion step did not generate any errors. When I try to combine the brat .ann (reviewed) files with the JSON file, I get an error:

Traceback (most recent call last):
  File "./update_sqlite.py", line 131, in <module>
    main(**vars(args))
  File "./update_sqlite.py", line 64, in main
    mte_db.update_brat_doc(brat_doc)
  File "/home/wkiri/Research/MTE/git/src/sqlite_mte.py", line 503, in update_brat_doc
    self.insert_brat_doc(brat_doc)
  File "/home/wkiri/Research/MTE/git/src/sqlite_mte.py", line 359, in insert_brat_doc
    (brat_ann.ann_id, brat_doc.doc_id, e)
RuntimeError: Insertion error for target T59 in doc 2005_1202: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Here is the command I used:

$ export MISSION=mer-a
$ export VERSION=v1.0.0
$ export DB_FILE=/proj/mte/sqlite/mte_${MISSION}_all_$VERSION.db
$ export ANN_DIR=$MISSION-reviewed+properties-v2
$ export REVIEWER=Kiri
$ ./update_sqlite.py -r $REVIEWER -ro /proj/mte/results/$ANN_DIR $DB_FILE $MISSION > update-DB-$MISSION-$VERSION.log

You can find the document it is complaining about in /proj/mte/results/$ANN_DIR/2005_1202.ann. I do not see any UTF-8 content in this file and I am not sure what the text_factory error means. It is possible that we have some UTF-8 content in some files, so it does seem good to support this in general if need be.

As a minor note, update_sqlite.py has a restriction on what you can specify for the "mission" argument. Do you think we need this restriction? It will not run with MISSION set to "mer-a" unless I add it to the mission choices. But I do not see any conditional processing based on the mission. I think it just gets stored in the DB file. If so, it seems fine to remove the restriction to particular choices. However, if there is some processing that only works for the named missions, we should keep it (and expand to support mer-a and mer-b). Please let me know your thoughts.

wkiri commented 2 years ago

As an update, I have now reviewed the first 100 MER-A documents, and Matt provided answers to my review questions to date. Average time taken is 3.7 minutes/document.

stevenlujpl commented 2 years ago

Not exactly sure why this issue was closed, but I will re-open it because we are not done with it yet.

(I think the issue was automatically closed because I merged the branch issue27 into master).

stevenlujpl commented 2 years ago

@wkiri The problem with the update_sqlite.py script should now be resolved. The insertion error was caused by inserting the following sentence into the database.

Identification and Characteristics of Gusev Olivine:  Pancam, Mini-TES, and Mössbauer spectra obtained by the Spirit Rover all confirm the presence of abundant olivine in Gusev plains basalts (Humphrey, Adirondack, and Mazatzal) [11].

The sentence was stored as a byte string in our scripts, but the database expects unicode strings. I decoded the sentence using utf-8 encoding. The problem seems to be resolved.
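A minimal Python 3 sketch of the decode fix (the actual MTE scripts appear to be Python 2, where inserting the raw byte string is what triggered the text_factory error; the table name here is illustrative, not the real schema):

```python
import sqlite3

# The offending sentence, stored as a UTF-8 byte string ("Mössbauer").
sentence_bytes = b"M\xc3\xb6ssbauer spectra obtained by the Spirit Rover"

# The fix: decode to a Unicode string before handing it to SQLite.
sentence = sentence_bytes.decode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (doc_id TEXT, sentence TEXT)")
conn.execute("INSERT INTO sentences VALUES (?, ?)", ("2005_1202", sentence))

row = conn.execute("SELECT sentence FROM sentences").fetchone()
print(row[0])  # contains "Mössbauer"
```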

stevenlujpl commented 2 years ago

@wkiri I've merged the changes I made in issue27 branch into the master branch. Now, master contains all the changes including the change I just made to fix update_sqlite.py. Thanks.

wkiri commented 2 years ago

@youlu Thank you for finding and fixing the string issue! Interesting, I think because the commit message had the word "fix" in it, github decided that counts as closing the issue :)

I again tried to generate a MER-A (mer2) database. I think I discovered another issue in the update process. I am getting this error:

RuntimeError: Insertion error for target T32 in doc 2015_2881: There are duplicated sentences in the sentences table: [2] Ashley J.

The check for duplicate sentences checks only the sentence content. If the same sentence appears in more than one document (which is what happened here - the same sentence is in document 2008_2382), then the addition fails:

https://github.com/wkiri/MTE/blob/441859f7df21a13ae20a50fecf0786c83e7f2ffb/src/sqlite_mte.py#L206-L219

However, I think the query for duplicates should check both sentence and document id, so that it does not think these are duplicate sentences. What do you think?
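A sketch of the proposed check keyed on both document id and sentence content (table and function names are illustrative, not the actual sqlite_mte.py code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sentences
                (doc_id TEXT, sentence TEXT,
                 UNIQUE (doc_id, sentence))""")

def insert_sentence(conn, doc_id, sentence):
    # Duplicate check on (doc_id, sentence), not sentence alone, so the
    # same sentence may legitimately appear in different documents.
    dup = conn.execute(
        "SELECT 1 FROM sentences WHERE doc_id = ? AND sentence = ?",
        (doc_id, sentence)).fetchone()
    if dup is None:
        conn.execute("INSERT INTO sentences VALUES (?, ?)",
                     (doc_id, sentence))

insert_sentence(conn, "2008_2382", "Ashley J.")
insert_sentence(conn, "2015_2881", "Ashley J.")  # same sentence, new doc: inserted
insert_sentence(conn, "2015_2881", "Ashley J.")  # true duplicate: skipped
print(conn.execute("SELECT COUNT(*) FROM sentences").fetchone()[0])  # 2
```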

stevenlujpl commented 2 years ago

@wkiri I've added doc_id to the duplicate-sentence check. This problem should be resolved. I didn't consider this situation because I didn't expect two documents to contain exactly the same sentence. Please let me know if you run into other problems. Thanks.

wkiri commented 2 years ago

@stevenlujpl It certainly seems like an unlikely occurrence, doesn't it? :)

Thank you for the update! I've generated a DB for mer2 that you can use for dev/testing of a mer2 website: /proj/mte/sqlite/mte_mer2_all_v1.0.0.db

When the reviewing process is complete, I will update this file.

wkiri commented 2 years ago

Note: Our current solution in name_utils.py to map aliases/typos to canonical target names will not work as-is since it does not differentiate between missions. PHX has a target "Baby Bear" that is abbreviated as "BB", and MER-A has a target "Breadbox" that is also abbreviated to "BB". We will need separate dictionaries per mission, or in general, transition to including an "aliases" table in the schema (per mission) as in issue #9.
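A per-mission lookup along the lines suggested above could be sketched like this (the dictionary structure and function name are illustrative, not the actual name_utils.py code):

```python
# Hypothetical per-mission alias dictionaries: the same abbreviation
# ("BB") resolves to different canonical targets depending on mission.
ALIASES = {
    "phx":   {"BB": "Baby_Bear"},
    "mer-a": {"BB": "Breadbox"},
}

def canonical_name(mission, name):
    # Fall back to the name itself if no alias is known for this mission.
    return ALIASES.get(mission, {}).get(name, name)

print(canonical_name("phx", "BB"))     # Baby_Bear
print(canonical_name("mer-a", "BB"))   # Breadbox
```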

wkiri commented 2 years ago

@stevenlujpl I attempted to generate a JSON file for the already-reviewed MER-A documents. However, I am getting an error:

[2021-12-01 14:17:18]: LPSC parser failed: /proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf
[2021-12-01 14:17:18]: 
Traceback (most recent call last):
  File "../../git/src/lpsc_parser.py", line 154, in process
    ads_dict['metadata'])
  File "../../git/src/lpsc_parser.py", line 51, in parse
    paper_dict = super(LpscParser, self).parse(text, metadata)
  File "/home/wkiri/Research/MTE/git/src/paper_parser.py", line 32, in parse
    assert type(text) == str or type(text) == unicode
AssertionError

Here is the command I used (only inputting one file for testing):

$ export JSON_FILE=/proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-first190.jsonl
$ export NER_MODEL=/proj/mte/trained_models/ner_MERA-property-salient.ser.gz
$ export GAZETTE=../../git/ref/MERA_salient_targets_minerals-2017-05_elements.gaz.txt
$ python ../../git/src/lpsc_parser.py -i /proj/mte/data/corpus-lpsc/mer-pdf/2004_1770.pdf -o $JSON_FILE -jr /proj/mte/jSRE/jsre-1.1 -n $NER_MODEL -g $GAZETTE -rt Contains HasProperty -jm /proj/mte/trained_models/jSRE-{lpsc15-merged-binary,hasproperty-mpf-phx-reviewed-v2}.model -l mer-a-contains-hasproperty-first190.log

Unfortunately I don't have time to look more deeply right now. If you get a chance before I do, please let me know what you find.

wkiri commented 2 years ago

This appears to be caused by an error with Tika, which sets the document content to None. I will keep investigating.

wkiri commented 2 years ago

@stevenlujpl I think we need to restart the Tika server. I see it running on mlia-compute1 under your username:

youlu 1536 1 0 Nov16 ? 00:09:15 java -cp /tmp/tika-server.jar org.apache.tika.server.TikaServerCli --port 9998 --host localhost

Could you kill this process and restart it as mteuser? (I think we restart simply by running Python and importing tika; is that right?)

stevenlujpl commented 2 years ago

@wkiri I have terminated the Tika server process running under my username. I don't think we need to explicitly start the Tika server; it will be started automatically when we run any of the parser scripts (e.g., jsre_parser.py, lpsc_parser.py, etc.).

wkiri commented 2 years ago

@stevenlujpl Thank you! It is working now. I will let you know when the new database is available.

Awkwardly, the Tika process is now under my username. I wish there were a way to restart it for all users, not just the last person who caused the server to start.

wkiri commented 2 years ago

The new database is now available at https://ml.jpl.nasa.gov/mte/mer2/

By the way, if you search for "Adirondack" and scroll down to "iron", it seems that "iron" is highlighted as a substring inside the name "Adirondack" :) It would be nice to update the matching logic to avoid that case :)
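One way to avoid the substring hit would be word-boundary matching. The sketch below is an assumption about how the site's highlighting could be changed, not its actual code:

```python
import re

text = "Adirondack contains iron-bearing minerals."

# Naive substring search also matches "iron" inside "Adirondack".
print(text.lower().count("iron"))  # 2

# Word-boundary matching finds only the standalone word.
matches = re.findall(r"\biron\b", text, flags=re.IGNORECASE)
print(len(matches))  # 1
```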

wkiri commented 2 years ago

I've reviewed 260/397 documents now; hoping to finish this up this week if possible.

wkiri commented 2 years ago

Great news! I finished reviewing all 397 documents. Of these, 317 contain at least one MER-A target. There are 3357 Target mentions in all. I'll generate a database using these annotations and provide additional statistics, then move on to generating a bundle and exercising the alias capability.

wkiri commented 2 years ago

I reviewed the content in the MTE DB and found some additional minor fixes needed for the annotations. I'm re-generating the database, which is now: /proj/mte/sqlite/mte_mer2_all_v1.3.1.db

One remaining issue is that the documents table does not have any information in the affiliations column. @stevenlujpl Could you check into this? If it is not provided by ADS, I think we should be falling back on grobid, which extracts it from the PDF (and we'd need to review). I am not sure yet why it is blank.

wkiri commented 2 years ago

@stevenlujpl I started grobid on mlia-compute1 and re-generated the JSON file, then the SQLite database. The affiliations field is still empty. Do you think there is another cause?

JSON file: /proj/mte/results/mer-a-jsre-v2-ads-gaz-CHP-all397-withgrobid.jsonl
SQLite database: /proj/mte/sqlite/mte_mer2_all_v1.3.2.db

(It's easier/faster to test with a single PDF - even in that case, the grobid fields are not populated.)

stevenlujpl commented 2 years ago

The grobid fields are not available in the jsonl file for some reason.

stevenlujpl commented 2 years ago

There must be something I don't understand yet. I tried running lpsc_parser.py twice: the first time the affiliation fields were empty, and the second time it worked.

The first time, I logged into mlia-compute1 and ran lpsc_parser.py without restarting any service. The JSONL file was generated without any error, but the grobid fields were all unavailable.

The second time, I restarted the parser server and repeated exactly what I did the first time, and it worked fine. Please see the JSONL and DB files in /home/youlu/MTE/working_dir/investigate_affiliations.

stevenlujpl commented 2 years ago

@wkiri Please let me know if there is still a problem.

wkiri commented 2 years ago

@stevenlujpl Ok, I will try it again!

wkiri commented 2 years ago

I tested on one document, and the grobid fields are now showing up. I will run on the entire collection now for MER-A.

wkiri commented 2 years ago

I have documented the quality/sanity checks I'm performing on the DB here: https://github-fn.jpl.nasa.gov/wkiri/mte/wiki/Generate-and-deliver-MTE-PDS4-bundle-to-PDS

wkiri commented 2 years ago

It works! The database now has affiliations in it for MER-A (mer2).

Interestingly, I went back and checked and found that our PHX database has affiliations, but the MPF database does not. I've included this (manual) check in the procedure above so we have a better chance of catching this if it happens again. (In addition to having lpsc_parser.py check for all needed services and terminate if some are not available.)

wkiri commented 2 years ago

Note: two interesting aliases came up where we did not have the canonical target name already in the targets table: B.hardy -> Bennett_Hardy and Fredericbrown -> Frederic_Brown.

To ensure that the canonical name from aliases maps to an entry in targets, I added these canonical names to MERA-targets-final.txt so they will appear in the targets table. This means that a re-run of lpsc_parser.py could potentially generate different NER/jSRE output. However, for MER-A we actually use ref/MER/MERA-targets-final-salient.gaz.txt as our gazette, which I have not modified, so the results should not change.

wkiri commented 2 years ago

I am now working on generating the updated databases for MPF and PHX. For MPF, we need to re-generate the JSON file so it will contain affiliations. When I ran this, 5 documents gave a warning such as:

/home/wkiri/Research/MTE/git/src/ads_parser.py:142: UserWarning: [WARNING] Failed accessing ADS database. The HTTP code is 504. The query string is year:1998 AND page:1137 AND pub:"Lunar and Planetary Science Conference"

And one document gave this warning:

/home/wkiri/Research/MTE/git/src/ads_parser.py:149: UserWarning: [Warning] 0 document found in the ADS database

We can work around ADS failures by manually entering the info for those documents, but I wanted to check with @stevenlujpl first because the 5 failures with code 504 (gateway timeout) are surprising to me. Have you seen this before?

stevenlujpl commented 2 years ago

@wkiri I have not seen 504 before. Based on the definition of code 504 that I found online, I think the problem is most likely on the ADS side. An ADS internal service might have timed out during our request. If this is just a one-time error, then I wouldn't worry too much. However, if it is a repeated error, then I should investigate it more. Could you please let me know the filenames for the 5 documents that gave the 504 warning?

The HyperText Transfer Protocol (HTTP) 504 Gateway Timeout server error response code indicates that the server, while acting as a gateway or proxy, did not get a response in time from the upstream server that it needed in order to complete the request.

The 0 document found in the ADS database warning is expected when the ADS query returns nothing.

wkiri commented 2 years ago

Here are the documents that gave the 504 error: (all in /proj/mte/data/corpus-lpsc/mpf-pdf/)

1998_1137.pdf 1998_1444.pdf 2000_2104.pdf 2001_1259.pdf

I just ran again on 1998_1137.pdf and got no error, so it seems it was probably transient (or maybe we were sending too many requests back to back?). I can try re-running the entire collection, but if we get any failures we still have gaps in our data set. As I noted, we can manually fix these, but it might be worth thinking about more robust solutions such as waiting 1 second between ADS queries and/or re-trying at least twice before moving on to the next document.
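The wait-and-retry idea could be sketched like this (a hypothetical helper, not part of the MTE codebase; with_retry and flaky_query are invented names, and the simulated failure stands in for an ADS 504 response):

```python
import time

def with_retry(fn, attempts=3, delay=0.0):
    # Hypothetical helper: retry a flaky call (e.g., an ADS query that
    # occasionally returns HTTP 504), pausing between attempts.
    last_err = None
    for _ in range(attempts):
        try:
            return fn()
        except Exception as err:
            last_err = err
            time.sleep(delay)
    raise last_err

# Simulate a query that fails once with a transient error, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("HTTP 504: Gateway Timeout")
    return {"title": "example abstract"}

print(with_retry(flaky_query)["title"])  # example abstract
```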

And for completeness, these two gave the "0 documents" result:

2006_2424.pdf 2003_1088.pdf

(It seems it is 4 and 2 documents in each category, not 5 and 1 as I reported before.)

wkiri commented 2 years ago

I ran the same process again for the MPF documents and this time there are no 504 messages.

These documents are not found, so I will add their information manually:

2003_1088.pdf 2006_2424.pdf

wkiri commented 2 years ago

Amusingly, neither of these documents had any MPF mentions so they do not propagate to the final DB :)

wkiri commented 2 years ago

14 of the 67 MPF documents do not have the affiliations field populated (I suppose those PDFs were too hard for grobid to parse). I reviewed the rest of the affiliations that were populated, and they are unfortunately not very high quality, maybe because different author teams use different styles. I think making this field look good would take significant manual effort.

What do you think of removing the affiliations field entirely? I am not sure it adds a lot of value. Users don't generally use this information to cite work, and if they want to check, they can access the original PDF via the doc_url field. This would mean a change to our schema.

wkiri commented 2 years ago

I have also generated the PHX database. 4 of its 36 documents have missing affiliations.

33 of the 315 MER-A (mer2) documents have missing affiliations.

wkiri commented 2 years ago

I've generated the complete bundle and shared it here: /proj/mte/pds-deliveries/bundle_v1.3.0/ This bundle passes the validate tool:

$ /proj/mte/pds4_validation_tool/bin/validate -R pds4.bundle /proj/mte/pds-deliveries/bundle_v1.3.0/

Please take a look and see if you are happy with delivering this bundle.

Also, once I generated this bundle I was reminded that we are already not putting the affiliations field in our final tables! So we can ignore my question about omitting it.

wkiri commented 2 years ago

After more thought, I think it makes sense to use regular versioning per file. If the contents have changed from our last delivery, we will advance the version to 1.3. Otherwise, we will leave them unchanged (1.2). The new files (alias tables and all MER-2 content) will individually be version 1.0.

wkiri commented 2 years ago

(I am working on this myself)

wkiri commented 2 years ago

Please note that this approach requires that the template files be manually updated for each release, because if the file they describe does NOT change, then we do not want this entry at the beginning of the <Modification_History> section:

$today 1.3

But if the file they describe DOES change, we do want this.

However, I guess we already had to update the template files for each release anyway, to add the new version number and information. So this is not a change in effort, but it does mean each template file should be checked carefully for correct versioning.
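For reference, a Modification_Detail entry of the kind described would look roughly like this in a PDS4 label (the date and description here are placeholders, standing in for $today and the release notes):

```xml
<Modification_History>
  <Modification_Detail>
    <modification_date>2022-01-15</modification_date>
    <version_id>1.3</version_id>
    <description>Updated to include MER-A (MER-2) content.</description>
  </Modification_Detail>
</Modification_History>
```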

wkiri commented 2 years ago

I've updated the bundle in: /proj/mte/pds-deliveries/bundle_v1.3.0/ and checked it with validate. @stevenlujpl Please take a look and let me know if you approve this bundle for delivery. If you can look at it today, that would be great!