Populating and updating GPML database

AlexanderPico commented 2 years ago

Three options:

1. Initialize the database by recreating the history of GPML changes. Write a script to update GPMLs (nightly/weekly), potentially including author and revision information.
2. Initialize the database by recreating the history of GPML changes. Write a script to update GPMLs (nightly/weekly), without any other information.
3. Initialize the database with the latest GPML (with no or minimal other info). Write a script to update GPMLs (nightly/weekly), without any other information.

ariutta commented 2 years ago

The repo name we chose is https://www.github.com/wikipathways/wikipathways-database

ariutta commented 2 years ago

Code to get AnalysisCollection pathways added or updated since yesterday:

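import base64
from datetime import date, timedelta
from pathlib import Path

import requests

# timestamp for midnight yesterday, formatted as YYYYMMDDhhmmss
# (timedelta handles month/year boundaries; strftime zero-pads month and day)
yesterday = date.today() - timedelta(days=1)
timestamp = yesterday.strftime("%Y%m%d") + "000000"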
changes_url = (
    "https://webservice.wikipathways.org/getRecentChanges?timestamp="
    + timestamp
    + "&format=json"
)
changes_r = requests.get(changes_url)
changes_result = changes_r.json()

for pathway in changes_result["pathways"]:
    wpid = pathway["id"]
    curation_tags_url = (
        "https://webservice.wikipathways.org/getCurationTags?format=json&pwId="
        + wpid
    )
    curation_tags_r = requests.get(curation_tags_url)
    curation_tags_result = curation_tags_r.json()
    for tag in curation_tags_result["tags"]:
        if tag["name"] == "Curation:AnalysisCollection":
            gpml_url = (
                "https://webservice.wikipathways.org/getPathwayAs?format=json&fileType=gpml&pwId="
                + wpid
            )
            gpml_r = requests.get(gpml_url)
            gpml_result = gpml_r.json()
            gpml = base64.b64decode(gpml_result["data"])
            p = Path(wpid).with_suffix(".gpml")
            p.write_bytes(gpml)

ariutta commented 2 years ago

@mkutmon added the .tsv and .info files, e.g.: https://github.com/wikipathways/wikipathways-database/tree/main/pathways/WP1

This means we have a GPML repo without history but with all analysis collection pathways (cleaned up GPMLs with ID and author info + .tsv structure data files).

Later, we'll want to edit this to exclude some of the current AnalysisCollection pathways, e.g., the pathways identified for exclusion in Kristina’s spreadsheet. This will approximately be the following:

AnalysisCollection - NeedsWork - Stub - (“keep” != “yes” in a new Homology spreadsheet) - (“keep” != “yes” in Kristina’s spreadsheet)
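As a rough sketch, the set arithmetic above could be written with Python sets. The WPIDs and variable names here are made up for illustration; in practice each set would be loaded from the curation tags or the spreadsheets.

```python
# Hypothetical WPID sets; real values would come from curation tags and spreadsheets.
analysis_collection = {"WP1", "WP2", "WP3", "WP4", "WP5"}
needs_work = {"WP2"}
stub = {"WP3"}
homology_not_keep = {"WP4"}         # "keep" != "yes" in the Homology spreadsheet
kristina_not_keep = {"WP2", "WP5"}  # "keep" != "yes" in Kristina's spreadsheet

# Subtract each exclusion set from the AnalysisCollection set in turn.
keep = (
    analysis_collection
    - needs_work
    - stub
    - homology_not_keep
    - kristina_not_keep
)
print(sorted(keep))
```

Set difference ignores IDs that are not in the left-hand set, so it does not matter if an exclusion list contains pathways that were never in AnalysisCollection.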

ariutta commented 2 years ago

Sample GH Action for running Java 8: https://github.com/wikipathways/wikipathways-database/blob/main/.github/workflows/pathvisio.yml

Sample output from that action: https://github.com/wikipathways/wikipathways-database/runs/4590309688?check_suite_focus=true

khanspers commented 2 years ago

@ariutta : Related to your comment above about excluding pathways based on the spreadsheet, by filtering the spreadsheet on "Analysis Tagged" and "keep", one can get a list to exclude from AnalysisCollection directly, so more like this:

AnalysisCollection  -  (“keep” != “yes” AND "analysis tagged" = "yes" in Kristina’s spreadsheet)

So only those NeedsWork and Stub pathways that are tagged with Analysis and labeled as delete/unclear are relevant. Other pathways in the spreadsheet are just tagged as NeedsWork or Stub and not currently in Analysis.
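A minimal sketch of that filter, assuming the spreadsheet tab is exported as CSV. The column headers ("wpid", "analysis tagged", "keep") and the sample rows are assumptions for illustration, not the spreadsheet's actual layout.

```python
import csv
import io

# Made-up rows standing in for a CSV export of the spreadsheet.
# The column names are assumptions, not the real headers.
SAMPLE_CSV = """wpid,analysis tagged,keep
WP10,yes,yes
WP11,yes,delete
WP12,no,delete
WP13,yes,unclear
"""


def wpids_to_exclude(csv_text):
    """Return WPIDs that are Analysis-tagged but not marked keep == "yes"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["wpid"]
        for row in reader
        if row["analysis tagged"].strip().lower() == "yes"
        and row["keep"].strip().lower() != "yes"
    }


# Hypothetical current AnalysisCollection membership.
analysis_collection = {"WP10", "WP11", "WP13", "WP14"}
exclude = wpids_to_exclude(SAMPLE_CSV)
keep = analysis_collection - exclude
print(sorted(keep))
```

WP12 is skipped by the filter because it is not Analysis-tagged, which matches the point that NeedsWork/Stub pathways outside Analysis are not relevant here.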

khanspers commented 2 years ago

The spreadsheet was getting confusing, so I have made a new tab with a smaller subset. This set represents only the overlap between Analysis collection and Stub/Needs Work. It DOES NOT include all pathways tagged as Stub/Needs Work.

The spreadsheet includes information on curation tasks, whether it's a Homology pathway, etc. Columns E-H contain my assessment of the status of the pathway and whether it should be included in the Analysis collection for the new site.

https://docs.google.com/spreadsheets/d/1DHX_FbSwmeTxOXp8ar5BmGKQtOSYriUTtzYajjAlg8c/edit#gid=471432093

How to use:

- To find the list of WPIDs to REMOVE from the current Analysis collection, filter on column H ("keep") for anything other than "yes". This will work right now, but will exclude some pathways that could be included after some curation (keep reading).
- To find pathways worthy of inclusion after some curation, filter on column H for "pending curation". Feel free to sign up for any of these by adding your name in column F.
- To find WPIDs that I wasn't sure about, filter on column H for "unclear". These are mostly of the same type: yeast pathways that have labels instead of nodes, unconnected interactions, and missing metabolites. I think we should probably remove them.
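For anyone scripting these filters rather than using the spreadsheet UI, here is a small sketch that groups WPIDs by the value in column H. The rows are made up, and the tuple layout is an assumption about how the data would be loaded.

```python
from collections import defaultdict

# Made-up rows; each tuple is (WPID, value in column H "keep").
rows = [
    ("WP20", "yes"),
    ("WP21", "pending curation"),
    ("WP22", "unclear"),
    ("WP23", "no"),
]

# Group WPIDs by normalized "keep" value.
by_status = defaultdict(list)
for wpid, keep_value in rows:
    by_status[keep_value.strip().lower()].append(wpid)

# First filter: everything other than "yes" would be removed.
to_remove = [wpid for wpid, keep_value in rows if keep_value.strip().lower() != "yes"]
# Second and third filters: candidates for curation and unclear cases.
pending = by_status["pending curation"]
unclear = by_status["unclear"]
print(to_remove, pending, unclear)
```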

I hope this is clearer than before.

AlexanderPico commented 2 years ago

Very helpful! There were only 53 rows, so I was able to manually review each one. Everything has a "yes" or "no" answer now.

For cases where the answer is "no" I simply removed the "Analysis" tag in the current database as well (and updated column I to "no"). This way our "keep" list can simply be derived from the "Analysis" tag as we transition from the current site to the new one.

AlexanderPico commented 2 years ago

@khanspers Do we also need a tab for the intersection of "homology" and "analysis" in order to remove some of those as well?

khanspers commented 2 years ago

Yes, we need to do the same (or something similar) for homology. Was going to start on that next.

khanspers commented 2 years ago

@AlexanderPico @ariutta : Thanks Alex for doing the final review of the spreadsheet! Question: Should I also go ahead and untag the ones we have decided to remove at the current site? Or does that not matter going forward, once the new database has been populated once?

AlexanderPico commented 2 years ago

I did that already for all the ones I marked 'no.' I think we should keep our tagging on the current site updated in this way so that we can simply use it directly to inform the "keep" set. This way our two sites will be in sync during the (potentially very long) period where we have both running at the same time.

khanspers commented 2 years ago

I added a sheet for Homology:

- This is based on the "getEveryCurationTag" method, so it includes all pathways tagged as Homology.
- The last two columns show the revision number for the tagged revision and the latest revision, to deduce which pathways have been updated since they were last tagged (19 of 152).
- I have not done any manual checking of these yet, since it's not clear what we want to do with the information. https://docs.google.com/spreadsheets/d/1DHX_FbSwmeTxOXp8ar5BmGKQtOSYriUTtzYajjAlg8c/edit#gid=632216275
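The revision comparison could be sketched roughly as follows. The field names and revision numbers are hypothetical and do not reflect the sheet's actual headers or data.

```python
# Made-up records; field names are hypothetical, not the spreadsheet's headers.
homology_rows = [
    {"wpid": "WP30", "tagged_revision": 100, "latest_revision": 100},
    {"wpid": "WP31", "tagged_revision": 101, "latest_revision": 140},
    {"wpid": "WP32", "tagged_revision": 102, "latest_revision": 102},
]

# A pathway counts as updated if its latest revision differs from the tagged one.
updated_since_tagged = [
    row["wpid"]
    for row in homology_rows
    if row["latest_revision"] != row["tagged_revision"]
]
print(f"{len(updated_since_tagged)} of {len(homology_rows)} updated since tagging")
```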

New issue for homology converted content: https://github.com/wikipathways/wikipathways-development/issues/36

Populating and updating GPML database #24