Closed AlexanderPico closed 2 years ago
The repo name we chose is https://www.github.com/wikipathways/wikipathways-database
Code to get AnalysisCollection pathways added or updated since yesterday:
import base64
from datetime import date, timedelta
from pathlib import Path

import requests

# Timestamp for yesterday at 00:00:00, zero-padded as YYYYMMDDhhmmss
yesterday = date.today() - timedelta(days=1)
timestamp = yesterday.strftime("%Y%m%d") + "000000"

# Pathways changed since the timestamp
changes_url = (
    "https://webservice.wikipathways.org/getRecentChanges?timestamp="
    + timestamp
    + "&format=json"
)
changes_r = requests.get(changes_url)
changes_result = changes_r.json()

for pathway in changes_result["pathways"]:
    wpid = pathway["id"]

    # Curation tags for this pathway
    curation_tags_url = (
        "https://webservice.wikipathways.org/getCurationTags?format=json&pwId="
        + wpid
    )
    curation_tags_r = requests.get(curation_tags_url)
    curation_tags_result = curation_tags_r.json()

    for tag in curation_tags_result["tags"]:
        if tag["name"] == "Curation:AnalysisCollection":
            # Download the GPML (base64-encoded in the JSON response)
            gpml_url = (
                "https://webservice.wikipathways.org/getPathwayAs?format=json&fileType=gpml&pwId="
                + wpid
            )
            gpml_r = requests.get(gpml_url)
            gpml_result = gpml_r.json()
            gpml = base64.b64decode(gpml_result["data"])

            # Write it to <WPID>.gpml in the working directory
            p = Path(wpid).with_suffix(".gpml")
            p.write_bytes(gpml)
@mkutmon added the .tsv and .info files, e.g.:
https://github.com/wikipathways/wikipathways-database/tree/main/pathways/WP1
This means we have a GPML repo without history but with all analysis collection pathways (cleaned up GPMLs with ID and author info + .tsv structure data files).
Later, we'll want to edit this to exclude some of the current AnalysisCollection pathways, e.g., the pathways identified for exclusion in Kristina’s spreadsheet. This will approximately be the following:
AnalysisCollection - NeedsWork - Stub - (“keep” != “yes” in a new Homology spreadsheet) - (“keep” != “yes” in Kristina’s spreadsheet)
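In case it helps to see the subtraction written out, here is a minimal sketch using plain Python sets. It assumes each collection has already been reduced to a set of WPID strings; the function name, the argument names, and the sample IDs are all made up for illustration and are not taken from the repo.

```python
def proposed_collection(analysis, needs_work, stub, homology_drop, review_drop):
    """Return the WPIDs left after the four subtractions.

    Every argument is a set of WPID strings. homology_drop and review_drop
    stand for the rows where "keep" != "yes" in the two spreadsheets.
    """
    return analysis - needs_work - stub - homology_drop - review_drop


# Made-up IDs, for illustration only
analysis = {"WP1", "WP2", "WP3", "WP4", "WP5"}
result = proposed_collection(
    analysis,
    needs_work={"WP2"},
    stub={"WP3"},
    homology_drop={"WP4"},
    review_drop={"WP2", "WP9"},  # WP9 is not in analysis, so it has no effect
)
print(sorted(result))  # ['WP1', 'WP5']
```

Set difference ignores IDs that are not in the left-hand set, so it is harmless if the spreadsheets list pathways outside the AnalysisCollection.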
Sample GH Action for running Java 8: https://github.com/wikipathways/wikipathways-database/blob/main/.github/workflows/pathvisio.yml
Sample output from that action: https://github.com/wikipathways/wikipathways-database/runs/4590309688?check_suite_focus=true
@ariutta : Related to your comment above about excluding pathways based on the spreadsheet, by filtering the spreadsheet on "Analysis Tagged" and "keep", one can get a list to exclude from AnalysisCollection directly, so more like this:
AnalysisCollection - (“keep” != “yes” AND "analysis tagged" = "yes" in Kristina’s spreadsheet)
So only those NeedsWork and Stub pathways that are tagged with Analysis and labeled as delete/unclear are relevant. Other pathways in the spreadsheet are just tagged as NeedsWork or Stub and not currently in Analysis.
The spreadsheet was getting confusing, so I have made a new tab with a smaller subset. This set represents only the overlap between Analysis collection and Stub/Needs Work. It DOES NOT include all pathways tagged as Stub/Needs Work.
The spreadsheet includes information on curation tasks, whether it's a Homology pathway, etc. Columns E-H contain my assessment of each pathway's status and whether it should be included in the Analysis collection for the new site.
How to use:
To find the list of WPIDs to REMOVE from the current Analysis collection, filter on column H ("keep") for anything other than "yes". This will work right now, but will exclude some pathways that could be included after some curation (keep reading).
To find pathways worthy of inclusion after some curation, filter on column H for "pending curation". Feel free to sign up for any of these by adding your name in column F.
To find WPIDs that I wasn't sure about, filter on column H for "unclear". These are mostly of the same type: yeast pathways that have labels instead of nodes, unconnected interactions, and missing metabolites. I think we should probably remove them.
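The three filters above could also be run on a CSV export of the tab. This is only a sketch: the column headers "WPID" and "keep" are assumptions about how the export would be labeled, the helper names are hypothetical, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def group_by_keep(csv_text, id_column="WPID", keep_column="keep"):
    """Map each normalized "keep" value to the list of WPIDs that carry it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = (row.get(keep_column) or "").strip().lower()
        groups[status].append(row[id_column])
    return dict(groups)


def ids_to_remove(groups):
    """Every WPID whose "keep" value is anything other than "yes"."""
    return sorted(
        wpid
        for status, ids in groups.items()
        if status != "yes"
        for wpid in ids
    )


# Invented rows standing in for a CSV export of the spreadsheet tab
sample = """WPID,keep
WP10,yes
WP11,pending curation
WP12,unclear
WP13,no
"""

groups = group_by_keep(sample)
print(ids_to_remove(groups))                # ['WP11', 'WP12', 'WP13']
print(groups.get("pending curation", []))   # ['WP11']
print(groups.get("unclear", []))            # ['WP12']
```

Normalizing case and whitespace before comparing keeps entries like "Yes " from landing in the removal list by accident.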
I hope this is clearer than before.
Very helpful! There were only 53 rows, so I was able to manually review each one. Everything has a "yes" or "no" answer now.
For cases where the answer is "no" I simply removed the "Analysis" tag in the current database as well (and updated column I to "no"). This way our "keep" list can simply be derived from the "Analysis" tag as we transition from the current site to the new one.
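A rough sketch of deriving the "keep" list from the tag alone, assuming the getCurationTags responses have the same shape the script earlier in this thread relies on (a "tags" list whose entries have a "name"). The helper names and the hand-written sample responses are hypothetical, and the non-Analysis tag names in the sample are illustrative only.

```python
ANALYSIS_TAG = "Curation:AnalysisCollection"


def has_analysis_tag(curation_tags_result):
    """True if a getCurationTags-style response lists the Analysis tag."""
    tags = curation_tags_result.get("tags", [])
    return any(tag.get("name") == ANALYSIS_TAG for tag in tags)


def keep_list(tags_by_wpid):
    """WPIDs whose curation tags include the Analysis tag, sorted."""
    return sorted(
        wpid for wpid, result in tags_by_wpid.items() if has_analysis_tag(result)
    )


# Hand-written stand-ins for webservice responses, keyed by made-up WPIDs
tags_by_wpid = {
    "WP20": {"tags": [{"name": "Curation:AnalysisCollection"}]},
    "WP21": {"tags": [{"name": "Curation:NeedsWork"}]},
    "WP22": {"tags": []},
    "WP23": {"tags": [{"name": "Curation:Stub"}, {"name": "Curation:AnalysisCollection"}]},
}

print(keep_list(tags_by_wpid))  # ['WP20', 'WP23']
```

With the tag as the single source of truth, no spreadsheet lookup is needed at sync time; a pathway untagged on the current site simply drops out of the list.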
@khanspers Do we also need a tab for the intersection of "homology" and "analysis" in order to remove some of those as well?
Yes, we need to do the same (or something similar) for homology. Was going to start on that next.
@AlexanderPico @ariutta : Thanks Alex for doing the final review of the spreadsheet! Question: Should I also go ahead and untag the ones we have decided to remove at the current site? Or does that not matter going forward, once the new database has been populated once?
I did that already for all the ones I marked 'no.' I think we should keep our tagging on the current site updated in this way so that we can simply use it directly to inform the "keep" set. This way our two sites will be in sync during the (potentially very long) period where we have both running at the same time.
I added a sheet for Homology:
New issue for homology converted content: https://github.com/wikipathways/wikipathways-development/issues/36
Three options: