wikipathways / rWikiPathways

R package for WikiPathways API
https://r.wikipathways.org/
MIT License
15 stars 6 forks source link

getXrefList(...) method is broken #30

Open OliveIsLazy opened 1 year ago

OliveIsLazy commented 1 year ago

For a few months, I've struggled with getting this method to work (first issues relating to getting geneset names that later resolved itself, and now issues with getting the actual genes in the genesets), which is strange because the issues started after the last commit.

The code I use is:

# Get metadata.
pw <- listPathways("Mus musculus")

# Compile pathways.
gs <- list()
for (i in pw$id)
{
  print(i)
  gs[[length(gs)+1]] <- getXrefList(i, "L")
  print(gs[[length(gs)]])
}

This code used to work for me, but now it produces incomplete returns like this (I only copied the first few outputs):

[1] "WP1"
[1] ""
[1] "WP10"
 [1] "11651"  "12702"  "14784"  "16186"  "16198"  "16199"  "16367"  "16451"  "16453"  "18708"  "19247"  "20416"  "20846" 
[14] "20848"  "20850"  "20851"  "26395"  "26396"  "26413"  "26417"  "269523" "384783" "54721"  "81601" 
[1] "WP103"
[1] ""
[1] "WP108"
[1] ""
[1] "WP113"
[1] ""
[1] "WP116"
[1] ""
[1] "WP1241"
[1] ""
[1] "WP1242"
[1] ""
[1] "WP1243"
[1] ""
[1] "WP1244"
[1] ""

I was wondering if it had to do with calling bridgedb's datasources repo, as their datasources.tsv file has been moved from the path described in location in getXrefList(...)'s help page.

Please tell me if you need more information!

egonw commented 1 year ago

Hi @OliveIsLazy, we're running into a few issues ourselves. Thank you for our report. I hope we can fix things as soon as possible.

mkutmon commented 1 year ago

@OliveIsLazy as an alternative, you can download the mouse pathway collection in GMT format which has all the entrez gene ids for every approved mouse pathway. should also be a lot faster.

library(rWikiPathways)

wp.mm <- rWikiPathways::downloadPathwayArchive(date="20230210", organism="Mus musculus", format="gmt")
wp2gene <- readPathwayGMT(wp.mm)
OliveIsLazy commented 1 year ago

@mkutmon This method has helped! The dataframe had a lot of the information I needed. None of the pathways have no genes in them, though pathway WP4627 (" Lipids measured in liver metastasis from breast cancer") has only one gene it in. Here is the code block I used in case anyone else is interested.

#' Generate a list of lists, where each sublist is a WikiPathway genesets.
#' The sublist names correspond to the genesets' name.
#' The sublist values correspond to the genes in the geneset.
#' bug report: https://github.com/wikipathways/rWikiPathways/issues/30#issuecomment-1460013132
#' #' DATE: 08-03-2023
get_wikipathways_two <- function() {

  # Get metadata.
  wp.mm <- rWikiPathways::downloadPathwayArchive(date="20230210", organism="Mus musculus", format="gmt")
  wp2gene <- readPathwayGMT(wp.mm)

  # Set up key variables.
  pwid <- sort(unique(wp2gene$wpid))
  gs <- list()

  # For each pathway id, find the rows with a matching ID and save the associated genes in a list
  for (i in pwid)
  {
    gs[[length(gs)+1]] <- wp2gene[(wp2gene$wpid==i), "gene"]

    # Acts as a fail-safe in case there's an issue with the id-pathwayname mapping.
    names(gs)[length(gs)] <- unique(wp2gene[(wp2gene$wpid==i), "name"])
  }

  return(gs)
}

By the way, I checked the number of separate genesets in the function you sent me and it totals to 199, which is less than the amount I recorded from December 14th 2022, 238. And indeed, the function I posted initially still generate 238 genesets. I checked and the 199 current pathways appear to all be a subset of the previous 238. Do you know what happened to these missing pathways or how I could get access to them?

On that note, I'd also like to know if any of the pathway definitions (aka the genes in them) has changed since this date, if that's something you're able to tell me.

Thank you very much for your help and your quick response, it is much appreciated.

mkutmon commented 1 year ago

Ah, the GMT file only contains approved pathways. Pathways under development and not finished pathways are not included in our monthly releases. I will double-check if that is indeed the case for those 39 mouse pathways. Regarding the changes of the pathways > the GMT files are created monthly always on the 10th. So you simply change the date in the downloadPathwayArchive function and go back to see if anything changed.

Hope that helps!

OliveIsLazy commented 1 year ago

I tried messing around a little and I haven't found any changes Dec 2022 and Feb 2023, but I really appreciate being able to confirm that!

This next part doesn't really relate to the coding backend, so I apologise if it's a bit rude, but I went and checked for which pathways I was missing for my analysis I'm in need of these five pathways:

I hope to hear back soon on whether or not their gene lists can be made available again.

DeniseSl22 commented 1 year ago
  • "SRF and miRs in Smooth Muscle Differentiation and Proliferation"

WP2081 --> not Approved; Homology Converted

  • "Toll-like receptor signaling pathway"

WP1271 --> Not Approved; Homology Converted

  • "Adar1 editing defficiency immune response"

WP3415 --> Not Approved

  • "Elongation of (very) long chain fatty acids"

WP4491 --> Not Approved

  • "GPCRs, non-odorant"

WP1396 --> Approved, so this one should be available in the GMT download @mkutmon ...? I checked, and the GMT release of Feb. 2023 includes this PW ID, so not sure why it's not available in R?

And, how can someone make a GMT file for unapproved PWs? @mkutmon @egonw ? Maybe we should make that option available in the future?

mkutmon commented 1 year ago

WP1396 is also in the R table. Creating a GMT file for all pathways might be possible but it's likely not something we want to distribute since the pathways might not be ready for analysis yet. I restarted our indexer but the xreflist function still shows a bit of inconsistent behavior. I'll need to do more testing.

AlexanderPico commented 1 year ago

Agreed. GMTs serve a specific purpose for enrichment analysis. We really don't want to make or share GMTs with unapproved content.

OliveIsLazy commented 1 year ago

I can confirm that WP1396 is in the R table. I checked using the pathway names where the December 2022 version is called "GPCRs, non-odorant" and the February 2023 version is of the same name except with a space added at the end, hence the mismatch.

My apologies for the confusion.

egonw commented 1 year ago

Okay, we worked hard to solve a number of issues, including fixes in the BridgeDb Webservice, the ID mappings databases, uptime of the data.wikipathways.org server, and the indexer. I guess one of the last two was underlying your observation. @OliveIsLazy, can you confirm your original code is working again? I get:

> getXrefList("WP3415", "L")
 [1] "100038882" "100039796" "100702"    "106759"    "110454"    "12675"    
 [7] "142980"    "15945"     "15957"     "15959"     "15962"     "15975"    
[13] "15977"     "15979"     "16150"     "16151"     "17067"     "17858"    
[19] "18033"     "18035"     "19765"     "20304"     "20558"     "20684"    
[25] "21353"     "21822"     "22031"     "22034"     "223672"    "228607"   
[31] "230073"    "231655"    "23960"     "23961"     "23962"     "240327"   
[37] "240328"    "24110"     "246727"    "246728"    "246729"    "246730"   
[43] "26409"     "270151"    "320377"    "327959"    "434341"    "54123"    
[49] "54131"     "55932"     "56066"     "56417"     "56480"     "56489"    
[55] "58185"     "58203"     "60440"     "620913"    "623121"    "626578"   
[61] "66513"     "66724"     "667370"    "667373"    "67775"     "68652"    
[67] "71586"     "71898"     "74498"     "80861"     "99899"    

But I cannot say the results are very stable. It seems that ID mapping does not take place:

> getXrefList("WP2338", "En")
[1] "ENSG00000100697" "ENSG00000113360" "ENSG00000124571" "ENSG00000128191"
> getXrefList("WP2338", "L")
[1] "5901"

The returned Ensembl and NCBI identifiers here are only those that are explicitly in the pathways and none of the mapped identifiers. We see this problem also on BioConductor: https://bioconductor.org/checkResults/release/bioc-LATEST/rWikiPathways/lconway-checksrc.html

Basically, it seem the reboot of the Indexer solved some problems, but not all. The Indexer is not using the latest ID mapping file, which we hope to get from @tabbassidaloii soon.

mkutmon commented 1 year ago

I created an issue in the indexer repo: https://github.com/wikipathways/lucene-indexer/issues/4. It has nothing to do with the new mapping databases so I need more time to actually look at the code and check what is happening and why the ID mapping doesn't work anymore.

OliveIsLazy commented 1 year ago

I ran the code I originally posted, and all the genesets that I needed to match did based on the name (except the "GPCRs, non-odorant" but only due to the aforementioned naming difference). It generated 238 pathways, as it did in December. I tried checking the code again against another older file I had that saved 195 of the genesets' WikiPathway IDs, and all 195 genesets were accounted for.

As @egonw pointed out, there are issues with the ID mapping. 71 of the genesets have either no or one EntrezIDs in them. Here's two examples:

$`Amino acid conjugation of benzoic acid`
[1] "60525"

$Apoptosis
[1] ""

Thank you all very much for helping me, I really appreciate the hard work.

DeniseSl22 commented 1 year ago

@OliveIsLazy ; I'm assuming you are running your code for the mouse species, right?

@egonw and @mkutmon : seems like the mapping to other databases is (still) not happening. When I execute the regular xRef mapping API call in the BridgeDb webservice, I do get a repsonse:

curl -X GET "https://webservice.bridgedb.org/Mus%20musculus/xrefs/En/ENSMUSG00000000817?dataSource=L" -H "accept: */*"

14103   Entrez Gene

When I try out the batch mapping, I get no matches:

curl -X POST "https://webservice.bridgedb.org/Mouse/xrefsBatch/En?dataSource=L" -H "accept: */*" -H "Content-Type: text/html" -d "ENSMUSG00000000817ENSMUSG00000001729ENSMUSG00000003184ENSMUSG00000003873"

TypeError: Failed to fetch

The same happens when I use the batch mapping, but just input 1 ID.

OliveIsLazy commented 1 year ago

Yes, all the code I've ran has been only for the mice pathways.