Establish a BioThings API for access to PFOCR data

AlexanderPico commented 4 years ago

Create an API compliant with the SmartAPI standard using the BioThings SDK.

andrewsu commented 4 years ago

First step here would be to create a JSON file with some canonical gene ID (Entrez Gene or Ensembl) as the top-level key, and then an array of objects corresponding to the pathways that gene is a member of. Any reasonable object structure is probably fine, but it would be good to confirm that with @kevinxin90 before you lock it in.

@kevinxin90: if we want two endpoints in the API -- one to query by gene ID and one to query by pathway ID -- do we need two separate JSON files? Or can that be handled as part of the BioThings API creation?
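For illustration, the gene-keyed layout described above could be built along these lines. This is only a sketch: the input pairs, the `group_by_gene` helper, and the `"pathway"` field name are hypothetical and not taken from the PFOCR pipeline.

```python
import json
from collections import defaultdict

# Hypothetical input: (pathway figure ID, Entrez Gene IDs) pairs.
figure_genes = [
    ("figA", [1000, 207]),
    ("figB", [207, 208]),
]


def group_by_gene(pairs):
    """Return {gene ID: [{"pathway": figure ID}, ...]} with gene IDs as strings."""
    by_gene = defaultdict(list)
    for figure_id, genes in pairs:
        for gene in genes:
            # JSON object keys must be strings, so the gene ID is converted here.
            by_gene[str(gene)].append({"pathway": figure_id})
    return dict(by_gene)


gene_index = group_by_gene(figure_genes)
print(json.dumps(gene_index, indent=2, sort_keys=True))
```

The same pairs can be grouped the other way (pathway as key, genes as values), so a single source table is enough to produce either layout.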

kevinxin90 commented 4 years ago

Here is a sample JSON file: {"_id": "PMC5395363", "associatedWith": { "genes": [1000, 207, 208, 51384], "additionalData": "..." } }

For all BioThings APIs, we need a “_id” for each document serving as the primary key. In your case, it might be the PMC ID related to the gene-linked pathway figures.

That “_id” has to be unique. In case you have multiple records for the same PMC ID, you could structure your output JSON as below:

{"_id": "PMC5395363", "associatedWith": [ { "genes": [1000, 207, 208, 51384], "additionalData": "..." }, { "genes": [1033, 84], "additionalData": "..." } ] }

The file you send us could be a list (each element being a JSON document) or a text file with each line representing a JSON document.
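As a rough sketch of the "one JSON document per line" option, the snippet below writes and re-reads newline-delimited JSON and checks that every `_id` is unique. The helper names are hypothetical, the second `_id` (`PMC0000000`) is a made-up placeholder, and this is not the parser used by the BioThings SDK.

```python
import io
import json


def write_ndjson(docs, handle):
    """Write one JSON document per line."""
    for doc in docs:
        handle.write(json.dumps(doc) + "\n")


def read_ndjson(handle):
    """Read one JSON document per line and reject duplicate _id values."""
    seen = set()
    docs = []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        doc = json.loads(line)
        doc_id = doc["_id"]
        if doc_id in seen:
            raise ValueError(f"duplicate _id {doc_id!r} on line {line_number}")
        seen.add(doc_id)
        docs.append(doc)
    return docs


sample_docs = [
    {"_id": "PMC5395363", "associatedWith": {"genes": [1000, 207, 208, 51384]}},
    # Placeholder ID for illustration only.
    {"_id": "PMC0000000", "associatedWith": {"genes": [1033, 84]}},
]

buffer = io.StringIO()
write_ndjson(sample_docs, buffer)
buffer.seek(0)
loaded = read_ndjson(buffer)
```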

andrewsu commented 4 years ago

In an email back in October 2019, I also asked

would it be possible to capture the link directly to the image? e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5729535/figure/fig1/

to which Alex replied

Yes, we can generate links directly to the image using this URL pattern: https://www.ncbi.nlm.nih.gov/pmc/articles/<PMCID>/bin/

I think whatever reasonable way you propose of adding that information to the JSON document would be fine.

ariutta commented 4 years ago

@kevinxin90 and @andrewsu, how would you prefer that we include information on which figure the genes are mentioned in? For example, PMC3955956 has a Figure 6 with filename nihms531061f6.jpg.

ariutta commented 4 years ago

I think whatever reasonable way you propose of adding that information to the JSON document would be fine.

Do we still want to group the results by PMC ID?

kevinxin90 commented 4 years ago

I assume one PMC ID corresponds to one figure, right? If that's the case, we could just structure the JSON output as below: {"_id": "PMC5395363", "associatedWith": { "genes": [1000, 207, 208, 51384], "figure": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5395363/figure/nihm531061f6.jpg/" } }

ariutta commented 4 years ago

One PMC ID corresponds to one paper, so there can be multiple ~~PMC IDs per figure~~ figures per PMC ID. I exported an initial draft of the results as this table. You can preview a sample in the last table in this notebook.

kevinxin90 commented 4 years ago

@ariutta I'm a little bit confused here. Could you clarify why a figure would correspond to multiple papers? If a figure might be related to multiple PMC IDs, and each PMC ID is related to all the genes in that figure, we could then structure the JSON output as below: {"_id": "pcbi.1000512.g004", "associatedWith": {"genes": [1000, 207, 208, 51384], "pmc": ["PMC2735650", "PMC2735651"], "figure_url": ["https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735650/bin/pcbi.1000512.g004.jpg", "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2735651/bin/pcbi.1000512.g004.jpg"]}}.

kevinxin90 commented 4 years ago

@ariutta Hi Anders, I wanted to follow up and see whether there are any updates on this issue. Thanks!

ariutta commented 4 years ago

You're right, I reversed them! It should be multiple figures per paper.

ariutta commented 4 years ago

How about this format?

{"_id": "PMC5395363__nihm531061f6",
 "associatedWith": {
  "genes": ["1000", "207", "208", "51384"],
  "pmc": "PMC5395363",
  "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5395363/bin/nihm531061f6.jpg" }
}
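A minimal sketch of how a document in this format might be assembled from a PMC ID, a figure filename, and a gene list. The `make_figure_doc` helper and the URL template constant are hypothetical names; the URL pattern itself follows the `/bin/` pattern quoted earlier in the thread.

```python
import os

# URL pattern taken from the thread; the constant name is an assumption.
BIN_URL_TEMPLATE = "https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/bin/{filename}"


def make_figure_doc(pmcid, figure_filename, genes):
    """Build one figure-level document keyed by '<PMC ID>__<figure file stem>'."""
    stem, _extension = os.path.splitext(figure_filename)
    return {
        "_id": f"{pmcid}__{stem}",
        "associatedWith": {
            # Gene IDs are stored as strings, as in the proposed format.
            "genes": [str(gene) for gene in genes],
            "pmc": pmcid,
            "figureUrl": BIN_URL_TEMPLATE.format(pmcid=pmcid, filename=figure_filename),
        },
    }


doc = make_figure_doc("PMC5395363", "nihm531061f6.jpg", [1000, 207, 208, 51384])
```

Combining the PMC ID with the figure filename keeps `_id` unique even when one paper contributes several figures.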

kevinxin90 commented 4 years ago

@ariutta Hi Anders, thanks for the clarification. That structure looks great to me! We can proceed with setting up an API for it once the JSON file is ready. Thank you!

AlexanderPico commented 4 years ago

Add "pmid" as well.

AlexanderPico commented 4 years ago

Note @ariutta: the example figureUrl is one of the ones that does not resolve. I'm assuming it's just a bad example?

AlexanderPico commented 4 years ago

@kevinxin90 We can give you a smaller JSON on Monday and then provide a much larger one a week later. Or we could just wait and provide the larger one later. Do you have a preference? Would it just be "busy work" to process our JSON twice, or would you prefer to solidify the path on an early version of the file and then re-process again later?

kevinxin90 commented 4 years ago

@AlexanderPico Hi Alex, sorry I just saw this thread. It's fine to provide the final JSON when everything is ready. The parser should be very straightforward.

ariutta commented 4 years ago

Here's a newline delimited JSON file with the format from my earlier comment: https://www.dropbox.com/s/cbtamwk9u0xdhgo/pfocr_biothings.ndjson?dl=0

These are from figures that our system has identified as pathways and that had at least three recognized genes in our OCR process.

ariutta commented 4 years ago

@AlexanderPico I'll work on adding the PMID soon, but I wasn't able to get it into this file.

@kevinxin90 we will have additional results coming, probably next week.

ariutta commented 4 years ago

@kevinxin90 for _id, I left the .jpg extension on. Also, I made genes strings instead of numbers. If you want either of these changed, just let me know.

ariutta commented 4 years ago

@andrewsu and @kevinxin90, I'm going to mark this as done. Here are summary stats for our exported file pfocr_biothings.ndjson (the same file I mentioned above):

figure source: pfocr20191102_93k. These were the figures we collected on 2019-11-02, limited to the top 93,000 (as sorted by PMC relevance score) from our figure query.
OCR: Google Cloud Vision (GCV), performed in January
Image classification: GCV AutoML model trained, validated and tested on a set of 10k figures manually labeled as pathway or other. Performed in February. Yielded a pathway score between 0 (not a pathway) and 1 (is a pathway).

The results were further limited to the 33,179 figures that both:

had a pathway score greater than 0.5
mentioned three or more recognized human genes

From these figures, we recognized:

736,260 total genes
12,201 unique genes

If you'd like, we can provide you with additional hits we got when we removed the limitation of "top 93,000 (as sorted by PMC relevance score)".
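The two inclusion criteria in the summary above can be expressed as a simple filter. This is a sketch only: the record layout, field names, and toy values are hypothetical, and it assumes "three or more recognized genes" means three or more distinct gene IDs per figure, which the thread does not spell out.

```python
# Hypothetical per-figure records; the field names and values are assumptions.
figures = [
    {"figure_id": "figA", "pathway_score": 0.92, "genes": ["1000", "207", "208"]},
    {"figure_id": "figB", "pathway_score": 0.40, "genes": ["1000", "207", "208", "84"]},
    {"figure_id": "figC", "pathway_score": 0.75, "genes": ["207", "84"]},
    {"figure_id": "figD", "pathway_score": 0.51, "genes": ["207", "84", "1033"]},
]


def keep_figure(figure, min_score=0.5, min_genes=3):
    """Keep figures scored above min_score that mention at least min_genes distinct genes."""
    has_score = figure["pathway_score"] > min_score
    has_genes = len(set(figure["genes"])) >= min_genes
    return has_score and has_genes


kept = [figure for figure in figures if keep_figure(figure)]

# Summary counts in the same spirit as the totals reported above.
total_genes = sum(len(figure["genes"]) for figure in kept)
unique_genes = len({gene for figure in kept for gene in figure["genes"]})
```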

andrewsu commented 4 years ago

I think the ball is now in Kevin's court, but I don't think this issue should be closed until we actually complete the stated milestone -- i.e., the creation of the BioThings API to serve PFOCR data. I know Kevin is working on this -- it should be done in the next week or so.

(Minor issue, but does the latest version have PMIDs? Would be a nice-to-have, but obviously the PMCIDs will suffice too...)

AlexanderPico commented 4 years ago

Roger that! Anders has PMIDs in the pipeline. They will be a part of all future depositions. We wanted to give you the file ASAP to meet the milestone. We can update it later next week if you think PMIDs will be helpful for the Segment 1 demo. Otherwise, we'll save it for the next update, which will include a ton more content early in Segment 2.

andrewsu commented 4 years ago

I think PMIDs can wait until segment 2. Once Kevin creates the API based on the dropbox file linked above, we'll ask you to check it out. After we're all happy with it, we can close this ticket...

kevinxin90 commented 4 years ago

@andrewsu @AlexanderPico @ariutta The PFOCR API is live now; please check it out. Here are some query examples:

Query for a specific figure: https://pending.biothings.io/pfocr/geneset/PMC100008__mb2411709009.jpg
Query for a specific gene: https://pending.biothings.io/pfocr/query?q=associatedWith.genes:107
Query for a specific PMC ID: https://pending.biothings.io/pfocr/query?q=associatedWith.pmc:PMC2494582
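The three query shapes above could be generated programmatically roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the base URL is the one quoted in this comment and may no longer resolve, and `fetch_json` is shown for completeness but not called.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URL as quoted in the thread; availability is not guaranteed.
BASE_URL = "https://pending.biothings.io/pfocr"


def figure_url(figure_id):
    """URL for a single figure document, looked up by its _id."""
    return f"{BASE_URL}/geneset/{quote(figure_id, safe='')}"


def query_url(field, value):
    """URL for a fielded query such as associatedWith.genes:107."""
    return f"{BASE_URL}/query?" + urlencode({"q": f"{field}:{value}"}, safe=":")


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


by_figure = figure_url("PMC100008__mb2411709009.jpg")
by_gene = query_url("associatedWith.genes", 107)
by_pmc = query_url("associatedWith.pmc", "PMC2494582")
```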

AlexanderPico commented 4 years ago

Cool! I verified a handful of queries. It looks good to me.

ariutta commented 4 years ago

Looks good!

Minor question: if we're going to add a pmid field, should we rename the pmc field to pmcid?

andrewsu commented 4 years ago

Nice! I'll also note that multi-gene queries work, e.g., https://pending.biothings.io/pfocr/query?q=associatedWith.genes:27115%20AND%20associatedWith.genes:55811. I think this will work out perfectly for how we envisioned a second layer of BTE prioritization. For example, suppose for a given query, BTE comes back with ~100 reasoning chains (expressed as paths of biomedical entities). We could query PFOCR to look for pathway figures that contain several of those entities. Right now the API is limited to genes, but later we will expand to other entity types as well...

Before the demo, we should try to flesh out an example like this in one of our example notebooks (PREDICT_demo, EXPLAIN_demo, or tidbit 2).
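One way the multi-gene idea might look in code: build an `AND` query from several gene IDs, then rank figure documents by how many of the query entities they share. This is an illustrative sketch, not BTE's actual prioritization logic; the helper names, the `min_shared` threshold, and the toy documents are all hypothetical.

```python
from urllib.parse import quote

QUERY_ENDPOINT = "https://pending.biothings.io/pfocr/query"


def multi_gene_query_url(gene_ids):
    """Build a query URL requiring every listed gene to appear in a figure."""
    clauses = [f"associatedWith.genes:{gene}" for gene in gene_ids]
    return f"{QUERY_ENDPOINT}?q=" + quote(" AND ".join(clauses), safe=":")


def rank_figures_by_overlap(figures, entity_ids, min_shared=2):
    """Rank figure documents by the number of query entities they contain."""
    wanted = {str(entity) for entity in entity_ids}
    scored = []
    for figure in figures:
        genes = {str(gene) for gene in figure["associatedWith"]["genes"]}
        shared = wanted & genes
        if len(shared) >= min_shared:
            scored.append((len(shared), figure["_id"], sorted(shared)))
    # Highest overlap first; ties broken by _id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


# Toy documents in the format used by the API; IDs are placeholders.
toy_figures = [
    {"_id": "fig1", "associatedWith": {"genes": ["27115", "55811", "207"]}},
    {"_id": "fig2", "associatedWith": {"genes": ["27115"]}},
    {"_id": "fig3", "associatedWith": {"genes": ["55811", "207"]}},
]

url = multi_gene_query_url([27115, 55811])
ranking = rank_figures_by_overlap(toy_figures, [27115, 55811, 207])
```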

ariutta commented 4 years ago

@AlexanderPico @kevinxin90 I assume the answer to this question is "no, we'll use pmc and pmid":

Minor question: if we're going to add a pmid field, should we rename the pmc field to pmcid?

kevinxin90 commented 4 years ago

@ariutta Hi Anders, my bad. I missed this thread. I think "pmc" is fine. Normally, we label "pmid" as "pubmed". Thanks!

AlexanderPico commented 4 years ago

@andrewsu APIs confirmed. Is this one ready to close? Do you need anything on this one for the demo?

andrewsu commented 4 years ago

I think we are good, please close it out! (We are trying to put together a notebook that demonstrates the use of the PFOCR API with BTE for the demo next week. If anyone on your end has bandwidth to work on that, let me know!)

wikipathways / pathway-figure-ocr

Establish a BioThings API for access to PFOCR data #12