Continue automation of PFOCR

AlexanderPico commented 4 years ago

For example, developing machine learning methods to distinguish figures of pathway diagrams from other figure types returned by PMC pathway image searches.

andrewsu commented 4 years ago

The other part of this ticket will be automating the update of the API at https://pending.biothings.io/pfocr with whatever your latest PFOCR output is. We are working out the mechanism for this now, but basically you should plan on automatically updating your JSON file at some URL (maybe a GitHub raw link, maybe a Google Drive or S3 link), and we will poll that file on a mutually agreed schedule to check for a data update. If the file has changed, that will trigger a data processing pipeline, which would then just need a manual sanity check before publishing to the live API.
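To make the polling idea concrete, here is a minimal sketch of the "has the file changed?" check, assuming the comparison is done on a SHA-256 digest of the raw bytes. The function names and the `fetch` callable are hypothetical, not part of any BioThings code; in practice `fetch` could wrap `urllib.request.urlopen` for whatever URL we agree on.

```python
import hashlib
from typing import Callable, Optional, Tuple


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def check_for_update(
    fetch: Callable[[], bytes], last_digest: Optional[str]
) -> Tuple[bool, str]:
    """Fetch the file once and report whether it differs from the last seen copy.

    `fetch` is any zero-argument callable returning the file bytes
    (hypothetical; e.g. a wrapper around urllib.request.urlopen).
    Returns (changed, new_digest); store new_digest for the next poll.
    """
    digest = content_digest(fetch())
    return digest != last_digest, digest


# Example with an in-memory stand-in for the remote file:
changed, digest = check_for_update(lambda: b'{"records": []}', None)
```

The first poll always reports a change because there is no stored digest yet; later polls only trigger the pipeline when the bytes actually differ.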

AlexanderPico commented 4 years ago

@ariutta Can you add Andrew's piece (above) to the plan you were going to add here?

ariutta commented 4 years ago

The item @AlexanderPico mentioned is now working:

> ... developing machine learning methods to distinguish figures of pathway diagrams from other figure types returned by PMC pathway image searches.

For the trigger to let you know the file has been updated, we could use GitHub webhooks. With webhooks, you wouldn't need to poll for updates. Rather, you would subscribe to the repo in order to receive push events notifying you of changes.
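As a rough sketch of what the receiving side of a webhook could look like: GitHub signs each delivery with an HMAC-SHA256 of the request body in the `X-Hub-Signature-256` header, and a push payload lists the files each commit added or modified. The file path `pfocr.json` and the function names below are assumptions for illustration only, and the HTTP server that would call these functions is left out.

```python
import hashlib
import hmac
import json
from typing import Optional

DATA_PATH = "pfocr.json"  # hypothetical path of the data file in the repo


def signature_is_valid(secret: bytes, body: bytes, header: Optional[str]) -> bool:
    """Check the X-Hub-Signature-256 header sent with a webhook delivery."""
    if not header or not header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header.encode())


def push_touches_data(body: bytes, path: str = DATA_PATH) -> bool:
    """Return True if any commit in a push payload added or modified `path`."""
    payload = json.loads(body)
    for commit in payload.get("commits", []):
        if path in commit.get("added", []) or path in commit.get("modified", []):
            return True
    return False
```

A handler would first reject deliveries that fail `signature_is_valid`, then start a build only when `push_touches_data` is true, so unrelated commits to the repo do not trigger anything.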

We would need to make a system to handle cases when we add more info to our data, e.g., we added chemical and disease mentions in this latest reporting period. We may be able to get away with just continuing to use JSON, and communicating with each other whenever the format will change. Another idea would be to use a format like Protocol Buffers in order to handle schema evolution. Other similar options are Avro, Thrift and FlatBuffers.
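If we stay with plain JSON, one way to tolerate added fields is for the reader to fill in defaults for anything newer than a given file. The sketch below assumes a top-level `records` list; the field names (`figure_id`, `genes`, `chemicals`, `diseases`) are hypothetical and may not match our actual output.

```python
import json
from typing import Any, Dict, List

# Fields added after the first release, with the default used when a record
# predates them. All names here are hypothetical.
OPTIONAL_DEFAULTS: Dict[str, Any] = {"chemicals": [], "diseases": []}
REQUIRED = ("figure_id", "genes")


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults for optional fields; fail loudly on missing required ones."""
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")
    # Copy each default so records never share the same list object.
    normalized = {key: list(default) for key, default in OPTIONAL_DEFAULTS.items()}
    normalized.update(record)  # unknown extra fields pass through untouched
    return normalized


def load_records(text: str) -> List[Dict[str, Any]]:
    """Parse a JSON document of the form {"records": [...]}."""
    document = json.loads(text)
    return [normalize_record(r) for r in document.get("records", [])]
```

This covers additive changes only. Renaming or removing a field would still need coordination, which is where a schema-based format would start to pay off.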

andrewsu commented 4 years ago

We have a variety of systems for triggering new builds of our BioThings APIs, of whose details I have remained blissfully ignorant. I'll pass along these ideas to the team...

Is the upstream automation done then? i.e., is there a JSON file that is being updated on a regular basis? If so, we should complete the automation on our side to push out new versions of the PFOCR API...

ariutta commented 4 years ago

> We have a variety of systems of triggering new builds...

Yeah, even just for GitHub, there are multiple options! Let me know what the team prefers, and I'll follow suit.

> Is the upstream automation done then? i.e., is there a JSON file that is being updated on a regular basis? If so, we should complete the automation on our side to push out new versions of the PFOCR API...

Yes and no. Yes, we do have a JSON file that is being updated on a regular basis, but only in the sense that we are extracting more and more information from the same OCR text. First we identified genes, and then, using pathway enrichment analysis, we identified diseases. More recently, we've identified chemicals and diseases as actual mentions in the OCR text.

We haven't run OCR on a new batch of figures in many months. We completed the step Alex mentioned originally, but then we started talking with a potential collaborator who said they could help with the automation, so we switched our focus to processing the OCR text. @AlexanderPico, do you think that collaboration is worth waiting for, or should we move forward on our own with running OCR on batches of new figures?

AlexanderPico commented 4 years ago

It's still premature to completely automate all upstream steps for the following reasons:

We have 3 or 4 papers still to write about the current data in hand. These papers require their own coding and analysis work. They aim to demonstrate the value of the extracted information. They may reveal important changes needed in upstream steps.
The scope of the upstream content is not well defined. It's based on an arbitrary PubMed image query. We should reassess this before automating.
The source of the upstream content is weak. We are relying on an HTML scrape of PubMed results. We should identify a better source, for example from existing efforts (like Meta) or resources (like NLM). If we demonstrate the value, then it should be trivial to collaborate to find a better source.
Funding is still uncertain. The WikiPathways grant ends in 2021, but we have 3 really good applications that touch on PFOCR (2 are already submitted). I'd like to know where the future support is coming from before we dive into a major automation project.

So, while continuing to automate some of the "last mile" steps makes perfect sense (e.g., generating JSON updates and converting to GMT and other formats), we should hold off on complete upstream automation.
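For the JSON-to-GMT step, a minimal sketch could look like the following. GMT is a tab-delimited format with one gene set per line: a name, a description, then the genes. The record keys (`figure_id`, `title`, `genes`) are hypothetical stand-ins for whatever our JSON actually uses.

```python
from typing import Any, Dict, Iterable, List


def to_gmt_line(set_name: str, description: str, genes: Iterable[str]) -> str:
    """Format one gene set as a GMT row: name, description, then the genes."""
    unique = list(dict.fromkeys(genes))  # drop duplicates, keep first-seen order
    return "\t".join([set_name, description, *unique])


def records_to_gmt(records: List[Dict[str, Any]]) -> str:
    """Build GMT text from records; records with no genes are skipped."""
    lines = [
        to_gmt_line(r["figure_id"], r.get("title", "na"), r["genes"])
        for r in records
        if r["genes"]
    ]
    return "\n".join(lines) + ("\n" if lines else "")


# Example: one figure with a duplicated gene symbol.
gmt_text = records_to_gmt(
    [{"figure_id": "fig1", "title": "Wnt signaling", "genes": ["WNT1", "CTNNB1", "WNT1"]}]
)
```

Skipping empty gene sets and de-duplicating symbols are design choices made here for illustration; whether figures with no recognized genes should appear in the GMT output is something we would need to decide.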

ariutta commented 4 years ago

Let's break it down like this:

Automate the system so that when we update our results, the update is reflected in BTE.
Run our semi-automated PFOCR pipeline for newer figures. @AlexanderPico, do we want to create a schedule for how often we do this?

AlexanderPico commented 4 years ago

Yes, that should be the full focus and scope of this issue/task. Are there any remaining steps to automate within this scope? Let's detail those here and then close the issue when complete.
Scheduling the next update is on hold until further notice.

wikipathways / pathway-figure-ocr

Continue automation of PFOCR #17