wikipathways / pathway-figure-ocr

Extracting gene sets from published pathway figures
Apache License 2.0

Establish infrastructure to automate the PFOCR pipeline #14

Closed AlexanderPico closed 4 years ago

AlexanderPico commented 4 years ago

We will establish infrastructure to automate the PFOCR pipeline, making it better able to keep up with the stream of new data being generated, while also back-filling with data from past publications. We will develop a system for automating the construction of the lexicon used in the named entity recognition steps. We will automate the regular normalization and deposition of the PFOCR data into our newly created Translator API.
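For reference, here is a minimal sketch of the kind of lexicon construction we could automate, assuming a local copy of an HGNC export with `symbol`, `alias_symbol`, and `prev_symbol` columns; the file name and column handling are illustrative, not the final implementation:

```python
import csv
from collections import defaultdict

def build_gene_lexicon(hgnc_tsv_path):
    """Map every symbol, alias, and previous symbol to its approved HGNC symbol."""
    lexicon = defaultdict(set)
    with open(hgnc_tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            approved = row["symbol"]
            lexicon[approved].add(approved)
            # Multi-valued columns are assumed to be pipe-delimited in the export.
            for col in ("alias_symbol", "prev_symbol"):
                for alt in filter(None, row.get(col, "").split("|")):
                    lexicon[alt].add(approved)
    return lexicon

# lexicon = build_gene_lexicon("hgnc_complete_set.txt")  # placeholder file name
```

Mapping aliases and previous symbols back to the approved symbol is what would let the NER step normalize raw OCR tokens.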

AlexanderPico commented 4 years ago

@ariutta Can you make a flow diagram in draw.io that depicts the major steps in our PFOCR pipeline? We can then annotate (e.g., with fill color) which steps are automated and which are manual. Maybe we can also provide a rough percentage estimate of how much of the pipeline is automated by the Segment 1 deadline.

AlexanderPico commented 4 years ago

@ariutta Another aspect of this is "establishing infrastructure" to automate in the future. Do you have ideas on tools we might want to use for this project to monitor scripts and future automation, e.g., Jenkins? Part of satisfying this Segment 1 aim would simply be to identify and prototype that sort of infrastructure tooling.
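One low-effort way to prototype this would be a thin wrapper that runs each pipeline stage, logs the outcome, and exits non-zero on failure, so that whatever scheduler we pick (cron now, Jenkins or similar later) has something to monitor. A rough sketch, with the stage commands below as placeholders:

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

# Placeholder stage commands; the real pipeline scripts would go here.
STAGES = [
    ("fetch_figures", ["python", "fetch_figures.py"]),
    ("classify", ["python", "classify.py"]),
    ("ocr", ["python", "ocr.py"]),
    ("extract_entities", ["python", "extract_entities.py"]),
]

def run_pipeline():
    for name, cmd in STAGES:
        logging.info("starting stage %s", name)
        result = subprocess.run(cmd)
        if result.returncode != 0:
            logging.error("stage %s failed with exit code %d", name, result.returncode)
            sys.exit(result.returncode)
        logging.info("stage %s finished", name)

if __name__ == "__main__":
    run_pipeline()
```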

AlexanderPico commented 4 years ago

PFOCR Pipeline for BTE

ariutta commented 4 years ago

As illustrated in the figure above, we propose a pipeline for performing OCR on figures and processing the results to identify entities of interest, such as genes. The items marked as Container can be packaged and deployed using Docker, with inter-container communication handled by an RPC system such as gRPC. The trigger to begin collecting figures will be chosen to be consistent with the rest of the BTE system, whether that is periodic (e.g., a nightly cron job) or on-demand (e.g., a message queue). The subsequent workflow will be handled by a system like Pachyderm.
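A rough sketch of what the trigger could look like, supporting both styles from a single entry point. This assumes RabbitMQ via pika purely as a stand-in for whatever queue the BTE system actually uses; the queue name and the `collect_figures()` body are placeholders:

```python
import sys

import pika  # assumed stand-in for whatever message queue BTE uses

QUEUE = "pfocr-figures"  # placeholder queue name

def collect_figures(batch_id=None):
    # Placeholder for the real figure-collection step.
    print(f"collecting figures for batch {batch_id!r}")

def run_once():
    """Entry point for a periodic trigger such as a nightly cron job."""
    collect_figures()

def run_queue_consumer():
    """Entry point for an on-demand trigger driven by a message queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE, durable=True)

    def on_message(ch, method, properties, body):
        collect_figures(batch_id=body.decode())
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
    channel.start_consuming()

if __name__ == "__main__":
    if "--consume" in sys.argv:
        run_queue_consumer()
    else:
        run_once()
```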

In the classify step, we will use a machine learning model pre-trained on our collection of manually labeled figures (pathway vs. not-pathway). The model will rely on computer vision to assign labels to figures and may additionally use text, such as the figure caption. Figures classified as pathway will be sent through an OCR processor to extract raw text, and our lexicon(s) and post-processing algorithms will then extract entities such as genes from that text. Finally, we will export our results in the formats requested by third-party consumers/hosts of our data, such as gene sets per figure in GMT format for Enrichr.
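To make the last two steps concrete, here is a small sketch of matching OCR tokens against a lexicon and writing one gene set per figure in GMT format (one tab-delimited line per set: name, description, then gene symbols). The tokenization and the toy lexicon are placeholders for the real lexicon(s) and post-processing algorithms:

```python
import re

def extract_genes(ocr_text, lexicon):
    """Return the approved symbols for any OCR tokens found in the lexicon."""
    # Naive tokenization; the real pipeline applies additional post-processing rules.
    tokens = set(re.findall(r"[A-Za-z0-9-]+", ocr_text))
    genes = set()
    for token in tokens:
        genes.update(lexicon.get(token, set()))
    return genes

def write_gmt(figure_gene_sets, path):
    """Write one GMT line per figure: name, description, then the gene symbols."""
    with open(path, "w") as f:
        for figure_id, (description, genes) in figure_gene_sets.items():
            f.write("\t".join([figure_id, description, *sorted(genes)]) + "\n")

# Toy example: lexicon maps OCR tokens to approved symbols, figure IDs are made up.
lexicon = {"TP53": {"TP53"}, "p53": {"TP53"}, "KRAS": {"KRAS"}}
genes = extract_genes("p53 -> KRAS signaling", lexicon)
write_gmt({"PMC123456__fig1": ("Example pathway figure", genes)}, "pfocr.gmt")
```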