wikipathways / pathway-figure-ocr

Extracting gene sets from published pathway figures
Apache License 2.0

Upgrade of pathway figure collection, processing and deposition pipeline #20

Open AlexanderPico opened 3 years ago

AlexanderPico commented 3 years ago

We plan to make incremental improvements to the processing of pathway figure content and annotations. First, we will modularize the processing pipeline and define input/output interfaces for each step. For example, one module will take any image file, along with an optional PMCID, as input and perform the OCR and processing required to generate a standard output of OCR-extracted text and metadata. An independent module will take this standardized content as input and perform normalization, transformation, matching and other processing steps to generate a standard output of identified genes, chemicals and diseases, along with metadata. As part of the modularization and refactoring, we will also increase the automation of the pipeline, focusing initially on command-line interface implementations that can later be called and scheduled programmatically.
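
To make the proposed split concrete, here is a rough sketch of how the two modules and their interface might look. It is not the project's actual code: every name in it (`OcrOutput`, `run_ocr_step`, `run_matching_step`, the JSON field names, the lexicon format) is a hypothetical placeholder, and the OCR engine is passed in as a plain callable so that no particular OCR service is assumed.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class OcrOutput:
    """Hypothetical standard output of the OCR module."""
    image_path: str
    pmcid: Optional[str]
    words: List[str]
    metadata: Dict[str, str] = field(default_factory=dict)


def run_ocr_step(image_path: str,
                 ocr_engine: Callable[[str], str],
                 pmcid: Optional[str] = None) -> OcrOutput:
    """Module 1: image file (+ optional PMCID) -> OCR text and metadata.

    `ocr_engine` is any function mapping an image path to raw text;
    which engine the real pipeline uses is not assumed here.
    """
    raw_text = ocr_engine(image_path)
    return OcrOutput(
        image_path=image_path,
        pmcid=pmcid,
        words=raw_text.split(),
        metadata={"engine": getattr(ocr_engine, "__name__", "unknown")},
    )


def normalize(token: str) -> str:
    """Uppercase a token and drop anything that is not A-Z or 0-9."""
    return re.sub(r"[^A-Z0-9]", "", token.upper())


def run_matching_step(ocr_json: str, lexicon: Dict[str, str]) -> dict:
    """Module 2: standardized OCR JSON -> identified entities.

    `lexicon` maps a normalized symbol to an entity type
    ("gene", "chemical" or "disease"); a toy stand-in for real lexicons.
    """
    ocr = json.loads(ocr_json)
    entities = []
    for word in ocr["words"]:
        key = normalize(word)
        if key in lexicon:
            entities.append({"text": word, "symbol": key, "type": lexicon[key]})
    return {
        "image_path": ocr["image_path"],
        "pmcid": ocr["pmcid"],
        "entities": entities,
    }


def stub_ocr(image_path: str) -> str:
    """Stand-in OCR engine that returns fixed text for illustration."""
    return "Tp53 -> mdm2, glucose"


# The JSON string is the only thing passed between the two steps.
ocr_payload = json.dumps(asdict(run_ocr_step("figure.jpg", stub_ocr, pmcid=None)))
toy_lexicon = {"TP53": "gene", "MDM2": "gene", "GLUCOSE": "chemical"}
result = run_matching_step(ocr_payload, toy_lexicon)
print(json.dumps(result, indent=2))
```

Because the second step only consumes the serialized output of the first, either function could later be wrapped in its own command-line entry point and scheduled separately without changing the other.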

AlexanderPico commented 2 years ago

@ariutta Can you bullet point some of the upgrades performed over the past year? Also, can you add bullet points for things to upgrade in the next round?

ariutta commented 2 years ago

Completed:

Upcoming upgrades: