wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License
560 stars 193 forks

Information regarding OCR process being used in this code #67

Open kontact2pankaj opened 3 years ago

kontact2pankaj commented 3 years ago

Hi @wenwenyu, please provide some information on the OCR module as well; that would be really helpful. How are you extracting text from the test images? Are you using Tesseract or some other API? Is there any way to experiment with different OCR APIs in this code?

wenwenyu commented 3 years ago

Hi, in practice we used a modified PSENet for text detection and MASTER for recognition, trained on task-specific data in order to achieve the best possible performance.

We didn't use other public methods. General-purpose tools such as Tesseract didn't meet our performance requirements.

This repo is decoupled from the OCR system. In the inference phase, it only needs the results of the OCR system. In the training phase, however, there are two ways to train.

One way is to use the human-annotated labels, which include boxes, transcripts, and the corresponding entities. But this approach leaves a gap between the human-annotated labels used in training and the OCR output used in inference, because the human-annotated boxes don't match the OCR boxes exactly due to detection errors.

To reduce this gap or inconsistency, we actually use the other way. We combine the human-annotated labels with the results of the OCR system to get OCR system-oriented IOB labels for training. We first calculate the overlap between the human-annotated boxes and the OCR results, then use a simple rule to decide the final IOB label: a box segment is considered to contain an entity if its overlap is bigger than a manually set threshold. This part of the code has not been made public.

Back to your question: different OCR APIs can be used in experiments. But the final performance is decided jointly by the amount and difficulty of the data, the performance of the OCR system, and the training strategy.
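Since the overlap-based labeling code is not public, the following is only a minimal sketch of how such a rule could look, not the actual implementation. It assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`, measures overlap as the fraction of the OCR box covered by an annotated box, uses a hypothetical threshold of 0.5, and emits character-level IOB tags. The function names `overlap_ratio`, `assign_entity`, and `iob_tags` are made up for illustration.

```python
from typing import List, Tuple

# Assumption: axis-aligned boxes as (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def overlap_ratio(ocr_box: Box, gt_box: Box) -> float:
    """Fraction of the OCR box's area that is covered by the annotated box."""
    inter_w = min(ocr_box[2], gt_box[2]) - max(ocr_box[0], gt_box[0])
    inter_h = min(ocr_box[3], gt_box[3]) - max(ocr_box[1], gt_box[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    ocr_area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return (inter_w * inter_h) / ocr_area if ocr_area > 0 else 0.0


def assign_entity(
    ocr_box: Box,
    annotations: List[Tuple[Box, str]],
    threshold: float = 0.5,  # hypothetical value; the real threshold is not stated
) -> str:
    """Return the entity of the best-overlapping annotated box, or "O" if none
    exceeds the threshold."""
    best_entity, best_ratio = "O", 0.0
    for gt_box, entity in annotations:
        ratio = overlap_ratio(ocr_box, gt_box)
        if ratio > threshold and ratio > best_ratio:
            best_entity, best_ratio = entity, ratio
    return best_entity


def iob_tags(transcript: str, entity: str) -> List[str]:
    """Character-level IOB tags for one OCR box segment (an assumed granularity)."""
    if not transcript:
        return []
    if entity == "O":
        return ["O"] * len(transcript)
    return [f"B-{entity}"] + [f"I-{entity}"] * (len(transcript) - 1)


if __name__ == "__main__":
    annotations = [((0, 0, 10, 8), "company")]
    entity = assign_entity((0, 0, 10, 10), annotations)
    print(entity, iob_tags("ABC", entity))
```

With a rule like this, swapping in a different OCR system only means regenerating the training labels from that system's boxes and transcripts. The threshold and the overlap measure (for example IoU instead of coverage) would need tuning per OCR system.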