Testing with out-of-sample

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

https://arxiv.org/abs/2004.07464

MIT License


Testing with out-of-sample #48

Closed: karthikesh2020 closed this issue 3 years ago

karthikesh2020 commented 3 years ago

While testing with out-of-sample images, I tried creating boxes_and_transcripts files for the new images using Tesseract and prepared the tsv files. But the prediction output is not as good as for the in-sample test images.

Could you please let us know how to create the bounding box files (tsv) for prediction with this model?

Also, can we predict without tsv files?

Note: In Tesseract I tried both WORD and TEXTLINE bounding boxes.

tengerye commented 3 years ago

@karthikesh2020 How accurate is Tesseract on your cases? A tool to recognize text from images is necessary for our model.

karthikesh2020 commented 3 years ago

Firstly, thank you @tengerye for the fast response.

Tesseract's accuracy is good for extracting the text, but its bounding boxes come at the word, line, paragraph, or block level. For the current model, I think the bounding box info from Tesseract is not good.

Can you suggest a tool or model for bounding box recognition that is suitable for this model?

Is the tsv file mandatory for prediction?

tengerye commented 3 years ago

@karthikesh2020 Yes, the model predicts on the texts (bounding boxes included) provided by OCR. Currently, the texts are either provided with the (open) dataset or come from a private (self-made) tool. I suggest starting by searching for similar tools from big companies like Google, Facebook, etc.

All I can tell you is the OCR results make a huge difference.
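For readers following along, here is a minimal sketch of one way to turn word-level Tesseract output into line-level rows with bounding boxes. It is not the maintainers' tooling. It assumes the input is the dictionary returned by `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`, and it assumes each output row is laid out as `index,x1,y1,x2,y2,x3,y3,x4,y4,transcript` with the corners listed clockwise from the top-left; check that layout against the repository README before relying on it. The function name, the `min_conf` parameter, and the 1-based index are hypothetical choices made for illustration.

```python
from collections import defaultdict


def tesseract_words_to_line_rows(data, min_conf=0.0):
    """Merge word-level OCR entries into one box and transcript per text line.

    `data` is assumed to have the keys produced by pytesseract's
    image_to_data with Output.DICT: text, conf, block_num, par_num,
    line_num, left, top, width, height.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        word = text.strip()
        # Skip empty entries and words below the confidence threshold.
        if not word or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        left, top = data["left"][i], data["top"][i]
        right, bottom = left + data["width"][i], top + data["height"][i]
        lines[key].append((left, top, right, bottom, word))

    rows = []
    for index, key in enumerate(sorted(lines), start=1):
        words = sorted(lines[key], key=lambda w: w[0])  # left to right
        x1 = min(w[0] for w in words)
        y1 = min(w[1] for w in words)
        x2 = max(w[2] for w in words)
        y2 = max(w[3] for w in words)
        transcript = " ".join(w[4] for w in words)
        # Corners clockwise: top-left, top-right, bottom-right, bottom-left.
        rows.append(f"{index},{x1},{y1},{x2},{y1},{x2},{y2},{x1},{y2},{transcript}")
    return rows
```

Each returned string can be written as one line of a per-image file. Grouping by block, paragraph, and line number is only one possible choice; whether line-level or word-level boxes work better depends on how the training data was segmented, so the boxes used at prediction time should match that granularity.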

karthikesh2020 commented 3 years ago

Sure @tengerye, I will try it out with Tesseract OCR itself, which is Google's free version.

I labelled my dataset manually and extracted the tsv files.

Can I label it as shown below and train the model? I mean each word is assigned a label like vendor_name, etc.

[Image: Screenshot 2020-10-13 at 7]

tengerye commented 3 years ago

@karthikesh2020 Yes, and if there are too many words on each page, use GPUs with large memory and set the limit on the number of sentences to whatever you need.
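The per-box labelling discussed above can be sketched as follows. This is an illustration only: it assumes training rows are the prediction rows (`index,8 coordinates,transcript`) with an entity type appended as the last comma-separated field, which should be verified against the repository README. The function name, the `labels` mapping, and the `other` default for unlabelled boxes are hypothetical.

```python
def add_entity_labels(rows, labels, default="other"):
    """Append an entity type to each 'index,8 coords,transcript' row.

    `labels` maps a row index (int) to an entity name such as
    "vendor_name"; rows without an entry get `default`.
    """
    labelled = []
    for row in rows:
        # The index is the first comma-separated field. The transcript may
        # itself contain commas, so the label is appended at the very end.
        index = int(row.split(",", 1)[0])
        labelled.append(f"{row},{labels.get(index, default)}")
    return labelled
```

For example, `add_entity_labels(rows, {1: "vendor_name"})` tags the first box as `vendor_name` and leaves every other box with the default label.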

karthikesh2020 commented 3 years ago

Thank you so much @tengerye for the support. I will follow your suggestion.

ninjakx commented 3 years ago

@karthikesh2020 Were you able to solve the problem?