wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License
553 stars 191 forks

Preparing tsv file for custom dataset #69

Open prabhakar-sivanesan opened 3 years ago

prabhakar-sivanesan commented 3 years ago

Hi, first of all, thanks for the model. It worked very well on my custom dataset. However, I have two questions about preparing the tsv data for training.

1) When three words belong to one entity, do all three words have to be annotated separately in the tsv file, or do they have to be combined into one row?

For example, this is the data:

[image: sample]

In the shipping address column, Kothuri Sai Kiran is a name. My OCR model returns these three words separately as Kothuri, Sai and Kiran. So when preparing the tsv file, can I annotate them as three different rows like this,

18,1009,490,1198,490,1198,553,1009,553,Kothuri,name
19,1206,495,1501,495,1501,552,1206,552,Sai,name
20,1619,501,1707,501,1707,560,1619,560,Kiran,name

or do all three words have to be combined like this,

18,1009,490,1707,501,1707,560,1009,553,Kothuri Sai Kiran,name
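
For reference, the combined row can be derived mechanically from the three word-level rows. The following is only a minimal sketch, not part of PICK itself: it assumes each row has the layout `index,x1,y1,x2,y2,x3,y3,x4,y4,text,entity` with corners ordered top-left, top-right, bottom-right, bottom-left, and that the words are given left to right on one line. The helper names `parse_row` and `merge_rows` are hypothetical.

```python
def parse_row(line):
    """Split 'index,x1,y1,...,x4,y4,text,entity' into its parts.

    Assumed layout (hypothetical helper, not a PICK API).
    """
    head = line.split(",", 9)  # index, 8 coordinates, then "text,entity"
    index = int(head[0])
    coords = [int(v) for v in head[1:9]]
    # Split from the right so commas inside the text do not break parsing.
    text, entity = head[9].rsplit(",", 1)
    return index, coords, text.strip(), entity.strip()


def merge_rows(lines):
    """Merge word rows (ordered left to right) into one combined row."""
    parsed = [parse_row(line) for line in lines]
    entities = {p[3] for p in parsed}
    if len(entities) != 1:
        raise ValueError(f"rows carry different entity labels: {entities}")
    index, first, _, entity = parsed[0]
    last = parsed[-1][1]
    # Left corners come from the first word, right corners from the last word.
    coords = first[0:2] + last[2:4] + last[4:6] + first[6:8]
    text = " ".join(p[2] for p in parsed)
    return ",".join([str(index), *map(str, coords), text, entity])


words = [
    "18,1009,490,1198,490,1198,553,1009,553,Kothuri,name",
    "19,1206,495,1501,495,1501,552,1206,552,Sai,name",
    "20,1619,501,1707,501,1707,560,1619,560,Kiran,name",
]
print(merge_rows(words))
```

The sketch keeps the index of the first word for the merged row, so the remaining rows in the file would need to be renumbered if consecutive indices are expected.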

2) In the billing address column, I have the same name, Kothuri Sai Kiran. Is it possible to tag this name with the same entity "name"? In a nutshell, can I have multiple OCR text boxes tagged with one entity in a single image file?

Looking forward to your response.

ninjakx commented 3 years ago

@prabhakar-sivanesan: Is it detecting all the entities in your custom dataset? How many data samples did you pass to the model to get good results?

prabhakar-sivanesan commented 3 years ago

@ninjakx I trained for only 5 entities and used about 70 samples with a 70/30 split. I was able to get good results with that.
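
As an illustration of that split, here is a minimal sketch of dividing 70 samples 70/30. It assumes the split is between training and validation data; the sample names and the `split_samples` helper are hypothetical and not part of the PICK code.

```python
import random


def split_samples(names, train_fraction=0.7, seed=0):
    """Shuffle the sample names and split them into two lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample names standing in for the 70 annotated documents.
samples = [f"doc_{i:03d}" for i in range(70)]
train, val = split_samples(samples)
print(len(train), len(val))  # 49 21
```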

Nivedita-mahato2 commented 3 years ago

@prabhakar-sivanesan Hi Prabhakar, would you let me know which annotation tool you used for preparing the custom dataset?