wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License

Training on Docbank and the MAX_BOXES_NUM variable #66

Closed: sariabod closed this issue 3 years ago

sariabod commented 3 years ago

Thanks for providing the code for your wonderful paper. While digging through it to understand what's going on under the hood, I ran into something I do not understand.

A question about the variable MAX_BOXES_NUM: the documentation lists it as an optional field to change. Some DocBank documents contain more than 1000 boxes. Does this mean that if you leave the variable at its default of 70, the model uses only the first 70 boxes and ignores the rest? When I raise it to anywhere near 1000, training exhausts the VRAM and crashes. Any clarification on this variable and how it affects training and inference would be greatly appreciated.

Thanks!
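For a rough sense of why VRAM might run out, here is a back-of-the-envelope sketch. It assumes (this is not confirmed anywhere in this thread) that the graph module keeps at least one float32 tensor with an entry for every pair of boxes, so memory grows with the square of the box count. The function name and the `hidden_dim=512` value are hypothetical, chosen only for illustration.

```python
def pairwise_tensor_bytes(num_boxes: int,
                          hidden_dim: int = 512,    # hypothetical feature width
                          bytes_per_value: int = 4,  # float32
                          batch_size: int = 1) -> int:
    """Size in bytes of one (batch, N, N, hidden_dim) tensor.

    Assumption: the model stores a feature vector for every ordered
    pair of boxes. This is an illustration, not a measurement of PICK.
    """
    return batch_size * num_boxes * num_boxes * hidden_dim * bytes_per_value


if __name__ == "__main__":
    for n in (70, 1000):
        megabytes = pairwise_tensor_bytes(n) / 1e6
        print(f"{n} boxes: about {megabytes:.1f} MB per pairwise tensor")
```

Under these assumptions, going from 70 to 1000 boxes multiplies the size of each such tensor by (1000 / 70)^2, roughly 200 times, before counting activations kept for backpropagation.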

victor-ab commented 3 years ago

@sariabod, did you get an answer to your question about MAX_BOXES_NUM? If so, could you share it?

sariabod commented 3 years ago

@victor-ab Digging through the code, from what I could gather, they sort the boxes and then truncate anything beyond MAX_BOXES_NUM (if a document has more entries than this number). I had to preprocess the data, since after OCR some documents had over 1000 elements. Stripping out elements you know are not needed helped bring the documents down to a manageable size.
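A minimal sketch of the two steps described above: filter out unneeded boxes, then sort and truncate. This is not the repository's actual code. The `Box` class, both function names, and the top-to-bottom, left-to-right sort order are assumptions made for illustration; only the name MAX_BOXES_NUM and its default of 70 come from this thread.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_BOXES_NUM = 70  # default value mentioned in this thread


@dataclass
class Box:
    """Hypothetical OCR box: top-left corner plus transcript."""
    x: float
    y: float
    text: str


def drop_unneeded(boxes: List[Box], keep: Callable[[Box], bool]) -> List[Box]:
    """Preprocessing step: keep only the boxes the caller considers useful."""
    return [b for b in boxes if keep(b)]


def sort_and_truncate(boxes: List[Box], max_boxes: int = MAX_BOXES_NUM) -> List[Box]:
    """Sort boxes, then discard everything past the first max_boxes.

    Assumed order: top-to-bottom, then left-to-right.
    """
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    return ordered[:max_boxes]


if __name__ == "__main__":
    # 1000 synthetic boxes; every second one has an empty transcript.
    boxes = [Box(x=float(i % 10), y=float(i // 10),
                 text="" if i % 2 else f"token{i}") for i in range(1000)]
    useful = drop_unneeded(boxes, lambda b: b.text.strip() != "")
    kept = sort_and_truncate(useful)
    print(len(boxes), len(useful), len(kept))
```

Filtering before truncation matters: if truncation runs first, the 70 surviving slots may be spent on boxes that the filter would have discarded anyway.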