wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License
556 stars, 193 forks

Is it possible to extract multiple entities with the same label? #19

Closed: tengerye closed this issue 4 years ago

tengerye commented 4 years ago

Hi, in one of the provided annotation files:

{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
}

I am wondering: if a document has multiple entities of the same type, e.g., several company names, is the model able to find them all? For example, ["company": "company1", "company": "company2"].

If the model supports such a case, how should I prepare the data in the entities folder?

wenwenyu commented 4 years ago

Theoretically, it is possible to train with multiple entities. Following your example, one way to achieve this is to assign a different tag to each company. But we haven't run experiments on documents with multiple entities, so we don't know what the performance would be.
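
A minimal sketch of the "different tag for every company" idea, for preparing an entities file. The helper name `number_repeated_entities` and the `company_1`, `company_2` suffix scheme are hypothetical and not part of the PICK repository; they only illustrate how repeated labels could be turned into unique keys, since a flat JSON object cannot usefully hold the same key twice. Each resulting tag would presumably also have to be added to the model's list of entity types before training.

```python
import json


def number_repeated_entities(pairs):
    """Turn (label, value) pairs into a flat dict with one unique tag per value.

    Labels that occur once keep their name. Labels that occur several times
    get a numeric suffix (company_1, company_2, ...). The suffix scheme is an
    assumption made for illustration only.
    """
    # Count how often each label occurs so single labels stay unchanged.
    totals = {}
    for label, _ in pairs:
        totals[label] = totals.get(label, 0) + 1

    seen = {}
    entities = {}
    for label, value in pairs:
        if totals[label] == 1:
            entities[label] = value
        else:
            seen[label] = seen.get(label, 0) + 1
            entities[f"{label}_{seen[label]}"] = value
    return entities


# Hypothetical receipt with two company names.
pairs = [
    ("company", "company1"),
    ("company", "company2"),
    ("date", "25/12/2018"),
    ("total", "9.00"),
]
entities = number_repeated_entities(pairs)
print(json.dumps(entities, indent=4))
```

With this scheme the number of distinct tags is fixed when the training data is prepared, so it only fits documents where the maximum number of companies is known in advance.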

tengerye commented 4 years ago

@wenwenyu Thank you for your kind reply. What I mean is that we don't know the number of companies beforehand. In other words, your model currently extracts all entities belonging to the company label. But if there are multiple companies in the receipt, how can we tell them apart?

wenwenyu commented 4 years ago

The model can only handle a predetermined number of entities, which is different from the traditional NER task. It cannot extract an unknown or unseen entity type if the training samples don't provide a predefined entity for it. So even if there are multiple companies in a document, the model still assumes there is only one. This method aims to train a task-specific model rather than a generalized one.
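
To illustrate why the entity set is fixed, here is a rough sketch that assumes a BIO-style tag vocabulary built from a predefined list of entity types. The names `ENTITY_TYPES`, `build_tag_vocab`, and `can_predict` are hypothetical and are not taken from the PICK code; the four types come from the annotation example earlier in this thread.

```python
# Entity types from the annotation example in this thread.
ENTITY_TYPES = ["company", "date", "address", "total"]


def build_tag_vocab(entity_types):
    """Build a BIO tag vocabulary: one O tag plus B-/I- tags per entity type."""
    tags = ["O"]
    for entity_type in entity_types:
        tags.append(f"B-{entity_type}")
        tags.append(f"I-{entity_type}")
    return {tag: index for index, tag in enumerate(tags)}


def can_predict(entity_type, vocab):
    """A type can only be produced if its tags exist in the vocabulary."""
    return f"B-{entity_type}" in vocab


tag_vocab = build_tag_vocab(ENTITY_TYPES)

print(len(tag_vocab))                          # number of output classes
print(can_predict("company", tag_vocab))       # type seen during training
print(can_predict("invoice_number", tag_vocab))  # type never defined
```

Under this assumption the output layer has one class per tag, so a type that was never defined has no class to be assigned to, and two companies in one document both fall under the same `company` tags.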