wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
https://arxiv.org/abs/2004.07464
MIT License

Does full-text sequence tagging lose the phrase grouping that OCR already provides? #13

Closed jingmouren closed 4 years ago

jingmouren commented 4 years ago

Looking at the final step that outputs keywords and their categories, it seems all text is concatenated into one long sequence, per-character tags are extracted from it, and consecutive characters with the same tag are then grouped into a phrase carrying that tag. In this process, the beginning or end of a phrase that OCR recognized as a single box may be assigned to other groups. For example, if OCR produced three boxes "XXX", "YYY", "ZZZ", the sequence-tagging split might come out as "XX", "XYY", "YZZ", "Z". Assuming the OCR box grouping is highly accurate, how can this information be used better?

wenwenyu commented 4 years ago

@jingmouren These are good questions.

  1. Does full-text sequence tagging lose the phrase grouping that OCR already provides? We experimented with doing sequence tagging at box level instead of concatenating boxes into a document-level sequence (i.e., feeding the CRF layer input of shape [B*N, T, D] rather than [B, N*T, D]), and the results were worse than the document-level approach. The reason is that with the full-text sequence, the transition matrix learned inside the CRF is global, whereas with a box-level CRF the learned transition matrix only captures box-level semantics and cannot, from a document-level perspective, reach the transition matrix that is optimal over the whole dataset. In other words, the box level lacks sufficient semantic information, while the document level can actually re-learn the grouping that OCR already provides. When an entity spans multiple lines (e.g., a multi-line address), the box-level method degrades.
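The two CRF input shapes contrasted above can be illustrated with a simple reshape (a minimal sketch; tensor names and sizes here are made up for illustration, not taken from the PICK code):

```python
import torch

B, N, T, D = 2, 4, 8, 16           # batch, boxes per document, tokens per box, feature dim
x = torch.randn(B, N, T, D)        # encoder output: one feature row per token

# Box-level CRF: every box is decoded as an independent sequence -> [B*N, T, D].
# The transition matrix only ever sees within-box tag transitions.
box_level = x.reshape(B * N, T, D)

# Document-level CRF: boxes are concatenated into one long sequence -> [B, N*T, D].
# The transition matrix also sees transitions across box boundaries.
doc_level = x.reshape(B, N * T, D)
```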

  2. "Per-character tags are then extracted, and consecutive characters with the same tag are grouped into a phrase carrying that tag. In this process, the beginning or end of a phrase that OCR recognized as one box may be assigned to other groups."

    I didn't fully follow your phrasing, so let me answer the question as I understand it: during tagging, can the text inside one box be assigned to a different entity? The code provides three tagging modes, controlled by the iob_tagging_type parameter of the Document class (box_level, document_level, box_and_within_box_level). In box_level mode, tags never cross boxes, so text is never assigned to another entity's category, but this mode's annotation is imprecise. document_level first concatenates the full text and then assigns tags by matching each entity's value, so text in other boxes can occasionally be tagged as another entity; since entity values are usually fairly unique, this happens rarely, is basically negligible, and the model has some fault tolerance. The third mode, box_and_within_box_level, sits between the first two; see the code for details. In our experiments, box_level performs worse than the other two. Which mode is appropriate ultimately depends on the data and the business requirements.
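The failure mode of document_level described above (text in other boxes occasionally getting an entity's tag) can be sketched with a hypothetical substring-matching tagger; this is an illustration of the idea, not the Document class implementation:

```python
def document_level_tags(boxes, entity_value, entity_type):
    """Sketch of document_level tagging: concatenate all box texts,
    then BIO-tag every occurrence of the entity value found by
    substring search (hypothetical helper, not the repo's code)."""
    text = "".join(boxes)
    tags = ["O"] * len(text)
    start = text.find(entity_value)
    while start != -1:
        tags[start] = "B-" + entity_type
        for i in range(start + 1, start + len(entity_value)):
            tags[i] = "I-" + entity_type
        start = text.find(entity_value, start + 1)
    return tags

# The TOTAL amount "42.00" also appears in an unrelated box, so that
# box's text gets tagged as TOTAL too -- the rare mislabeling case.
boxes = ["TOTAL", "42.00", "42.00"]
tags = document_level_tags(boxes, "42.00", "TOTAL")
```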

  3. Assuming the OCR box grouping is highly accurate, how can this information be used better? To exploit the phrase grouping OCR already provides, you can set the iob_tagging_type parameter of the Document class to box_and_within_box_level: in this mode, the BIO tags already separate the text of each box via a B-entity tag at each box start, so the labels fed to the model let it learn the semantic information carried by OCR. Whether there is an even better and more reasonable way is open for discussion.
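The key idea of box_and_within_box_level, as described above, is that the label sequence restarts with a B- tag at every box boundary, so the labels themselves encode OCR's grouping. A minimal sketch under that assumption (hypothetical helper, not the repository's code):

```python
def box_and_within_box_tags(boxes, entity_box_ids, entity_type):
    """Sketch of box_and_within_box_level tagging: tags never cross
    boxes, and every box belonging to the entity restarts with B-,
    so box boundaries are recoverable from the labels alone."""
    tags = []
    for i, box in enumerate(boxes):
        if i in entity_box_ids:
            tags.append(["B-" + entity_type] + ["I-" + entity_type] * (len(box) - 1))
        else:
            tags.append(["O"] * len(box))
    return tags

# An address split by OCR across two boxes: each box begins with B-ADDR,
# unlike plain document_level tagging where only the first would.
tags = box_and_within_box_tags(["12 Main St", "Springfield"], {0, 1}, "ADDR")
```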

jingmouren commented 4 years ago

Besides how tags are assigned, does the decoder's CRF also affect this process? Could the CRF be modified so that, at prediction time, the begin of each class aligns with the begin of a box and the end aligns with the end of a box? Or would such a modification make the CRF non-differentiable and therefore infeasible?

wenwenyu commented 4 years ago

@jingmouren Box level vs. document level has a large effect on the CRF. What modification to the CRF do you have in mind? Aligning begin/end with box boundaries makes no difference for this task.

jingmouren commented 4 years ago

The article "Summary of entity/relation extraction methods" (https://zhuanlan.zhihu.com/p/77868938) says that BiLSTM-CRF has no significant advantage over BiLSTM-softmax when the label categories are complex, and that since word-segmentation boundary errors cause entity extraction errors, a LatticeLSTM[2]+CRF approach can inject lexicon information and avoid segmentation errors. Have you tried softmax or this LatticeLSTM?

Also, I read somewhere that phrases and labels can be many-to-many. Is there an example?

wenwenyu commented 4 years ago

@jingmouren The link you shared is a valuable reference. Early on we did try dropping the CRF at decoding time and using softmax directly, but we observed that with small sample sizes (below the tens of thousands), softmax could not extract entity spans well: predicted categories were duplicated and the integrity of entities could not be guaranteed. As for the claim (raised in LAN) that the CRF offers no significant advantage in entity/relation extraction: our method only does entity extraction, not relation extraction, so whether that conclusion applies here needs further verification.

Injecting lexicon information as in the LatticeLSTM paper is a good idea; a future improvement might try it.

In our task, we assume phrases and labels are one-to-one; the many-to-many case is currently not considered and we have not tried it. In complex scenarios it may well be necessary, and follow-up work may consider it.

AtulKumar4 commented 3 years ago

English translation of the whole conversation @wenwenyu @jingmouren

wenwenyu commented 3 years ago

@AtulKumar4 Thanks.

tengerye commented 3 years ago

So far, I have only observed incorrect tokenization within a single box. For example, "XXXX", "YYY" was predicted as "XX" (label A), "X" (label O), "X" (label A), "YYY" (label C). One way to solve this is to record the indices of the original OCR boxes and use a combination strategy (e.g., voting) to produce one unified tag per box. In the above example, you would get label A for "XXXX".
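The voting strategy described above could be sketched like this (helper names are hypothetical; per-character predictions within each recorded box span are reduced to a single label by majority vote):

```python
from collections import Counter

def vote_box_tags(char_labels, box_spans):
    """Majority-vote one entity label per OCR box.

    char_labels: per-character predicted labels with the B-/I- prefix
    already stripped to the bare entity type (or "O").
    box_spans: (start, end) character indices of each original OCR box.
    """
    box_labels = []
    for start, end in box_spans:
        counts = Counter(char_labels[start:end])
        box_labels.append(counts.most_common(1)[0][0])
    return box_labels

# "XXXX" predicted as A, A, O, A -> voting restores label A for the box.
labels = vote_box_tags(["A", "A", "O", "A", "C", "C", "C"], [(0, 4), (4, 7)])
# labels == ["A", "C"]
```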

There is another way (I have not tried it yet): use only the first token of each box as the input to the fully connected layer for prediction. This ensures that the number of labels equals the number of boxes. But remember to fuse information along the sequence axis before the FC layer.
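A rough sketch of that first-token variant (all names, shapes, and the fusion step are illustrative assumptions, not code from the repo; the real fusion layer, e.g. self-attention, is a design choice):

```python
import torch
import torch.nn as nn

B, N, T, D, C = 2, 3, 5, 16, 4     # batch, boxes, tokens per box, dim, classes
x = torch.randn(B, N, T, D)        # per-token features

# Fuse information along the token axis first (here a trivial mean-pooling
# residual stands in for a proper fusion layer such as self-attention).
fused = x + x.mean(dim=2, keepdim=True)

# Classify using only the first token of each box -> exactly one label per box.
first_tokens = fused[:, :, 0, :]           # [B, N, D]
logits = nn.Linear(D, C)(first_tokens)     # [B, N, C]
```

Because prediction is per box rather than per character, the output can never split one OCR box across several entities.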

ziodos commented 3 years ago

@tengerye Can you please explain the "voting" strategy in more detail?