On data generation issues

zxk19981227 commented 4 years ago

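When the code parses the GENIA dataset, it matches start and end positions one-to-one in order of occurrence. But GENIA is a nested dataset, so how does this approach avoid nesting?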

yahshibu commented 4 years ago

I don't understand Chinese.

Thank you for having an interest!

Your question translated into English is as follows:

On data generation issues
When parsing the GENIA dataset used in the code, the start and end positions are matched in order of occurrence. But GENIA is a nested dataset, so how does this approach avoid nesting?

The following part of my code pairs each "start" with an "end" in order of occurrence, using a stack: https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L154-L165

A stack enables us to extract nested structure from a sequence of "start"s and "end"s.
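As a rough illustration of the idea (a minimal sketch, not the code in parse_genia.py; the function name `pair_nested` and the event format are assumptions made for this example), each "end" closes the most recently opened "start":

```python
def pair_nested(events):
    """Pair "start"/"end" events into (start_pos, end_pos) spans using a stack.

    `events` is a list of (position, kind) tuples sorted by position,
    where kind is "start" or "end". This format is illustrative only.
    """
    stack = []  # positions of mentions that are opened but not yet closed
    spans = []  # completed (start, end) pairs, in the order they close
    for pos, kind in events:
        if kind == "start":
            stack.append(pos)
        else:
            # An "end" always closes the most recently opened mention.
            spans.append((stack.pop(), pos))
    return spans


events = [(0, "start"), (1, "start"), (5, "end"), (9, "end")]
print(pair_nested(events))  # [(1, 5), (0, 9)]
```

The inner span (1, 5) is completed first and the outer span (0, 9) second, which is how the nesting is recovered.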

zxk19981227 commented 4 years ago

Thanks, sir. Maybe I missed some information there. But when I use your code to process the GENIA data and use labels to mark the begin and end independently, an error occurs because the end positions of some tags are beyond the maximum length. I am confused and can't understand some parts of your code. Could you tell me more about how the processing works?

yahshibu commented 4 years ago

First of all, please make sure that you are using the following corpus: http://www.nactem.ac.uk/GENIA/current/GENIA-corpus/Part-of-speech/GENIAcorpus3.02p.tgz I parsed this corpus with my code again, and no error occurred.

I will explain the code with the following example.

<cons lex="IL-2_gene_expression" sem="G#other_name"><cons lex="IL-2_gene" sem="G#DNA_domain_or_region"><w c="NN">IL-2</w> <w c="NN">gene</w></cons> <w c="NN">expression</w></cons>

This consists of 179 characters. There are three words: IL-2, gene, and expression. There are two mentions: IL-2 gene and IL-2 gene expression.

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L97 The words_begin list contains the positions where the opening <w ...> tags end, not the positions where the closing </w> tags end. Those positions are exactly where the words start. With the above example, the list will be [113, 132, 158].

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L107 This words_end list contains the start positions of </w> tags in the input. The list will be [117, 136, 168] with the above example.

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L116 This words list contains the words in the input. The list will be ["IL-2", "gene", "expression"].

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L125 This mentions_begin list contains the end positions of <cons tags in the input. The list will be [52, 103] with the above example.

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L135 This mentions_end list contains the start positions of </cons> tags in the input. The list will be [140, 172] with the above example.

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L144 This tags list contains the labeled NER classes of the mentions. The list will be ["G#other_name", "G#DNA_domain_or_region"].

In summary, the lists will be as follows:

words_begin = [113, 132, 158]
words_end =  [117, 136, 168]
words = ["IL-2", "gene", "expression"]
mentions_begin = [52, 103]
mentions_end = [140, 172]
tags = ["G#other_name", "G#DNA_domain_or_region"]
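The six lists can be reproduced for this example with a few regular expressions. This is only a sketch under my own assumptions: the patterns below are written for this one example and are not the ones used in parse_genia.py.

```python
import re

sentence = (
    '<cons lex="IL-2_gene_expression" sem="G#other_name">'
    '<cons lex="IL-2_gene" sem="G#DNA_domain_or_region">'
    '<w c="NN">IL-2</w> <w c="NN">gene</w></cons> '
    '<w c="NN">expression</w></cons>'
)

# Opening <w ...> tags: the end of each match is where the word text starts.
words_begin = [m.end() for m in re.finditer(r"<w [^>]*>", sentence)]
# Closing </w> tags: the start of each match is where the word text ends.
words_end = [m.start() for m in re.finditer(r"</w>", sentence)]
words = [sentence[b:e] for b, e in zip(words_begin, words_end)]

# Opening <cons ...> tags and their sem attribute, in order of appearance.
cons_open = list(re.finditer(r'<cons [^>]*sem="([^"]*)"[^>]*>', sentence))
mentions_begin = [m.end() for m in cons_open]
tags = [m.group(1) for m in cons_open]
# Closing </cons> tags: these appear in closing order, not in tag order.
mentions_end = [m.start() for m in re.finditer(r"</cons>", sentence)]

print(len(sentence))                 # 179
print(words_begin, words_end)        # [113, 132, 158] [117, 136, 168]
print(mentions_begin, mentions_end)  # [52, 103] [140, 172]
```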

Each of the first three lists has three elements, one per word. Each of the last three lists has two elements, one per mention. Please note that the elements of words_begin and words_end are in the same order as the elements of words, and that the elements of mentions_begin are in the same order as the elements of tags. The elements of mentions_end are NOT in that order: with nesting, the inner mention closes first, so 140 belongs to the second tag and 172 to the first.

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L156 This for-statement iterates over every character position in the input. In the case of the above example, it counts up from 0 to 178.

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L157 This part checks whether the mentions_begin list contains the current index. If it does, an instance of the Label class is created. For this instance, we need the start word-index and the NER tag. To get the start word-index, we have to find the word with which the mention starts: it is the smallest number in words_begin that is larger than the current index. If the current index is 52 (which is in mentions_begin), the smallest number in words_begin that is larger than 52 is 113. 113 is the start position of IL-2, so IL-2 is the start word of the mention. What we want in the end is the word-index, not the word. We could get it from the index of IL-2 in the words list or from the index of 113 in the words_begin list. words_begin is the better choice because the same word can occur several times in words. For the NER tag, we only have to take the corresponding element of the tags list. Once the start word-index and the NER tag have been assigned, the Label instance is pushed onto the stack.
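The "smallest number in words_begin that is larger than the current index" lookup can be sketched with the standard bisect module. This is one possible way to write it, not necessarily how parse_genia.py does it, and `start_word_index` is a hypothetical helper name.

```python
from bisect import bisect_right

words_begin = [113, 132, 158]  # sorted start positions of the words


def start_word_index(position, words_begin):
    """Index of the first word that starts after `position` (hypothetical helper)."""
    # bisect_right returns the index of the first element strictly greater
    # than `position`, i.e. the smallest number larger than the current index.
    return bisect_right(words_begin, position)


print(start_word_index(52, words_begin))   # 0 -> IL-2
print(start_word_index(103, words_begin))  # 0 -> IL-2
```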

https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L162 This part checks whether the mentions_end list contains the current index. If it does, the topmost instance on the stack (the most recently created one) is popped. For this instance, we need the end word-index. To get it, we have to find the word with which the mention ends: it is the largest number in words_end that is smaller than the current index. If the current index is 140 (which is in mentions_end), the largest number in words_end that is smaller than 140 is 136. 136 is the end position of gene, so gene is the end word of the mention. We can get the word-index of gene from the index of 136 in the words_end list. In the trace below, the stored end value is this index plus one, so a mention ending at gene (index 1) is recorded with end 2.
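The "largest number in words_end that is smaller than the current index" lookup can be sketched in the same way with bisect. Again, this is only an illustration under my own naming (`end_word_index` is a hypothetical helper), not the code in the repository.

```python
from bisect import bisect_left

words_end = [117, 136, 168]  # sorted end positions of the words


def end_word_index(position, words_end):
    """Index of the last word that ends before `position` (hypothetical helper)."""
    # bisect_left returns how many elements are strictly smaller than
    # `position`; subtracting one gives the index of the largest of them.
    return bisect_left(words_end, position) - 1


print(end_word_index(140, words_end))  # 1 -> gene
print(end_word_index(172, words_end))  # 2 -> expression
```

Adding one to these results gives the end values 2 and 3 that appear in the trace.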

I describe how the lists will change in this for-loop below:

# index == 52
stack: [] -> [(0, None, "G#other_name")]
queue: [] -> []
tags: ["G#other_name", "G#DNA_domain_or_region"] -> ["G#DNA_domain_or_region"]

# index == 103
stack: [(0, None, "G#other_name")] -> [(0, None, "G#other_name"), (0, None, "G#DNA_domain_or_region")]
queue: [] -> []
tags: ["G#DNA_domain_or_region"] -> []

# index == 140
stack: [(0, None, "G#other_name"), (0, None, "G#DNA_domain_or_region")] -> [(0, None, "G#other_name")]
queue: [] -> [(0, 2, "G#DNA_domain_or_region")]
tags: [] -> []

# index == 172
stack: [(0, None, "G#other_name")] -> []
queue: [(0, 2, "G#DNA_domain_or_region")] -> [(0, 2, "G#DNA_domain_or_region"), (0, 3, "G#other_name")]
tags: [] -> []
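The whole loop can be replayed on the example lists with the short sketch below. It is a simplified stand-in for the real code: plain tuples replace the Label class, bisect replaces whatever lookup parse_genia.py uses, and `remaining_tags` is a name I introduce here.

```python
from bisect import bisect_left, bisect_right

words_begin = [113, 132, 158]
words_end = [117, 136, 168]
mentions_begin = [52, 103]
mentions_end = [140, 172]
tags = ["G#other_name", "G#DNA_domain_or_region"]
num_chars = 179

stack = []  # open mentions as (start_word_index, tag)
queue = []  # finished mentions as (start_word_index, end_word_index + 1, tag)
remaining_tags = list(tags)  # consumed front to back, like `tags` in the trace

for index in range(num_chars):
    if index in mentions_begin:
        # First word that starts after the opening <cons ...> tag.
        start = bisect_right(words_begin, index)
        stack.append((start, remaining_tags.pop(0)))
    if index in mentions_end:
        # The closing </cons> belongs to the most recently opened mention.
        start, tag = stack.pop()
        # Number of words ending before this position = last word index + 1.
        end = bisect_left(words_end, index)
        queue.append((start, end, tag))

print(queue)
# [(0, 2, 'G#DNA_domain_or_region'), (0, 3, 'G#other_name')]
```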

zxk19981227 commented 4 years ago

Thank you very much, sir! I have worked on this dataset for several days, and your reply helped me solve the problem. I misunderstood some parts of your code, and I have now finished the work!

yahshibu commented 4 years ago

You're welcome!

yahshibu / nested-ner-tacl2020-transformers

On data generation issues #2