Closed zxk19981227 closed 4 years ago
I don't understand Chinese.
Thank you for having an interest!
Your question translated into English is as follows:
On data generation issues
When the code parses the GENIA dataset, the starts and ends are matched in order of occurrence, but GENIA is a nested dataset. How does this approach cope with the nesting?
The following part of my code pairs the "start"s and "end"s. They are processed in order of occurrence, but I'm using a stack: https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L154-L165
A stack enables us to extract nested structure from a sequence of "start"s and "end"s.
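As a minimal, generic sketch of that idea (not the repository's code): each "end" is paired with the most recent unmatched "start", so inner spans close before outer ones. The function name `pair_events` and the `(kind, position)` event format are assumptions made for illustration.

```python
def pair_events(events):
    """Pair each "end" with the most recent unmatched "start".

    `events` is a sequence of (kind, position) tuples in order of occurrence.
    """
    stack = []  # positions of starts that are still open
    spans = []  # completed (start, end) pairs, innermost first
    for kind, position in events:
        if kind == "start":
            stack.append(position)
        elif kind == "end":
            # The top of the stack is the innermost open start.
            spans.append((stack.pop(), position))
    return spans


events = [("start", 0), ("start", 1), ("end", 2), ("end", 3)]
print(pair_events(events))  # [(1, 2), (0, 3)]
```

The inner pair `(1, 2)` is emitted before the outer pair `(0, 3)`, which is the nested structure a plain first-to-first matching would miss.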
Thanks, sir. Maybe I missed some information there. But when I use your code to process the GENIA data and use labels to mark the begin and end independently, an error occurs because some tags' end positions exceed the maximum length. I am confused and can't understand some parts of your code. Could you tell me more about the processing method?
First of all, please make sure that you are using the following corpus: http://www.nactem.ac.uk/GENIA/current/GENIA-corpus/Part-of-speech/GENIAcorpus3.02p.tgz I parsed this corpus with my code again, and no error occurred.
I will explain the code with the following example.
<cons lex="IL-2_gene_expression" sem="G#other_name"><cons lex="IL-2_gene" sem="G#DNA_domain_or_region"><w c="NN">IL-2</w> <w c="NN">gene</w></cons> <w c="NN">expression</w></cons>
This consists of 179 characters. There are three words: `IL-2`, `gene`, and `expression`. There are two mentions: `IL-2 gene` and `IL-2 gene expression`.
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L97
This `words_begin` list contains the end positions of the opening `<w` tags in the input. It does not mean the positions where the closing `</w>` tags end. It means the positions where the opening `<w` tags end, which are equal to the positions where the words start. The list will be [113, 132, 158] with the above example.
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L107
This `words_end` list contains the start positions of `</w>` tags in the input. The list will be [117, 136, 168] with the above example.
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L116
This `words` list contains the words in the input. The list will be ["IL-2", "gene", "expression"].
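The three word lists can be reproduced with a short sketch. The regular expressions here are assumptions chosen for illustration; the actual patterns in `parse_genia.py` may differ.

```python
import re

text = (
    '<cons lex="IL-2_gene_expression" sem="G#other_name">'
    '<cons lex="IL-2_gene" sem="G#DNA_domain_or_region">'
    '<w c="NN">IL-2</w> <w c="NN">gene</w></cons> '
    '<w c="NN">expression</w></cons>'
)

# Hypothetical patterns: end of each opening <w ...> tag, start of each </w>.
words_begin = [m.end() for m in re.finditer(r"<w [^>]*>", text)]
words_end = [m.start() for m in re.finditer(r"</w>", text)]
# Each word is the text between the two positions.
words = [text[b:e] for b, e in zip(words_begin, words_end)]

print(len(text))    # 179
print(words_begin)  # [113, 132, 158]
print(words_end)    # [117, 136, 168]
print(words)        # ['IL-2', 'gene', 'expression']
```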
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L125
This `mentions_begin` list contains the end positions of `<cons` tags in the input. The list will be [52, 103] with the above example.
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L135
This `mentions_end` list contains the start positions of `</cons>` tags in the input. The list will be [140, 172] with the above example.
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L144
This `tags` list contains the labeled NER classes of the mentions. The list will be ["G#other_name", "G#DNA_domain_or_region"].
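The three mention lists can be sketched the same way. Again, the regular expressions are assumptions for illustration, and this sketch assumes every `<cons` tag carries a `sem` attribute, which holds for the example but may not hold for the whole corpus.

```python
import re

text = (
    '<cons lex="IL-2_gene_expression" sem="G#other_name">'
    '<cons lex="IL-2_gene" sem="G#DNA_domain_or_region">'
    '<w c="NN">IL-2</w> <w c="NN">gene</w></cons> '
    '<w c="NN">expression</w></cons>'
)

# Hypothetical patterns: end of each opening <cons ...> tag, start of each </cons>.
mentions_begin = [m.end() for m in re.finditer(r"<cons [^>]*>", text)]
mentions_end = [m.start() for m in re.finditer(r"</cons>", text)]
# The sem attribute of each opening tag, in order of appearance.
tags = re.findall(r'<cons [^>]*sem="([^"]*)"', text)

print(mentions_begin)  # [52, 103]
print(mentions_end)    # [140, 172]
print(tags)            # ['G#other_name', 'G#DNA_domain_or_region']
```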
In summary, the lists will be as follows:
words_begin = [113, 132, 158]
words_end = [117, 136, 168]
words = ["IL-2", "gene", "expression"]
mentions_begin = [52, 103]
mentions_end = [140, 172]
tags = ["G#other_name", "G#DNA_domain_or_region"]
Each of the first three lists has three elements, one per word. Each of the last three lists has two elements, one per mention. Please note that the elements of `words_begin` and `words_end` are in the same order as the elements of `words`, and that the elements of `mentions_begin` are in the same order as the elements of `tags`. The elements of `mentions_end` are NOT in that order: in the example, 140 closes the second mention (`G#DNA_domain_or_region`) and 172 closes the first (`G#other_name`).
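To see why `mentions_end` cannot be paired with `mentions_begin` by list position, here is a small check using the values from the example. The `expected` spans are read off the XML nesting by hand.

```python
mentions_begin = [52, 103]
mentions_end = [140, 172]
tags = ["G#other_name", "G#DNA_domain_or_region"]

# Naive pairing by list position: wrong for nested mentions.
naive = list(zip(mentions_begin, mentions_end, tags))
print(naive)
# [(52, 140, 'G#other_name'), (103, 172, 'G#DNA_domain_or_region')]

# Spans implied by the nesting: the outer mention opens first (52)
# but closes last (172); the inner one opens at 103 and closes at 140.
expected = [(52, 172, "G#other_name"), (103, 140, "G#DNA_domain_or_region")]
print(naive == expected)  # False
```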
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L156
This for-statement counts up from 0 to one less than the number of characters in the input. In the case of the above example, it counts up from 0 to 178.
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L157
This part checks if the `mentions_begin` list contains the current number, `index`.
If the `mentions_begin` list contains the current number, an instance of the Label class is created. Here, we need to know the start word-index and the NER tag for this instance. To get the start word-index of the mention, we have to find the word with which the mention starts. We can find the start word by finding the smallest number in the `words_begin` list that is larger than the current number. If the current number is 52 (which is included in `mentions_begin`), the smallest number in `words_begin` that is larger than 52 is 113. 113 is the start position of `IL-2`, which means `IL-2` is the start word of a mention. What we want in the end is the word-index, not the word. We can get the word-index of `IL-2` by looking up the index of `IL-2` in the `words` list or the index of 113 in the `words_begin` list. It is better to use `words_begin` than `words`, because the same word can appear more than once in `words`. As for the NER tag, we only have to take the corresponding element from the `tags` list. After the start word-index and the NER tag have been assigned, the instance of the Label class is pushed onto `stack`.
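The start-word lookup described above can be sketched with the standard `bisect` module. This is an illustration rather than the repository's implementation; the helper name `start_word_index` is hypothetical, and the sketch assumes `words_begin` is sorted in ascending order.

```python
from bisect import bisect_right

words_begin = [113, 132, 158]


def start_word_index(words_begin, position):
    # Index of the smallest element strictly greater than `position`.
    # Returns len(words_begin) if no such element exists.
    return bisect_right(words_begin, position)


print(start_word_index(words_begin, 52))   # 0
print(start_word_index(words_begin, 103))  # 0
```

Both mentions in the example start with word 0 (`IL-2`), so both lookups return 0.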
https://github.com/yahshibu/nested-ner-tacl2020-transformers/blob/ad620aa2cdac876c55b7aac38a30c5332a4beedc/parse_genia.py#L162
This part checks if the `mentions_end` list contains the current number, `index`.
If the `mentions_end` list contains the current number, the topmost instance in the stack (the most recently created one) is popped. Here, we need to know the end word-index for this instance. To get the end word-index of the mention, we have to find the word with which the mention ends. We can find the end word by finding the largest number in the `words_end` list that is smaller than the current number. If the current number is 140 (which is included in `mentions_end`), the largest number in `words_end` that is smaller than 140 is 136. 136 is the end position of `gene`, which means `gene` is the end word of a mention. We can get the word-index of `gene` by looking up the index of 136 in the `words_end` list.
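The end-word lookup can be sketched in the same spirit with `bisect`. The helper name `end_word_index` is hypothetical, and the sketch assumes `words_end` is sorted in ascending order. It returns the index of the end word itself; the trace later in this thread stores 2 and 3 for the same mentions, which suggests the stored end is exclusive (index + 1), though that is an inference from the trace.

```python
from bisect import bisect_left

words_end = [117, 136, 168]


def end_word_index(words_end, position):
    # Index of the largest element strictly smaller than `position`.
    # Returns -1 if no such element exists.
    return bisect_left(words_end, position) - 1


print(end_word_index(words_end, 140))  # 1
print(end_word_index(words_end, 172))  # 2
```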
I describe how the lists will change in this for-loop below:
# index == 52
stack: [] -> [(0, None, "G#other_name")]
queue: [] -> []
tags: ["G#other_name", "G#DNA_domain_or_region"] -> ["G#DNA_domain_or_region"]
# index == 103
stack: [(0, None, "G#other_name")] -> [(0, None, "G#other_name"), (0, None, "G#DNA_domain_or_region")]
queue: [] -> []
tags: ["G#DNA_domain_or_region"] -> []
# index == 140
stack: [(0, None, "G#other_name"), (0, None, "G#DNA_domain_or_region")] -> [(0, None, "G#other_name")]
queue: [] -> [(0, 2, "G#DNA_domain_or_region")]
tags: [] -> []
# index == 172
stack: [(0, None, "G#other_name")] -> []
queue: [(0, 2, "G#DNA_domain_or_region")] -> [(0, 2, "G#DNA_domain_or_region"), (0, 3, "G#other_name")]
tags: [] -> []
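The trace above can be reproduced with a compact sketch of the loop. This is a simplified stand-in for the linked code, not a copy of it: plain lists replace the Label class, `bisect` replaces whatever search the repository uses, and the end index is stored as exclusive so that the output matches the trace.

```python
from bisect import bisect_left, bisect_right
from collections import deque

words_begin = [113, 132, 158]
words_end = [117, 136, 168]
mentions_begin = [52, 103]
mentions_end = [140, 172]
tags = deque(["G#other_name", "G#DNA_domain_or_region"])

text_length = 179
stack = []  # open mentions as [start_word, end_word, tag]
queue = []  # completed mentions as (start_word, end_word, tag)

for index in range(text_length):
    if index in mentions_begin:
        # First word that starts after this opening tag.
        start_word = bisect_right(words_begin, index)
        stack.append([start_word, None, tags.popleft()])
    if index in mentions_end:
        # Close the most recently opened mention.
        label = stack.pop()
        # Number of words ending before this closing tag (exclusive end).
        label[1] = bisect_left(words_end, index)
        queue.append(tuple(label))

print(queue)
# [(0, 2, 'G#DNA_domain_or_region'), (0, 3, 'G#other_name')]
```

The inner mention is completed first because it sits on top of the stack when index 140 is reached.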
Thank you very much, sir! I have worked on this dataset for several days, and your reply helped me solve the problem. I had misunderstood some parts of your code, and I have now finished the work!
You're welcome!
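When the GENIA dataset used in the code is parsed, each start is matched one-to-one with an end in order of appearance, but GENIA is a nested dataset. How does this approach avoid nesting?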