yahshibu / nested-ner-tacl2020-transformers

Implementation of Nested Named Entity Recognition using BERT
GNU General Public License v3.0

data generation issues in genia dataset #3

Closed LiujunWang closed 4 years ago

LiujunWang commented 4 years ago

When I parse the GENIA dataset used in the code, some spans in the same sentence belong to two or more categories. Is something wrong with the data generation?

yahshibu commented 4 years ago

Thank you for your interest!

There is nothing wrong.

These studies identify p21ras as a target of the same cells .

This is a sentence included in the GENIA dataset, and the span "p21ras" belongs to two categories, "protein" and "DNA". This comes directly from the original annotation.

In the original corpus (GENIAcorpus3.02.merged.xml), the above span "p21ras" is labeled as follows: <cons lex="p21ras" sem="G#DNA_domain_or_region"><cons lex="p21ras" sem="G#protein_molecule"><w c="NN">p21ras</w></cons></cons>. The two nested cons elements cover exactly the same token, so "p21ras" belongs to both categories, at least in this context.
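To make the annotation above concrete, here is a minimal sketch that parses the quoted snippet with Python's standard `xml.etree.ElementTree` and groups the `sem` values by token span. It is not the preprocessing code of this repository: the helper name `collect_spans`, the half-open token-index convention, and the choice to keep the raw `sem` strings (rather than mapping them to coarse types such as "DNA" and "protein") are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The annotation quoted above, copied from the corpus excerpt.
SNIPPET = (
    '<cons lex="p21ras" sem="G#DNA_domain_or_region">'
    '<cons lex="p21ras" sem="G#protein_molecule">'
    '<w c="NN">p21ras</w></cons></cons>'
)


def collect_spans(elem, start, spans):
    """Hypothetical helper: walk the tree depth-first.

    `start` is the index of the first token under `elem`. The function
    returns the index just past the last token under `elem`, and records
    the `sem` value of every <cons> element under its half-open token
    span (start, end).
    """
    if elem.tag == "w":
        # Each <w> element is exactly one token.
        return start + 1
    end = start
    for child in elem:
        end = collect_spans(child, end, spans)
    if elem.tag == "cons" and elem.get("sem"):
        spans[(start, end)].append(elem.get("sem"))
    return end


spans = defaultdict(list)
collect_spans(ET.fromstring(SNIPPET), 0, spans)

# Sort labels so the result does not depend on traversal order.
labels_by_span = {span: sorted(labels) for span, labels in spans.items()}

# Spans that carry more than one category.
multi_label = {s: l for s, l in labels_by_span.items() if len(l) > 1}
print(multi_label)
```

Both `cons` elements resolve to the same span `(0, 1)`, so that single span ends up with two labels, which is the situation described in the question.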

LiujunWang commented 4 years ago

Thanks for your reply. I have read a lot of papers about nested NER, and almost all of them assume that a span belongs to exactly one category, which seemed natural to me. It now seems that I overlooked something. If I may ask, did you consider this problem (a span may belong to two or more categories) in your paper?

yahshibu commented 4 years ago

Yes, our paper considers this problem. To my understanding, the following papers take it into account, too.

Some of the other papers that our paper refers to might also handle spans with two or more categories, but I cannot say exactly which ones do.

LiujunWang commented 4 years ago

Thank you very much. I now understand the nested NER task better.

yahshibu commented 4 years ago

You're welcome!