zjunlp / DeepKE

[EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction
http://deepke.zjukg.cn/
MIT License

Questions about entity markers and exceeding max_seq_length=1024 in document-level RE #264

Closed iden-alex closed 1 year ago

iden-alex commented 1 year ago

Thanks for the work! I applied the document-level RE model to a Russian dataset and have several questions:

1) What is the meaning of the symbol "-" in this list in read_docred? entity_type = ["-", "ORG", "-", "LOC", "-", "TIME", "-", "PER", "-", "MISC", "-", "NUM"] There are about 50 entity types in my dataset and I made a similar list, but I don't understand what "-" is for.

2) Do I need to add the tokens [unused1], ..., [unused50] to the tokenizer separately, or is this already done somewhere in the code?

3) Why is just "*" used as the entity marker for RoBERTa (this line), and not special tokens as for BERT?

4) In the model, the text is divided into segments of 512 tokens (process_long_input). If the input length is > 1024, is the text truncated? Can you point to the line of code where this happens?

TimelordRi commented 1 year ago

Hi, I will try to answer your questions.

Before I answer these questions, you need to know what we are actually doing here. Note that after processing the sentences in a long text, we surround each entity with two special tokens. For example, take the sentence "Tom was born in USA." After processing, it becomes "[unused7] Tom [unused57] was born in [unused3] USA [unused53] ." according to the entity_type list when using BERT.

The special token [unusedX] (X = 0, 1, 2, ...) marks the start of an entity of a certain type, while [unused(X+offset)] (the offset in our code is 50, and can be modified) marks its end. The [unusedX] tokens in the BERT vocabulary are randomly initialized and reserved precisely so that the vocabulary can be conveniently extended with special tokens as needed.

Here, we use these special tokens to mark the entities in the input sentences, which injects extra information about them (position, type, etc.).
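
To make the marking scheme concrete, here is a minimal, self-contained sketch. This is not DeepKE's actual read_docred code; mark_entities and its inputs are illustrative:

```python
# A minimal sketch of the marking scheme described above (illustrative only).
entity_type = ["-", "ORG", "-", "LOC", "-", "TIME", "-", "PER", "-", "MISC", "-", "NUM"]
OFFSET = 50  # distance between a start marker and its end marker

def mark_entities(tokens, mentions):
    """Wrap each mention with [unusedX] ... [unused(X+OFFSET)] markers.

    tokens:   list of words
    mentions: (start, end, type) spans, with end exclusive
    """
    out = []
    for i, tok in enumerate(tokens):
        # Start markers go before the first token of a mention.
        out += [f"[unused{entity_type.index(t)}]"
                for s, e, t in mentions if i == s]
        out.append(tok)
        # End markers go after the last token of a mention.
        out += [f"[unused{entity_type.index(t) + OFFSET}]"
                for s, e, t in mentions if i == e - 1]
    return out

tokens = ["Tom", "was", "born", "in", "USA", "."]
mentions = [(0, 1, "PER"), (4, 5, "LOC")]
print(" ".join(mark_entities(tokens, mentions)))
# -> [unused7] Tom [unused57] was born in [unused3] USA [unused53] .
```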

  1. The "-" entries in the entity_type list are not necessary, so you can delete them.
  2. No. The [unusedX] (X = 0, 1, 2, ...) tokens are already in the BERT vocabulary, so they will be encoded by the BERT tokenizer as-is (see the check below).
  3. For other models such as RoBERTa, there are no unused tokens in the vocabulary, so we did not apply this there. You can implement it yourself by manually adding special tokens (see the sketch below).
  4. The function process_long_input is designed for inputs whose length is between 512 and 1024 tokens. We truncate inputs longer than 1024 tokens at this line (for the DocRED dataset): https://github.com/zjunlp/DeepKE/blob/dcbbb66a70b41e97cdcbfcc859be4e2c886291b3/src/deepke/relation_extraction/document/prepro.py#L233 If you want to feed in inputs longer than 1024 tokens, you could modify the process_long_input function (see the simplified sketch below).
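
To make answer 2 concrete, here is a quick check (assuming the Hugging Face transformers API; this snippet is not part of DeepKE) that the markers already map to single ids in BERT's vocabulary:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# The markers are existing single entries in vocab.txt, so they map to
# real ids rather than [UNK] -- no add_tokens() call is needed.
print(tokenizer.convert_tokens_to_ids(["[unused7]", "[unused57]"]))

# Caveat: tokenize() runs the raw string through the basic tokenizer and
# WordPiece, which splits on the brackets, so markers should be spliced
# into the token list directly and converted with convert_tokens_to_ids().
print(tokenizer.tokenize("[unused7]"))  # e.g. ['[', 'unused', '##7', ']']
```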
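For answer 3, a sketch of the standard transformers pattern for adding markers to RoBERTa manually. This is an assumption about how you could extend it, not something the current DeepKE code does:

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# RoBERTa's BPE vocabulary has no reserved [unusedX] slots, so typed
# markers must be registered as new special tokens...
markers = [f"[unused{i}]" for i in range(100)]
tokenizer.add_special_tokens({"additional_special_tokens": markers})

# ...and the embedding matrix must grow so the new ids get vectors, which
# are randomly initialized and then learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))
```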
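For answer 4, here is a simplified, hypothetical sketch of the idea behind process_long_input (the real function also handles special tokens and attention outputs): encode two overlapping 512-token windows and average the overlap, which is why anything longer than 1024 tokens has to be truncated first.

```python
import torch

def encode_long_input(model, input_ids, attention_mask, max_len=512):
    """Encode up to 2 * max_len tokens with an encoder limited to max_len.

    Simplified illustration only: the first and last max_len tokens are
    encoded separately and the doubly-covered overlap is averaged.
    """
    seq_len = input_ids.size(1)
    if seq_len <= max_len:
        return model(input_ids=input_ids, attention_mask=attention_mask)[0]
    assert seq_len <= 2 * max_len, "longer inputs must be truncated first"
    first = model(input_ids=input_ids[:, :max_len],
                  attention_mask=attention_mask[:, :max_len])[0]
    second = model(input_ids=input_ids[:, -max_len:],
                   attention_mask=attention_mask[:, -max_len:])[0]
    hidden = first.new_zeros(input_ids.size(0), seq_len, first.size(-1))
    hidden[:, :max_len] += first
    hidden[:, -max_len:] += second
    hidden[:, seq_len - max_len:max_len] /= 2  # average the overlap
    return hidden
```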