wtangdev / UniRel

released code for our EMNLP22 paper: UniRel: Unified Representation and Interaction for Joint Relational Triple Extraction
Apache License 2.0

About subj_tok_span and obj_tok_span #13

Closed Ariel-lu closed 1 year ago

Ariel-lu commented 1 year ago

Hi, I'd like to ask: are the entity position indices stored in subj_tok_span and obj_tok_span in the dataset the indices after the sentence is tokenized? If so, do those post-tokenization indices account for the [CLS] position? And are the token indices left-closed, right-open?

muqishan commented 1 year ago

"In fact, punctuation marks introduced by tokenization are not counted toward the length, and the index is indeed left-closed and right-open. This is because the author added 1 to the token-level position in the code, so when you convert a character-level position to a token position you should also subtract 1." Here is my implementation:

from transformers import AutoTokenizer

# return_offsets_mapping requires a *fast* tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def calculate_tok_span(text, char_span):
    """Calculate the token span based on the char span."""
    # Trim leading/trailing spaces out of the character span
    span_text = text[char_span[0]:char_span[1]]
    trimmed_span_text = span_text.strip()
    start_adjustment = span_text.index(trimmed_span_text)
    end_adjustment = len(span_text) - len(trimmed_span_text) - start_adjustment
    char_span = (char_span[0] + start_adjustment, char_span[1] - end_adjustment)
    encoding = tokenizer.encode_plus(text, return_offsets_mapping=True, add_special_tokens=True)
    offset_mapping = encoding["offset_mapping"]
    start_token, end_token = None, None
    for idx, (start, end) in enumerate(offset_mapping):
        # First token whose character range contains the span start
        if start_token is None and start <= char_span[0] < end:
            start_token = idx
        # Token whose character range contains the span end
        if start < char_span[1] <= end:
            end_token = idx + 1  # +1 makes the token span right-open
            break
    # Fall back when only part of a token is annotated
    if end_token is None:
        end_token = start_token + 1
    # Subtract 1 to drop the [CLS] offset introduced by add_special_tokens=True
    return [start_token - 1, end_token - 1]
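To see the char-span-to-token-span logic in isolation, here is a minimal self-contained sketch of the same mapping loop. A hypothetical whitespace tokenizer stands in for the BERT fast tokenizer's offset_mapping (the helper names are my own, not from the repo); token 0 is a [CLS] placeholder with offsets (0, 0), mirroring add_special_tokens=True, and the final subtraction of 1 drops it again.

```python
def whitespace_offsets(text):
    """Build an offset_mapping like a fast tokenizer's: [CLS] first,
    then (start, end) character offsets per whitespace-split word, then [SEP]."""
    offsets = [(0, 0)]  # [CLS] placeholder
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    offsets.append((0, 0))  # [SEP] placeholder
    return offsets

def char_to_tok_span(text, char_span):
    """Map a left-closed, right-open char span to a token span,
    then shift by -1 so indices are relative to the text without [CLS]."""
    start_token, end_token = None, None
    for idx, (start, end) in enumerate(whitespace_offsets(text)):
        if start == end:  # skip special-token placeholders
            continue
        if start_token is None and start <= char_span[0] < end:
            start_token = idx
        if start < char_span[1] <= end:
            end_token = idx + 1  # +1 makes the token span right-open
            break
    if end_token is None and start_token is not None:
        end_token = start_token + 1
    # Drop the [CLS] offset, as in the answer above
    return [start_token - 1, end_token - 1]

text = "Barack Obama was born in Hawaii"
print(char_to_tok_span(text, (0, 12)))   # "Barack Obama" -> [0, 2]
print(char_to_tok_span(text, (25, 31)))  # "Hawaii" -> [5, 6]
```

With a real BERT tokenizer a word may split into several subword tokens, so the resulting spans can be wider, but the containment checks and the -1 shift work the same way.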