wtangdev / UniRel

released code for our EMNLP22 paper: UniRel: Unified Representation and Interaction for Joint Relational Triple Extraction
Apache License 2.0

About subj_tok_span and obj_tok_span #13

Closed Ariel-lu closed 1 year ago

Ariel-lu commented 1 year ago

Hi, I'd like to ask: are the entity position indices stored in subj_tok_span and obj_tok_span in the dataset the indices after the sentence is tokenized? If so, do those post-tokenization indices account for the [CLS] position? And are the token indices left-closed, right-open?

muqishan commented 1 year ago

"In fact, punctuation marks introduced by tokenization are not counted toward the length, and the index is indeed left-closed and right-open. This is because the author added 1 to the token-level position in the code, so when you convert a character-level position to a token position you should also subtract 1." Here is my implementation:

from transformers import AutoTokenizer

# return_offsets_mapping requires a *fast* tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def calculate_tok_span(text, char_span):
    """Calculate the token span based on the char span."""
    # Trim leading/trailing spaces out of the character span
    span_text = text[char_span[0]:char_span[1]]
    trimmed_span_text = span_text.strip()
    start_adjustment = span_text.index(trimmed_span_text)
    end_adjustment = len(span_text) - len(trimmed_span_text) - start_adjustment
    char_span = (char_span[0] + start_adjustment, char_span[1] - end_adjustment)
    encoding = tokenizer.encode_plus(text, return_offsets_mapping=True, add_special_tokens=True)
    offset_mapping = encoding["offset_mapping"]
    start_token, end_token = None, None
    for idx, (start, end) in enumerate(offset_mapping):
        # First token whose character range contains the span start
        if start_token is None and start <= char_span[0] < end:
            start_token = idx
        # Token whose character range contains the span end
        if start < char_span[1] <= end:
            end_token = idx + 1  # +1 makes the token span right-open
            break
    # Fall back when only part of a token is annotated
    if end_token is None:
        end_token = start_token + 1
    # Subtract 1 to drop the [CLS] offset introduced by add_special_tokens=True
    return [start_token - 1, end_token - 1]
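To see the char-span-to-token-span logic in isolation, here is a minimal self-contained sketch of the same mapping loop. A hypothetical whitespace tokenizer stands in for the BERT fast tokenizer's offset_mapping (the helper names are my own, not from the repo); token 0 is a [CLS] placeholder with offsets (0, 0), mirroring add_special_tokens=True, and the final subtraction of 1 drops it again.

```python
def whitespace_offsets(text):
    """Build an offset_mapping like a fast tokenizer's: [CLS] first,
    then (start, end) character offsets per whitespace-split word, then [SEP]."""
    offsets = [(0, 0)]  # [CLS] placeholder
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    offsets.append((0, 0))  # [SEP] placeholder
    return offsets

def char_to_tok_span(text, char_span):
    """Map a left-closed, right-open char span to a token span,
    then shift by -1 so indices are relative to the text without [CLS]."""
    start_token, end_token = None, None
    for idx, (start, end) in enumerate(whitespace_offsets(text)):
        if start == end:  # skip special-token placeholders
            continue
        if start_token is None and start <= char_span[0] < end:
            start_token = idx
        if start < char_span[1] <= end:
            end_token = idx + 1  # +1 makes the token span right-open
            break
    if end_token is None and start_token is not None:
        end_token = start_token + 1
    # Drop the [CLS] offset, as in the answer above
    return [start_token - 1, end_token - 1]

text = "Barack Obama was born in Hawaii"
print(char_to_tok_span(text, (0, 12)))   # "Barack Obama" -> [0, 2]
print(char_to_tok_span(text, (25, 31)))  # "Hawaii" -> [5, 6]
```

With a real BERT tokenizer a word may split into several subword tokens, so the resulting spans can be wider, but the containment checks and the -1 shift work the same way.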