Closed Ariel-lu closed 1 year ago
> "In fact, the punctuation marks after tokenization will not be counted toward the length, and the index is indeed left-open and right-closed. This is because the author added 1 to the token-level position in the code. Therefore, when you convert the character-level position to the token position, you should also subtract 1."

Here is my implementation of this:
```python
def calculate_tok_span(text, char_span):
    """Calculate the token span based on the char span."""
    # Adjust char_span to remove leading and trailing spaces
    span_text = text[char_span[0]:char_span[1]]
    trimmed_span_text = span_text.strip()
    start_adjustment = span_text.index(trimmed_span_text)
    end_adjustment = len(span_text) - len(trimmed_span_text) - start_adjustment
    char_span = (char_span[0] + start_adjustment, char_span[1] - end_adjustment)

    # `tokenizer` is assumed to be a fast HuggingFace tokenizer defined elsewhere
    encoding = tokenizer.encode_plus(text, return_offsets_mapping=True, add_special_tokens=True)
    offset_mapping = encoding["offset_mapping"]

    start_token, end_token = None, None
    for idx, (start, end) in enumerate(offset_mapping):
        # Check for the start of the entity
        if start_token is None and start <= char_span[0] < end:
            start_token = idx
        # Check for the end of the entity
        if start < char_span[1] <= end:
            end_token = idx + 1  # +1 to make it exclusive
            break

    # This handles cases where only a part of the token is annotated
    if end_token is None:
        end_token = start_token + 1

    # Subtract 1 to undo the author's +1 shift for the [CLS] token
    return [start_token - 1, end_token - 1]
```
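For what it's worth, the offset-search loop above can be checked in isolation with a hand-built offset mapping, without loading a tokenizer. This is only a sketch: the function name and the offsets below are hypothetical, chosen to mimic what `return_offsets_mapping=True` produces for a fast tokenizer, with `(0, 0)` entries standing in for the `[CLS]` and `[SEP]` special tokens.

```python
def char_span_to_tok_span(offset_mapping, char_span):
    # Same search as in calculate_tok_span, isolated for testing:
    # find the tokens whose character offsets cover char_span.
    start_token, end_token = None, None
    for idx, (start, end) in enumerate(offset_mapping):
        if start_token is None and start <= char_span[0] < end:
            start_token = idx
        if start < char_span[1] <= end:
            end_token = idx + 1  # exclusive end
            break
    if end_token is None and start_token is not None:
        end_token = start_token + 1
    # Subtract 1 to shift out the [CLS] offset, matching the dataset convention
    return [start_token - 1, end_token - 1]

# Hypothetical offsets for "Alice likes Bob" tokenized as
# [CLS] Alice likes Bob [SEP]; special tokens map to (0, 0).
offsets = [(0, 0), (0, 5), (6, 11), (12, 15), (0, 0)]
print(char_span_to_tok_span(offsets, (0, 5)))    # [0, 1] -> "Alice"
print(char_span_to_tok_span(offsets, (12, 15)))  # [2, 3] -> "Bob"
```

Note that fast tokenizers also expose `encoding.char_to_token(...)`, which does a similar character-to-token lookup and may be worth comparing against.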
Hello, I'd like to ask: are the entity position indices stored in subj_tok_span and obj_tok_span in the dataset indices into the tokenized sentence? If so, do the tokenized indices account for the position of [CLS], and are the token indices left-closed and right-open?