xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

max_sequence_length and splitting text into smaller chunks #77

Closed dassaswat closed 7 months ago

dassaswat commented 10 months ago

I am trying to push embeddings to a vector database, and before encoding documents with Instructor I want to chunk them into pieces of fewer than 512 tokens. I am using langchain. With `RecursiveCharacterTextSplitter` in langchain we need to pass a `length_function: Callable[[str], int] = len`.

How do I write this length function? How do I access the tokenizer? Currently the only way I can access the tokenizer is after initialising the model.

Example implementation with the OpenAI embedding model:

import tiktoken

# Tokenizer matching OpenAI's embedding models
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    """Length function: number of tokens, not characters."""
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
hongjin-su commented 10 months ago

Hi, you may use the following code snippet to calculate text length:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
print(len(tokenizer('hello, world')['input_ids']))