xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

max_sequence_length and splitting text into smaller chunks #77

Closed dassaswat closed 7 months ago

dassaswat commented 10 months ago

I am trying to push embeddings to a vector database, and before encoding documents with Instructor I want to chunk them into pieces of fewer than 512 tokens. I am using langchain. With `RecursiveCharacterTextSplitter` in langchain we need to pass a `length_function: Callable[[str], int] = len`.

How do I write this length function? How do I access the tokenizer? Currently the only way I can access the tokenizer is after initialising the model.

Example implementation with the OpenAI embedding model:

import tiktoken

# Tokenizer matching OpenAI's embedding models
tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    """Length function: number of tokens, not characters."""
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)
hongjin-su commented 10 months ago

Hi, you may use the following code snippet to calculate text length:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
print(len(tokenizer('hello, world')['input_ids']))