Closed dassaswat closed 12 months ago
Hi, you may use the following code snippet to calculate text length:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
print(len(tokenizer('hello, world')['input_ids']))
```
So I am trying to push embeddings to a vector database, and before encoding documents with Instructor I want to chunk them into pieces of fewer than 512 tokens. I am using langchain, and its RecursiveCharacterTextSplitter takes a `length_function: Callable[[str], int] = len`.
How do I write this length function? How do I access the tokenizer? Right now, the only way I can get at the tokenizer is after initialising the model.
Example implementation with the OpenAI embedding model:
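The maintainer's snippet above already covers the hard part: the tokenizer can be loaded directly with `AutoTokenizer`, no model initialisation needed. A minimal sketch of a token-based `length_function` built on it (this assumes the `transformers` package is installed and the `hkunlp/instructor-large` tokenizer can be fetched from the Hugging Face hub):

```python
from transformers import AutoTokenizer

# Load only the tokenizer; the full Instructor model is not needed
# just to measure chunk lengths.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

def token_length(text: str) -> int:
    # Count tokens the way the model will see them, including any
    # special tokens the tokenizer adds.
    return len(tokenizer(text)['input_ids'])

print(token_length('hello, world'))
```

You can then pass `length_function=token_length` to `RecursiveCharacterTextSplitter`, with `chunk_size` (and `chunk_overlap`) now measured in tokens rather than characters. Some langchain versions also expose a `from_huggingface_tokenizer` helper on the text splitters that does this wiring for you, if yours has it.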