Closed vyau closed 6 months ago
Hi, Thanks a lot for your interest in the INSTRUCTOR!
The input text will be truncated if its length exceeds 512 tokens.
@hongjin-su Does that mean that if a chunk of 1,000 tokens is passed to the embedding model, the remaining 488 tokens are lost? And how would you generate embeddings for a long text or a document with multiple pages?
See here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L321. You can then look into the tokenizers docs and find that any token beyond the 512th is discarded.
Also, the unit used for the sequence length is tokens, not bytes or characters.
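A common workaround for inputs longer than 512 tokens is to split them into chunks that each fit within the limit, embed each chunk separately, and then pool (e.g. average) the chunk embeddings. This is not a built-in INSTRUCTOR feature, just a sketch of the chunking step; `chunk_token_ids` is a hypothetical helper, and in practice you would run each decoded chunk through `model.encode` with your instruction.

```python
# Sketch: split a long tokenized input into chunks of at most 512 tokens,
# so nothing is silently dropped by truncation. The model call itself is
# omitted; only the chunking logic is shown.

def chunk_token_ids(token_ids, max_len=512):
    """Split a list of token ids into consecutive chunks of at most max_len."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

# A 1,000-token input yields one 512-token chunk and one 488-token chunk,
# instead of losing the last 488 tokens to truncation.
ids = list(range(1000))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])  # → [512, 488]
```

Each chunk can then be embedded on its own and the resulting vectors mean-pooled into a single document embedding; whether pooling or per-chunk retrieval works better depends on your downstream task.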
Thanks a lot for the reply!
@hynky1999 Feel free to re-open the issue if you have any further questions or comments!
Hi Instructor team: when I feed content into INSTRUCTOR to generate embeddings, I see this in stdout:
max_seq_length 512
I assume that means there is an input cap of 512 bytes? What happens if my input is larger than that? Thanks