xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

what is the max input size? #72

Closed vyau closed 6 months ago

vyau commented 11 months ago

Hi Instructor team: When I feed content into Instructor to generate embeddings, I see this in stdout:

max_seq_length 512

I assume that means there is an input cap of 512 bytes? What happens if my input is larger than that? Thanks

hongjin-su commented 11 months ago

Hi, thanks a lot for your interest in INSTRUCTOR!

The input text will be truncated if its length exceeds 512 tokens.
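To make the effect of truncation concrete, here is a minimal sketch. It uses a plain list of integers as stand-in token IDs rather than real tokenizer output, and `truncate_ids` is a hypothetical helper written for illustration, not a function from the library.

```python
MAX_SEQ_LENGTH = 512  # the value printed as max_seq_length in the question above


def truncate_ids(token_ids, max_len=MAX_SEQ_LENGTH):
    """Keep the first max_len token IDs and report how many were dropped.

    Hypothetical helper: it only mimics what truncation does to a sequence.
    """
    kept = token_ids[:max_len]
    dropped = len(token_ids) - len(kept)
    return kept, dropped


# Stand-in for the token IDs of a long input (assumption: 1000 tokens).
fake_ids = list(range(1000))
kept, dropped = truncate_ids(fake_ids)
print(len(kept), dropped)  # 512 488
```

Everything after the first 512 tokens never reaches the model, so it has no influence on the resulting embedding.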

amitduwal commented 11 months ago

@hongjin-su Does that mean that if chunks of 1000 tokens are passed to the embedding model, the remaining 488 tokens are lost? And how would you generate embeddings for a long text or a document with multiple pages?

hynky1999 commented 11 months ago

See here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L321. You can then look into the tokenizers docs and see that any tokens beyond 512 are dropped.

hynky1999 commented 11 months ago

Also, the unit used for sequence length is tokens, not bytes or characters.
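For the long-document question, one common workaround (not something the INSTRUCTOR authors recommend in this thread) is to split the token sequence into windows that fit the limit, embed each window separately, and combine the vectors, for example by averaging. The sketch below shows only the splitting and averaging logic on plain Python lists. `split_into_windows` and `mean_pool` are hypothetical helpers, the token IDs and vectors are made-up stand-ins, and in practice the windows would come from the model's own tokenizer.

```python
def split_into_windows(token_ids, max_len=512, overlap=0):
    """Split a token ID list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens. Hypothetical helper.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows


def mean_pool(vectors):
    """Average equal-length vectors element-wise into one vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Stand-in token IDs for a 1000-token chunk (assumption, not real output).
windows = split_into_windows(list(range(1000)), max_len=512)
print([len(w) for w in windows])  # [512, 488]

# Stand-in per-window embeddings (assumption: 2-dimensional for brevity).
doc_vector = mean_pool([[1.0, 3.0], [3.0, 5.0]])
print(doc_vector)  # [2.0, 4.0]
```

A small overlap between windows can reduce the chance of cutting a sentence in half, at the cost of embedding some tokens twice. The usable window may also need to be somewhat smaller than 512 if special tokens or other added text count toward the limit.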

hongjin-su commented 6 months ago

Thanks a lot for the reply! @hynky1999 Feel free to re-open the issue if you have any further questions or comments!