xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Input Length / Accuracy #29

Closed bitnom closed 1 year ago

bitnom commented 1 year ago

Do you have any data on performance across a range of input lengths? I'm working on neural search, and I came across instructor-xl as a potential replacement for text-embedding-ada-002, which has a context window of 8,191 tokens. Can instructor-xl handle that length without degrading? Any longer?

Issue #12 touched on this but didn't provide many details.

My immediate use is cosine similarity for search, but I also need clustering and categorization. Any info you can share about context length in relation to these use cases would be super helpful and appreciated.
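
For context, this is roughly the retrieval setup I have in mind (a minimal sketch following the README's encode API; the instruction strings and document texts are just placeholders I made up):

```python
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity

model = INSTRUCTOR('hkunlp/instructor-xl')

# Instruction/text pairs; these instruction strings are only examples.
query = [["Represent the question for retrieving supporting documents:",
          "How long an input can instructor-xl embed?"]]
docs = [
    ["Represent the document for retrieval:", "First candidate document ..."],
    ["Represent the document for retrieval:", "Second candidate document ..."],
]

query_emb = model.encode(query)
doc_embs = model.encode(docs)

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_emb, doc_embs)
print(scores)
```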


For anyone else reading this and trying to compare the model to ada, here's some related discussion: https://github.com/UKPLab/sentence-transformers/issues/1897

and related benchmarking: https://huggingface.co/spaces/mteb/leaderboard

hongjin-su commented 1 year ago

Hi, thanks a lot for your interest in the INSTRUCTOR model!

In theory, the maximum length can be increased to a large value, but we have not tried the INSTRUCTOR model on documents with several thousand tokens. Feel free to post more use cases here for further discussion!

jlia0 commented 1 year ago

@hongjin-su How far can we increase the maximum sequence length? Can we increase it without re-training?

Harry-hash commented 1 year ago

@jlia0 Thanks a lot for your comments!

You may increase the maximum sequence length somewhat without re-training, e.g. to 768 tokens. However, if the sequence length is too long, you may see low efficiency (self-attention in the transformer architecture has O(n^2) time complexity in sequence length) and a slight performance drop.
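
For anyone who wants to try this, here is a minimal sketch (it relies on the SentenceTransformer-style `max_seq_length` attribute that the `INSTRUCTOR` class inherits; the 768 value just mirrors the example above, not a recommended setting):

```python
from InstructorEmbedding import INSTRUCTOR

# Load the checkpoint discussed in this issue.
model = INSTRUCTOR('hkunlp/instructor-xl')

# Current maximum sequence length; tokens beyond it are truncated.
print(model.max_seq_length)

# Raise the limit moderately, e.g. to 768 tokens, without re-training.
# Longer inputs cost O(n^2) attention time and may degrade quality slightly.
model.max_seq_length = 768

embeddings = model.encode([
    ["Represent the document for retrieval:",
     "A document with a few hundred tokens ..."],
])
print(embeddings.shape)
```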

Feel free to add any further questions or comments!

hongjin-su commented 1 year ago

Feel free to re-open the issue if you have any further questions or comments!