Hi, Thanks a lot for your question!
By default, the maximum sequence length is 512. You can change that via the max_seq_length attribute. For example:
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256
print(model.max_seq_length)
Hope this helps! Feel free to leave any further comments or questions!
Hey, thank you for your answer. However, I can see that changing the max_seq_length parameter doesn't seem to be reflected in the output. For example, if I set it to 0 it will still give me an embedding of (1, 768), and even if I set it to a number like 4096 it doesn't seem to be reflected in the output.
from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 256
print(model.max_seq_length)
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings.shape)
Hi, Thanks a lot for your question!
Our INSTRUCTOR calculates the sentence embedding, which is the average of token embeddings in the input text. For the embedding shape (1,768) in your example, 1 refers to the number of sentences you encode, and 768 refers to the embedding dimension.
To inspect the actual length of the sequence being encoded, you may print out the shape of token_embeddings here: https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L103.
Tips:
To print out the shape of token_embeddings, one way is to install the InstructorEmbedding package from source via
pip install -e .
and add the following line after line 103, i.e., https://github.com/HKUNLP/instructor-embedding/blob/main/InstructorEmbedding/instructor.py#L103:
print("The shape of token embeddings: ", token_embeddings.shape)
Then you will be able to see the sequence length.
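As a rough alternative that does not require editing the package, you could also count the tokens with the underlying tokenizer. This is only a sketch: it assumes the tokenizer files are hosted under the same model id on the Hub, and the concatenation below only approximates how the instruction and sentence are combined internally.
from transformers import AutoTokenizer

# Assumed: the T5-based tokenizer files ship with the hkunlp/instructor-large repo on the Hub.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"

# INSTRUCTOR encodes the instruction together with the sentence, so count both.
n_tokens = len(tokenizer(instruction + " " + sentence)["input_ids"])
print("tokens in instruction + sentence:", n_tokens)

# The output shape stays (1, 768) regardless of max_seq_length: the token embeddings
# are mean-pooled into a single 768-dimensional vector, and max_seq_length only caps
# how many tokens enter that average.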
Hope this helps! Feel free to leave any further questions or comments!
Feel free to re-open this issue and add any following comments!
@Harry-hash does this mean that InstructOR can accept sequences of any length, because everything is mean pooled into the embedding dimension in the end? It feels like there should still be a limit governed by the model's architecture or computational resources.
Hi, thanks a lot for your comments!
Theoretically, the model can embed sequences of any length. However, since the model is not particularly pre-trained on long sequences, the performance may drop significantly on extremely long inputs. In addition, due to the O(n^2) computational complexity inside the transformer model, efficiency also drops as the input length increases. Therefore, for extremely long sequences, e.g., over 10k tokens, we suggest chunking the text first before calculating embeddings with the INSTRUCTOR model.
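For example, a minimal sketch of that chunking idea (the word-based chunking, the chunk size, and the final averaging are just illustrative choices, not something prescribed by the library):
from InstructorEmbedding import INSTRUCTOR
import numpy as np

model = INSTRUCTOR('hkunlp/instructor-large')
instruction = "Represent the Science document:"

# A long document (here just a repeated phrase as a stand-in).
long_text = " ".join(["wearable person tracking in multi-floor environments"] * 500)

# Naive word-based chunking; in practice you may prefer to chunk by tokens or sentences.
words = long_text.split()
chunk_size = 300  # chosen to stay comfortably under the 512-token limit for typical English text
chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Embed every chunk with the same instruction.
chunk_embeddings = model.encode([[instruction, chunk] for chunk in chunks])

# One simple option for a single document vector is to average the chunk embeddings.
doc_embedding = np.mean(chunk_embeddings, axis=0)
print(chunk_embeddings.shape, doc_embedding.shape)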
Hope this helps!
@Harry-hash very helpful, thanks for the quick response!
Feel free to re-open the issue if you have any questions or comments!
@Harry-hash does this mean that InstructOR can accept sequences of any length, because everything is mean pooled into the embedding dimension in the end? It feels like there should still be a limit governed by the model's architecture or computational resources.
Can someone explain to me why this ^ does not work, then? It may not be pre-trained on long sequences, but you can always chunk them, and since you are going to average the embeddings afterwards it shouldn't make a difference? What am I missing here?
@OneCodeToRuleThemAll if you take entire books and average them into the average of their tokens, it's going to trend toward the same thing (the frequency of words in the language). So the weights are tuned/trained on a particular distribution of values, and averaging way more tokens will shift the distribution from the distribution of the typical few tokens in the training set toward the distribution of tokens in the entire language.
For a totally different way to think about it: you can mix colors randomly, and maybe the model is trained to find the different colors that were mixed. But if you mix more and more colors, it will trend toward the same thing, so all your documents will end up with similar values for their token embeddings.
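Here is a rough numerical sketch of that intuition, with random vectors standing in for token embeddings (purely illustrative, nothing to do with the actual model weights):
import numpy as np

rng = np.random.default_rng(0)
dim = 768

# Give all "token embeddings" a shared, non-zero mean, standing in for
# the overall distribution of tokens in the language.
language_mean = rng.normal(size=dim)

def doc_embedding(n_tokens):
    # Each document vector is the mean of its token vectors: shared mean + per-token noise.
    tokens = language_mean + 3.0 * rng.normal(size=(n_tokens, dim))
    return tokens.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

short_a, short_b = doc_embedding(20), doc_embedding(20)
long_a, long_b = doc_embedding(5000), doc_embedding(5000)

print("two short docs:", cosine(short_a, short_b))  # well below 1: still distinguishable
print("two long docs: ", cosine(long_a, long_b))    # very close to 1: nearly identical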
@AwokeKnowing
@OneCodeToRuleThemAll if you take entire books and average them into the average of their tokens, it's going to trend toward the same thing (the frequency of words in the language). So the weights are tuned/trained on a particular distribution of values, and averaging way more tokens will shift the distribution from the distribution of the typical few tokens in the training set toward the distribution of tokens in the entire language. For a totally different way to think about it: you can mix colors randomly, and maybe the model is trained to find the different colors that were mixed. But if you mix more and more colors, it will trend toward the same thing, so all your documents will end up with similar values for their token embeddings.
Thank you for the explanation. Follow-up question: averaging way more tokens will shift the distribution of the typical few tokens in the training set toward the distribution of tokens in the entire language. That is, I'm guessing, only if the 'way more tokens' number is really large. Let's say we have 5-10 pages of text. If we want to add more text/context, you say that by adding 20-30 books we are going to shift from the typical distribution to the distribution of the entire language, and everything will then be similar and we won't be able to distinguish anything.
But what if there is a sweet spot (this is just an idea): add maybe 1-2 books, or 10 or 20 more pages, so as not to shift these values too much, but still shift them enough to capture the meaning of the extra pages. Does what I'm saying make any sense? If you have any papers or anything else I can read about this, let me know; I'm interested.
@Harry-hash, sorry to reopen this topic with what I think is a basic question: is the sequence length under discussion here in characters, words, tokens, or some other unit? E.g., when splitting text, should we be splitting it into chunks less than or equal to 512 characters, words, tokens, etc.?
Thanks for the detailed answers to date!
I think I can answer the above from line 242 in instructor.py: max_seq_length is passed to AutoTokenizer, and in the context of AutoTokenizer the maximum length is in terms of tokens, so max_seq_length=512 limits the total number of tokens (not words, characters, etc.) to 512.
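For example, a quick sanity check (assuming the tokenizer files are hosted under the same model id on the Hub) shows how the three counts differ for the same string:
from transformers import AutoTokenizer

# Assumed: the tokenizer files ship with the hkunlp/instructor-large repo on the Hub.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

text = "3D ActionSLAM: wearable person tracking in multi-floor environments"
n_tokens = len(tokenizer(text)["input_ids"])

print("characters:", len(text))
print("words     :", len(text.split()))
print("tokens    :", n_tokens)  # this is the count that max_seq_length limits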
I can see that the maximum input length is set to 512; how can I change that? And is a sequence length of more than 512 supported? What is the maximum sequence length supported?