xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0
1.85k stars 134 forks

Remove code that overrides max_seq_length #54

Closed michael-quinlan closed 1 year ago

michael-quinlan commented 1 year ago

max_seq_length was being hardcoded to 512 rather than being loaded from the config or derived by the code in the `is None` block. In fact, the following `if` could never be reached:

if max_seq_length is None:
            if hasattr(self.auto_model, "config") and hasattr(self.auto_model.config, "max_position_embeddings") and hasattr(self.tokenizer, "model_max_length"):
                max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

The default code loads the correct sequence length of 512, so this hardcoding is not required. I verified this by running the inference examples and noting that the print on line 252, print('max_seq_length ', max_seq_length), shows 512.

I also verified that the similarities produced by running the following code from the README are identical to those from the release version on PyPI:

from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter'], 
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)

which outputs:

array([[0.81227076, 0.7351362 ],
       [0.6770725 , 0.81411076]], dtype=float32)
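For context, `cosine_similarity` here computes pairwise cosine similarities between the rows of the two embedding matrices. A minimal NumPy equivalent, shown on toy 2-D vectors rather than the actual INSTRUCTOR embeddings:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    # L2-normalize the rows, then pairwise dot products give cosine similarity,
    # matching sklearn's cosine_similarity for non-zero rows.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy "embeddings" standing in for embeddings_a / embeddings_b.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_b = np.array([[1.0, 1.0], [1.0, 0.0]])
print(cosine_sim_matrix(emb_a, emb_b))
```

Entry (i, j) is the similarity between row i of the first matrix and row j of the second, which is why matching science/finance pairs land on the diagonal above.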

Although this change probably won't affect anyone running the default version of the code (since the 512 override matches what is loaded), I think it's worth fixing now, either by merging this PR or by the authors making the fix and verifying it end-to-end.

Ethan-Chen-plus commented 1 year ago

@michael-quinlan Hi! Could we change the max_length when we fine-tune on our own data with long text?