xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0
1.85k stars 134 forks

Remove code that overrides max_seq_length #54

Closed michael-quinlan closed 1 year ago

michael-quinlan commented 1 year ago

max_seq_length was being hardcoded to 512 rather than being loaded from the config or derived by the code in the `is None` block. In fact, the following `if` could never be reached:

if max_seq_length is None:
            if hasattr(self.auto_model, "config") and hasattr(self.auto_model.config, "max_position_embeddings") and hasattr(self.tokenizer, "model_max_length"):
                max_seq_length = min(self.auto_model.config.max_position_embeddings, self.tokenizer.model_max_length)

The default code loads the correct sequence length of 512, so this hardcoding is not required. I verified this by running the inference examples and noting that the print on line 252, print('max_seq_length ', max_seq_length), shows 512.

I also verified that the similarities produced by running the following code from the README are identical to those from the release version on PyPI:

from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentences_a = [['Represent the Science sentence: ','Parton energy loss in QCD matter'], 
               ['Represent the Financial statement: ','The Federal Reserve on Wednesday raised its benchmark interest rate.']]
sentences_b = [['Represent the Science sentence: ','The Chiral Phase Transition in Dissipative Dynamics'],
               ['Represent the Financial statement: ','The funds rose less than 0.5 per cent on Friday']]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
similarities = cosine_similarity(embeddings_a,embeddings_b)

which outputs:

array([[0.81227076, 0.7351362 ],
       [0.6770725 , 0.81411076]], dtype=float32)
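For context, `cosine_similarity` here computes pairwise cosine similarities between the rows of the two embedding matrices. A minimal NumPy equivalent, shown on toy 2-D vectors rather than the actual INSTRUCTOR embeddings:

```python
import numpy as np

def cosine_sim_matrix(a, b):
    # L2-normalize the rows, then pairwise dot products give cosine similarity,
    # matching sklearn's cosine_similarity for non-zero rows.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy "embeddings" standing in for embeddings_a / embeddings_b.
emb_a = np.array([[1.0, 0.0], [0.0, 1.0]])
emb_b = np.array([[1.0, 1.0], [1.0, 0.0]])
print(cosine_sim_matrix(emb_a, emb_b))
```

Entry (i, j) is the similarity between row i of the first matrix and row j of the second, which is why matching science/finance pairs land on the diagonal above.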

Although this change probably won't affect anyone running the default version of the code (since the 512 override matches what is loaded), I think it's worth fixing now, either by merging this PR or by the authors making the fix and verifying it end-to-end.

Ethan-Chen-plus commented 1 year ago

@michael-quinlan Hi! Could we change the max_length when we fine-tune on our own data with long text?