xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0
1.78k stars, 131 forks

Inconsistency between the instruction template suggested vs that in the training data #96

Open debraj135 opened 8 months ago

debraj135 commented 8 months ago

I noticed that the instructions in the training data end with a semicolon (;) and have no whitespace after it.

For example, 'Represent the Science sentence;' instead of 'Represent the Science sentence: '

In the README, however, the proposed format is 'Represent the Science sentence: ' in some places and 'Represent the Science sentence:' in others.

All three variants produce different embeddings and therefore different similarity scores. Can you please let us know which instruction template is the right one?
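
A minimal sketch of how the three variants could be compared side by side. The `encode` argument is a hypothetical callable that maps an (instruction, text) pair to a vector; with the README's API it could wrap something like `model.encode([[instruction, text]])[0]`. The helper names `cosine` and `compare_variants` are illustrative and not part of the library.

```python
import math
from itertools import combinations

# The three instruction variants discussed in this issue.
VARIANTS = [
    "Represent the Science sentence;",
    "Represent the Science sentence: ",
    "Represent the Science sentence:",
]

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def compare_variants(encode, text, variants=VARIANTS):
    """Embed `text` under each instruction and return pairwise similarities.

    `encode` is an assumed callable: encode(instruction, text) -> sequence of floats.
    """
    vectors = {v: encode(v, text) for v in variants}
    return {
        (a, b): cosine(vectors[a], vectors[b])
        for a, b in combinations(variants, 2)
    }
```

Any pairwise value noticeably below 1.0 would confirm that the two variants yield different embeddings for the same text.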

debraj135 commented 7 months ago

Wondering if I'm missing a detail. Did anyone else also come across this?

hongjin-su commented 7 months ago

Thanks a lot for your interest in INSTRUCTOR!

Like other LLMs, INSTRUCTOR is sensitive to the wording of its instructions, and its small size may make this worse. I would say all of the instructions you proposed follow the basic template, but finding the best one may take more trials or heuristics.
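
One way such trials could be run is sketched below, assuming a small set of labeled text pairs is available. Each candidate instruction is scored by how well it separates related pairs from unrelated ones. The `encode` callable, the pair format, and the function names are all assumptions made for illustration. This is not the procedure used by the authors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def separation(encode, instruction, pairs):
    """Mean similarity of related pairs minus mean similarity of unrelated pairs.

    `pairs` is a list of (left_text, right_text, related) tuples and must
    contain at least one related and one unrelated pair.
    """
    related_sims, unrelated_sims = [], []
    for left, right, related in pairs:
        sim = cosine(encode(instruction, left), encode(instruction, right))
        (related_sims if related else unrelated_sims).append(sim)
    return (sum(related_sims) / len(related_sims)
            - sum(unrelated_sims) / len(unrelated_sims))

def best_instruction(encode, candidates, pairs):
    """Return the candidate with the largest separation, plus all scores."""
    scores = {c: separation(encode, c, pairs) for c in candidates}
    return max(scores, key=scores.get), scores
```

A larger separation only suggests that a template suits the given pairs; results on a handful of examples may not carry over to other data.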

debraj135 commented 7 months ago

Thank you. I have a few follow-up questions:

  1. Did the instruction given to the model as input during training end with a semicolon (;) or a colon (:)?
  2. Which of the three templates I mentioned above was used for the evaluation on the MTEB leaderboard?

debraj135 commented 6 months ago

Following up on this.

hongjin-su commented 6 months ago

Sorry for the late reply!

In our training and evaluation, we may not have been very strict about punctuation. We would be glad to make it more consistent in future versions!