xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Example of classification inference (Zero Shot?) #79

Closed: jmdetect closed this issue 6 months ago

jmdetect commented 10 months ago

Hello,

The use of Instructor embeddings for classification is mentioned, and I have found a way to sort of do this: I include several ground-truth examples for each of my 7 classes (I have a 7-class problem), and then see what the mean cosine similarity of a new inference sentence is against all of the sample ground truths.
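A minimal sketch of the scoring described above, assuming the mean is taken per class and the class with the highest mean wins. The toy 3-d vectors, the class names, and the helper `mean_cosine_scores` are hypothetical; in practice the vectors would be INSTRUCTOR embeddings of the ground-truth and query sentences.

```python
import numpy as np

def mean_cosine_scores(query, examples_by_class):
    """Return {label: mean cosine similarity between query and that class's examples}."""
    q = np.asarray(query, dtype=float)
    q = q / np.linalg.norm(q)
    scores = {}
    for label, examples in examples_by_class.items():
        ex = np.asarray(examples, dtype=float)
        ex = ex / np.linalg.norm(ex, axis=1, keepdims=True)  # unit-length rows
        scores[label] = float(np.mean(ex @ q))  # mean of per-example cosines
    return scores

# Toy vectors standing in for real sentence embeddings (hypothetical data).
examples_by_class = {
    "billing": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "shipping": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
query = [0.8, 0.3, 0.0]

scores = mean_cosine_scores(query, examples_by_class)
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

Because every class is scored on the same cosine scale, a single global threshold on `scores` is the only obvious knob, which is the limitation raised in this comment.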

The problem is that this gives little control over thresholds, and some natural example sentences simply embed close to most other sentences (such as short ones without many nouns or unique verbs), no matter what instruction you give ("Represent as a statement", etc.).

Are any examples of classification (similar to SetFit) available?

hongjin-su commented 10 months ago

Did you mean including several ground truth examples in the instruction?

jmdetect commented 10 months ago

@hongjin-su thank you for asking about the technique

I have tried: a) Putting information about the classes in the instruction, but with 0 examples (zero shot). Then I gather the embeddings for each of the classes, take the mean embedding per class, and find the closest one to assign a new sample to a specific class.
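Approach (a) is essentially a nearest-centroid classifier. A minimal sketch, assuming cosine similarity as the distance and using toy 2-d vectors in place of real embeddings; the helper names `fit_centroids` and `predict_centroid` are hypothetical:

```python
import numpy as np

def l2_normalize(x):
    """Scale each row (or a single vector) to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def fit_centroids(embeddings, labels):
    """Average the normalized embeddings of each class into one centroid."""
    embeddings = l2_normalize(embeddings)
    labels = np.asarray(labels)
    classes = sorted(set(labels.tolist()))
    centroids = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    return classes, l2_normalize(centroids)

def predict_centroid(query_embeddings, classes, centroids):
    """Assign each query to the class whose centroid has the highest cosine."""
    sims = l2_normalize(query_embeddings) @ centroids.T
    return [classes[i] for i in sims.argmax(axis=1)]

# Toy data (hypothetical); real inputs would be INSTRUCTOR embeddings.
train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["a", "a", "b", "b"]
classes, centroids = fit_centroids(train, train_labels)
predictions = predict_centroid([[0.7, 0.2], [0.2, 0.8]], classes, centroids)
print(predictions)
```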

b) Putting information about binary classification (belongs to a class or not) in the instruction, like the positive/negative instructions found in a code sample for the sentiment task, and then picking the class closest to the ground truths representing positives across each of the 7 classes at inference. That is, I get 7 embeddings, one for each class, compare each to the known positive examples for that class, and the class whose embedding for the new sentence has the smallest cosine distance to a ground-truth positive example becomes the predicted class.

Either approach works fairly OK (well above chance) but has big downsides. The main one is class boundaries. Some classes simply sit in the central embedding area anyway, and for that reason dominate predictions, even if the actual training set has a different distribution. I have worked around this problem before with code equivalent to some class-prior approaches: re-weighting predictions after inference back toward the training set distribution, given how far the predicted distribution on the test set has moved from it. But this is really not an ideal solution.
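One possible reading of that class-prior re-weighting, as a sketch only: turn the similarity scores into probabilities, estimate the classifier's implicit prior as the mean predicted probability over the test batch, and rescale toward the training prior. The softmax temperature, the toy scores, and the function names are all assumptions for illustration, not the code actually used.

```python
import numpy as np

def softmax(scores, temperature=0.1):
    """Row-wise softmax over similarity scores (temperature is an assumed knob)."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def reweight_to_prior(probs, target_prior):
    """Rescale each class by target_prior / observed_prior, then renormalize rows."""
    observed = probs.mean(axis=0)  # implicit prior over the batch
    adjusted = probs * (np.asarray(target_prior, dtype=float) / observed)
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Toy cosine scores for 4 inputs and 2 classes; class 0 is the "central" class
# that wins every raw prediction (hypothetical numbers).
sims = [[0.80, 0.75], [0.82, 0.78], [0.79, 0.77], [0.85, 0.60]]
probs = softmax(sims)
adjusted = reweight_to_prior(probs, target_prior=[0.5, 0.5])

raw_pred = probs.argmax(axis=1).tolist()
new_pred = adjusted.argmax(axis=1).tolist()
print(raw_pred, new_pred)
```

In this toy batch the narrow wins of the dominant class flip, while the one confident prediction survives. The result depends heavily on the temperature and on the batch used to estimate the observed prior, which is part of why the approach is fragile.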

Are you suggesting putting ground-truth few-shot examples in the instruction? How would the class predictions be represented in that case? Must "Input:" be used, etc.? I could not find any code, and the benchmark library makes it hard to understand what is sent to the model as an instruction for classification benchmarks.

Could you share how you (the authors) would do multi-class classification?

Like SetFit?

Or would using SetFit with Instructor even solve this?

hongjin-su commented 10 months ago

Hi, you may try SetFit. In the benchmark, we take the embeddings from INSTRUCTOR, feed them into an additional classifier, e.g., a KNN classifier, and then get the predicted labels.
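A minimal sketch of that pipeline's last step, assuming cosine similarity and a majority vote over the k nearest training embeddings. It is not the benchmark's actual code: the function `knn_predict`, the value of `k`, and the toy vectors are hypothetical, and scikit-learn's `KNeighborsClassifier` would be the more usual choice in practice.

```python
from collections import Counter

import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=3):
    """Label each query by majority vote among its k most cosine-similar neighbours."""
    train = np.asarray(train_emb, dtype=float)
    query = np.asarray(query_emb, dtype=float)
    train = train / np.linalg.norm(train, axis=1, keepdims=True)
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    sims = query @ train.T                       # (n_query, n_train) cosines
    top_k = np.argsort(-sims, axis=1)[:, :k]     # indices of the k nearest
    predictions = []
    for row in top_k:
        votes = Counter(train_labels[i] for i in row)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy 2-d vectors standing in for INSTRUCTOR embeddings (hypothetical data).
train_emb = [[1.0, 0.0], [0.9, 0.2], [0.8, 0.3], [0.0, 1.0], [0.2, 0.9], [0.3, 0.8]]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
query_emb = [[0.95, 0.1], [0.1, 0.95]]

predictions = knn_predict(train_emb, train_labels, query_emb, k=3)
print(predictions)
```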

jmdetect commented 10 months ago

Thank you for the suggestion.

KNN for classification sounds good, a bit better than simply taking the closest.

Could you please explain what instructions you used for those embeddings, and how you instructed the model about the classes? What did you put into the ["Represent this sentence as a Science Title", "Trespassing amongst Bees: The dangers of scented flowers"] format?

hongjin-su commented 10 months ago

For the purpose of classification, you may also want to include the class information, so the model will understand classification better.
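As a sketch of what including class information might look like in the `[instruction, text]` pair format quoted earlier in the thread: the instruction wording, the class names, and the helper `build_pairs` below are hypothetical, not the instructions used in the benchmark.

```python
def build_pairs(texts, class_names, domain="sentence"):
    """Pair every text with one instruction that names the candidate classes."""
    instruction = (
        f"Represent the {domain} for classification into "
        f"{', '.join(class_names)}: "
    )
    return [[instruction, text] for text in texts]

# Hypothetical classes and inputs.
class_names = ["science", "sports", "politics"]
texts = [
    "Trespassing amongst Bees: The dangers of scented flowers",
    "The home side scored twice in the final ten minutes",
]
pairs = build_pairs(texts, class_names)
print(pairs[0])

# With the InstructorEmbedding package installed, pairs in this format can be
# embedded as shown in the project README (not executed here):
# from InstructorEmbedding import INSTRUCTOR
# model = INSTRUCTOR("hkunlp/instructor-large")
# embeddings = model.encode(pairs)
```

The same instruction would be used for both the labeled examples and the sentences to classify, so that all embeddings land in a comparable space before a classifier is fit on them.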

hongjin-su commented 6 months ago

Feel free to re-open the issue if you have any questions or comments!