xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0
1.87k stars, 135 forks

Example of classification inference (Zero Shot?) #79

Closed jmdetect closed 11 months ago

jmdetect commented 1 year ago

Hello,

The use of Instructor embeddings for classification is mentioned, and I have found a way to sort of do this: I include several ground truth examples for each of my 7 classes (I have a 7-class multiclass problem), then compute the cosine similarity between a new sentence and every ground truth example, and look at the mean cosine similarity per class.

The problem is that this gives little control over thresholds, and some natural example sentences simply embed close to most other sentences (such as short ones without many nouns or unique verbs) no matter what instruction you give ("Represent as a statement", etc.).

Are any examples of classification (similar to SetFit) available?
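A minimal sketch of the mean-cosine approach described above, assuming the embeddings have already been computed (for example with INSTRUCTOR's `model.encode`). The function names and the toy 2-D vectors are hypothetical and only stand in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine_per_class(query, examples_by_class):
    """Average the query's cosine similarity over each class's examples."""
    return {
        label: sum(cosine(query, e) for e in examples) / len(examples)
        for label, examples in examples_by_class.items()
    }

def predict(query, examples_by_class):
    """Pick the class with the highest mean cosine similarity."""
    scores = mean_cosine_per_class(query, examples_by_class)
    return max(scores, key=scores.get)

# Toy vectors standing in for ground truth embeddings (assumption).
examples_by_class = {
    "billing": [[1.0, 0.1], [0.9, 0.2]],
    "shipping": [[0.1, 1.0], [0.2, 0.8]],
}
prediction = predict([0.8, 0.3], examples_by_class)
```

Because every class score is an average over all of that class's examples, a sentence that sits near the centre of the embedding space gets similar scores for every class, which is the thresholding difficulty raised in this comment.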

hongjin-su commented 1 year ago

Did you mean including several ground truth examples in the instruction?

jmdetect commented 1 year ago

@hongjin-su thank you for asking about the technique.

I have tried: a) Putting information about the classes in the instruction, but with 0 examples (zero-shot). I then gather the embeddings for each of the classes, take the mean embedding per class, and assign a new sample to the class whose mean embedding is closest.
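Approach (a) amounts to a nearest-centroid classifier. A rough sketch, assuming embeddings are plain lists of floats and using hypothetical helper names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def fit_centroids(examples_by_class):
    """One mean embedding per class."""
    return {label: mean_vector(vs) for label, vs in examples_by_class.items()}

def nearest_centroid(query, centroids):
    """Return the class whose mean embedding is most similar to the query."""
    return max(centroids, key=lambda label: cosine(query, centroids[label]))

# Toy vectors standing in for real embeddings (assumption).
centroids = fit_centroids({
    "a": [[1.0, 0.0], [1.0, 0.2]],
    "b": [[0.0, 1.0]],
})
label = nearest_centroid([0.9, 0.1], centroids)
```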

b) Putting information about binary classification (class or not) in the instruction, like the positive/negative instructions found in a code sample for the sentiment task, and then picking the class closest to ground truths representing positives across each of the 7 classes at inference. That is, I get 7 embeddings of the new sample, one per class instruction, compare each to known positive examples for that class, and the class whose embedding has the smallest cosine distance to a ground truth positive example becomes the predicted class.

Either approach works fairly well (well above chance), but has big drawbacks. The main one is class boundaries. Some classes simply sit in the central embedding area anyway, and for that reason dominate predictions, even if the actual training set has a different distribution. I have worked around this problem before with something equivalent to the class-prior code used in some approaches: reweighting predictions after inference back toward the training set distribution, given how far the predicted distribution on the test set has moved from it. But this is really not an ideal solution.
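One possible reading of the prior reweighting mentioned above, as a sketch only. It assumes the per-class similarity scores have already been turned into probabilities (for example with a softmax), and `reweight_to_prior` is a hypothetical name, not code from this repository.

```python
def reweight_to_prior(probs, train_prior, eps=1e-12):
    """Rescale per-sample class probabilities so that classes which are
    over-predicted on the batch, relative to the training prior, are
    down-weighted. `probs` is a list of dicts mapping label -> probability."""
    labels = list(train_prior)
    n = len(probs)
    # Average predicted mass per class over the whole batch.
    batch_prior = {c: sum(p[c] for p in probs) / n for c in labels}
    adjusted = []
    for p in probs:
        weighted = {c: p[c] * train_prior[c] / max(batch_prior[c], eps) for c in labels}
        total = sum(weighted.values())
        adjusted.append({c: weighted[c] / total for c in labels})
    return adjusted

# Class "a" dominates the raw predictions but is rare in training (toy numbers).
probs = [
    {"a": 0.60, "b": 0.40},
    {"a": 0.70, "b": 0.30},
    {"a": 0.55, "b": 0.45},
]
adjusted = reweight_to_prior(probs, train_prior={"a": 0.3, "b": 0.7})
```

This only shifts the decision toward the training distribution; it does not fix the underlying overlap between classes in embedding space.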

Are you suggesting putting ground truth few-shot examples in the instruction? How should the class predictions be represented in that case? Must "Input:" be used, etc.? I could not find any code, and the benchmark library makes it hard to understand what is sent to the model as an instruction for classification benchmarks.

Is there a way to show how you/the authors would do multi-class classification?

Like SetFit?

Or would using SetFit with Instructor even solve this?

hongjin-su commented 1 year ago

Hi, you may try SetFit. In the benchmark, we take the embeddings from INSTRUCTOR, feed them into an additional classifier, e.g., a KNN classifier, and then get the predicted labels.
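A minimal sketch of the "embeddings plus KNN" idea, not the benchmark's actual code. It assumes the training and query embeddings were computed beforehand, and uses a hypothetical `knn_predict` helper with cosine similarity and a majority vote.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority vote over the k training embeddings most similar to the query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy vectors standing in for labelled training embeddings (assumption).
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]]
train_labels = ["a", "a", "b", "b", "b"]
predicted = knn_predict([1.0, 0.05], train_vectors, train_labels, k=3)
```

With scikit-learn available, `KNeighborsClassifier(n_neighbors=k, metric="cosine")` fitted on the same embeddings should play the same role.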

jmdetect commented 1 year ago

Thank you for the suggestion.

KNN for classification sounds good, a bit better than simply taking the closest example.

Please can you explain what instructions you used for those embeddings? How did you instruct it about the classes? What did you put into the ["Represent this sentence as a Science Title", "Trespassing amongst Bees: The dangers of scented flowers"] format?

hongjin-su commented 1 year ago

For the purpose of classification, you may also want to include the class information, so the model will understand classification better.

hongjin-su commented 11 months ago

Feel free to re-open the issue if you have any questions or comments!