xlang-ai / instructor-embedding

[ACL 2023] One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Apache License 2.0

Fine-tuning for sentence comparison #69

Open Mukish45 opened 1 year ago

Mukish45 commented 1 year ago

Thanks a lot for open-sourcing your excellent work. I would like to further fine-tune your model for comparing two sentences and getting their similarity score. You designed the MEDI dataset as a general format for retrieval, pairwise classification, clustering, etc., with {query, pos, neg, task_name}. For my use case, I want to compare two sentences by encoding them and computing cosine similarity. What should the format of my training set be? I think neg sentences might not be needed for this (if they are needed, why?).

Please assist me with this. Thank you
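For reference, here is roughly what I have in mind for the comparison step, following the encoding example in your README (the instruction text is just a placeholder I made up):

```python
# Rough sketch of the intended two-sentence comparison using the public
# INSTRUCTOR API; the instruction wording is only a placeholder.
from sklearn.metrics.pairwise import cosine_similarity
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')

instruction = "Represent the sentence for semantic similarity:"
embeddings = model.encode([
    [instruction, "The cat sits on the mat."],
    [instruction, "A cat is resting on a rug."],
])
# cosine_similarity expects 2D arrays, so keep each row as a (1, dim) slice.
score = cosine_similarity(embeddings[0:1], embeddings[1:2])[0][0]
print(score)
```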

hongjin-su commented 1 year ago

Hi, Thanks a lot for your interest in the INSTRUCTOR!

You may follow https://github.com/HKUNLP/instructor-embedding#data to prepare the training set. The negative sentences are necessary for training, because the model needs to learn not only to minimize the distance between positive pairs but also to maximize the distance between negative pairs.
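For concreteness, a single training example in that format could look roughly like the sketch below. The instruction wording, sentences, and task name are only illustrative; each of query, pos, and neg is an [instruction, text] pair:

```python
# Hypothetical MEDI-format training example for a sentence-similarity task.
# All strings here are made up for illustration.
example = {
    "query": ["Represent the sentence for semantic similarity:",
              "A man is playing a guitar on stage."],
    "pos":   ["Represent the sentence for semantic similarity:",
              "Someone performs a guitar solo in front of an audience."],
    "neg":   ["Represent the sentence for semantic similarity:",
              "The stock market fell sharply this morning."],
    "task_name": "my_sentence_similarity_task",
}
```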

Mukish45 commented 1 year ago

@hongjin-su Thank you for the clear explanation. I have one more question: can we make the model run on multiple threads? It currently takes 1 second to encode 2 sentences, and I want to increase the encoding speed. If there is a way, please let me know.

hongjin-su commented 9 months ago

An easy way to achieve the same effect would be to split the data. If you split the data into different pieces, you can encode the pieces separately without worrying about communication between threads.
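A minimal sketch of that split-and-encode approach, assuming the public encode API; the instruction text, shard count, and batch size are arbitrary placeholders:

```python
# Sketch: split the sentences into shards and encode each shard in its own
# process, with no inter-process communication during encoding.
from multiprocessing import Pool

import numpy as np
from InstructorEmbedding import INSTRUCTOR

INSTRUCTION = "Represent the sentence for semantic similarity:"  # placeholder

def encode_shard(sentences):
    # Each worker loads its own copy of the model, so shards are
    # encoded fully independently.
    model = INSTRUCTOR('hkunlp/instructor-large')
    return model.encode([[INSTRUCTION, s] for s in sentences], batch_size=32)

if __name__ == '__main__':
    sentences = ["First sentence.", "Second sentence."]  # your full list
    n_shards = 4
    shards = [list(s) for s in np.array_split(sentences, n_shards)]
    # On GPU, prefer the 'spawn' start method or one process per device.
    with Pool(n_shards) as pool:
        parts = pool.map(encode_shard, shards)
    embeddings = np.vstack(parts)
```

Note also that `encode` already vectorizes over a list of inputs, so passing many sentences at once with a larger `batch_size` is often the simplest speedup before reaching for multiple processes.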