Open Mukish45 opened 1 year ago
Hi, Thanks a lot for your interest in the INSTRUCTOR!
You may follow https://github.com/HKUNLP/instructor-embedding#data to prepare the training set. The negative sentences are necessary for training, because the model not only needs to learn to minimize the distance between positive pairs, but also to maximize the distance between negative pairs.
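For reference, a single training example in the {query, pos, neg, task_name} format could be sketched as below. This is a hedged illustration: the instruction wording and the task name are placeholders I made up, not actual MEDI entries — check the repo link above for the exact schema.

```python
# One MEDI-style training example (illustrative, not real MEDI data).
# Each of query/pos/neg pairs an instruction string with the sentence itself.
example = {
    "query": ["Represent the sentence for retrieving duplicate sentences: ",
              "The cat sat on the mat."],
    "pos":   ["Represent the sentence for retrieving duplicate sentences: ",
              "A cat was sitting on the mat."],
    "neg":   ["Represent the sentence for retrieving duplicate sentences: ",
              "Stock prices fell sharply on Monday."],
    "task_name": "example_duplicate_detection",
}
```

The `neg` entry is what gives the contrastive loss something to push away from, which is why it cannot simply be omitted.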
@hongjin-su Thank you for the clear explanation. I have one more doubt: can we make the model run on multiple threads? It currently takes 1 second to encode 2 sentences, and I want to increase the encoding speed. If there is a way, please let me know.
An easy way to achieve the same effect would be to split the data. If you split the data into different pieces, you can encode each piece separately without worrying about communication between different threads.
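A minimal sketch of this split-and-encode-separately idea is below. The `encode_batch` function here is a hypothetical stand-in for the model's `encode` call, so the sketch runs without the model installed; in practice each worker would call INSTRUCTOR's encoder on its own chunk (and with PyTorch, larger batch sizes on a GPU are often a bigger win than threading).

```python
from concurrent.futures import ThreadPoolExecutor

def encode_batch(sentences):
    # Placeholder for the real encoder; returns sentence lengths so the
    # sketch is self-contained. Replace with model.encode(sentences) in use.
    return [len(s) for s in sentences]

def chunk(data, n):
    # Split data into n near-equal contiguous pieces.
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

sentences = ["first sentence", "second sentence", "third", "fourth one"]
pieces = chunk(sentences, 2)

# Encode the pieces concurrently; map preserves the original order,
# so the flattened results line up with the input sentences.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = [emb for batch in pool.map(encode_batch, pieces) for emb in batch]
```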
Thanks a lot for open-sourcing your excellent work. I would like to fine-tune your model further for comparing two sentences and getting their similarity score. You made the MEDI dataset a general format for retrieval, pairwise classification, clustering, etc., with {query, pos, neg, task_name}. For my use case, I want to compare two sentences by encoding them and computing cosine similarity. What should the format of my training set be? I think neg sentences might not be needed for this (if they are, why?).
Please assist me with this. Thank you
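For the inference side of this use case — comparing two sentences once they are encoded — cosine similarity can be computed as in the sketch below. The embedding values are stand-ins; in practice they would come from the model's `encode` call.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; real embeddings would come from the encoder.
emb_a = [0.1, 0.3, 0.5]
emb_b = [0.1, 0.3, 0.5]
score = cosine_similarity(emb_a, emb_b)  # 1.0 for identical vectors, up to rounding
```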