Closed by jordane95 11 months ago
Hi, thanks a lot for your interest in INSTRUCTOR!
In the MEDI data, we have one positive and one hard negative for each instance. During training, each instance uses its own negative as the hard negative, and the other instances' negatives in the same batch as in-batch negatives. We describe the construction of the MEDI dataset in Section 2.3 of our paper, including the hard-negative mining mechanism.
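The scheme above can be sketched as a standard in-batch contrastive (InfoNCE-style) loss. This is only an illustration of the mechanism described in the reply, not INSTRUCTOR's actual training code: the function name, the temperature value, and the choice to score queries against all positives *and* negatives in the batch are assumptions.

```python
import numpy as np

def in_batch_contrastive_loss(q, pos, neg, temperature=0.05):
    """Sketch of contrastive training with hard and in-batch negatives.

    q, pos, neg: (B, d) arrays of query, positive, and hard-negative
    embeddings. Query i's own negative is its hard negative; every other
    row of `pos` and `neg` in the batch acts as an in-batch negative.
    (Illustrative only; the exact candidate pool and temperature are
    assumptions, not taken from the INSTRUCTOR paper.)
    """
    # Candidate pool: B positives followed by B hard negatives -> (2B, d)
    candidates = np.concatenate([pos, neg], axis=0)
    # Similarity of each query against every candidate -> (B, 2B)
    scores = q @ candidates.T / temperature
    # The correct candidate for query i is its own positive, at column i.
    labels = np.arange(len(q))
    # Softmax cross-entropy over the candidate pool (numerically stable).
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

# Tiny example: batch of 2 instances with 4-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
pos = q + 0.01 * rng.normal(size=(2, 4))   # positives close to queries
neg = rng.normal(size=(2, 4))              # hard negatives (random here)
loss = in_batch_contrastive_loss(q, pos, neg)
print(float(loss))
```

Because the loss is a softmax cross-entropy, a larger batch automatically supplies more in-batch negatives per query, which is why each instance only needs to store a single mined hard negative.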
In Section 2.3 and Table 5 of our paper, we have also listed all the data sources used to construct the MEDI data.
Feel free to add any further questions or comments!
Please re-open the issue if you have any questions or comments!
Hi, awesome work on text embeddings!
After reading your paper and the code, I have a few questions.
As stated in the paper (Section 2.3, data construction):
But in the data downloaded from the link in your repo, each training instance from each task is accompanied by exactly one positive and one negative. Since some datasets from embedding-training-data do not contain negatives, I'm wondering how the negatives are sampled: randomly, or in the same way as for the superNI datasets? Also, the data-construction code for the 300 datasets from superNI is missing.
In addition, I think the current checkpoint differs from the first released one, since it's trained with hard negatives. But details about how the hard negatives are sampled are missing...
Finally, according to the paper, many tasks are subsampled to balance the datasets. Would you mind sharing the full data for each data source, with all the hard negatives? Thanks.